diff mbox series

mce: prevent concurrent polling of MCE events

Message ID 20230515143225.GC4090740@cathedrallabs.org (mailing list archive)
State New, archived
Headers show
Series mce: prevent concurrent polling of MCE events | expand

Commit Message

Aristeu Rozanski May 15, 2023, 2:32 p.m. UTC
Error injection in modern HP machines with CMCI disabled will cause the
injected MCE to be found only by polling. Because these newer machines have a
big number of CPUs per package, it makes a lot more likely for multiple
CPUs polling IMC registers (that are shared in the same package) at same time,
causing multiple reports of the same MCE.

Signed-off-by: Aristeu Rozanski <aris@ruivo.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: linux-edac@vger.kernel.org

Comments

Borislav Petkov May 15, 2023, 2:52 p.m. UTC | #1
On Mon, May 15, 2023 at 10:32:25AM -0400, Aristeu Rozanski wrote:
> Error injection in modern HP machines with CMCI disabled will cause the
> injected MCE to be found only by polling. Because these newer machines have a
> big number of CPUs per package, it makes a lot more likely for multiple
> CPUs polling IMC registers (that are shared in the same package) at same time,

So the MCE code already has some notion of shared banks and tries to pay
attention to it.

I think it'll be cleaner if machine_check_poll() paid attention to that
and polled only on the smallest CPU number which shares the bank instead
of doing an unconditional wholesale spinlock grabbing which is not
really needed...

Thx.
Tony Luck May 15, 2023, 5:18 p.m. UTC | #2
> So the MCE code already has some notion of shared banks and tries to pay
> attention to it.

The MCE code only knows sharing scope if CMCI is enabled. The algorithm to
determine ownership of a shared bank is "first to find the bank is the owner"
as determined by successful setting of bit 30 in IA32_MCi_CTL2. There
isn't any other practical way to determine scope of bank sharing.

In Aristeu's case with the HP systems CMCI is disabled. I saw the same
on a system here when booted with "mce=no_cmci". Twelve CPUs (out
of 144) were polling on the same jiffy and reported the same error.

-Tony
Borislav Petkov May 15, 2023, 6:30 p.m. UTC | #3
On Mon, May 15, 2023 at 05:18:06PM +0000, Luck, Tony wrote:
> The MCE code only knows sharing scope if CMCI is enabled. The algorithm to
> determine ownership of a shared bank is "first to find the bank is the owner"
> as determined by successful setting of bit 30 in IA32_MCi_CTL2. There
> isn't any other practical way to determine scope of bank sharing.

I'm not talking about this ownership - I'm talking about how many
logical cores get to see the same error info because the bank is
replicated across all of them in *hardware*.

> In Aristeu's case with the HP systems CMCI is disabled. I saw the same
> on a system here when booted with "mce=no_cmci". Twelve CPUs (out
> of 144) were polling on the same jiffy and reported the same error.

Is CMCI disabled a valid use case or is this someone doing testing?

Thx.
Tony Luck May 15, 2023, 7:08 p.m. UTC | #4
> I'm not talking about this ownership - I'm talking about how many
> logical cores get to see the same error info because the bank is
> replicated across all of them in *hardware*.

The hardware doesn't enumerate that banks are shared, or which
logical CPUs share each bank. The sharing scope can, and has,
changed from one generation to another. E.g. Banks 0 and 1 were
core scoped from Nehalem until Cascade Lake. But Icelake and
subsequent (so far) CPUs made them thread scoped.

-Tony
Borislav Petkov May 15, 2023, 7:44 p.m. UTC | #5
On Mon, May 15, 2023 at 07:08:01PM +0000, Luck, Tony wrote:
> > I'm not talking about this ownership - I'm talking about how many
> > logical cores get to see the same error info because the bank is
> > replicated across all of them in *hardware*.
> 
> The hardware doesn't enumerate that banks are shared, or which
> logical CPUs share each bank. The sharing scope can, and has,
> changed from one generation to another. E.g. Banks 0 and 1 were
> core scoped from Nehalem until Cascade Lake. But Icelake and
> subsequent (so far) CPUs made them thread scoped.

So this unconditional serialization is just gross.  This is a per-cpu
timer grabbing a global lock. It is going to make unrelated threads all
wait on a single lock while polling, for nothing.

On a huge box with gazillion cores and when the timers fire just at the
right time, they're all going to hit that lock real hard.

There has to be a better way...

Also, from my previous mail: Is CMCI disabled a valid use case or is
this someone doing testing?
Tony Luck May 15, 2023, 8:07 p.m. UTC | #6
> So this unconditional serialization is just gross.  This is a per-cpu
> timer grabbing a global lock. It is going to make unrelated threads all
> wait on a single lock while polling, for nothing.
>
> On a huge box with gazillion cores and when the timers fire just at the
> right time, they're all going to hit that lock real hard.

Linux does try to avoid all the CPUs taking the polling interrupt at the
same time. See my earlier e-mail:

https://lore.kernel.org/all/ZF18kdWCWKqFc23p@agluck-desk3.sc.intel.com/

but there are often more CPUs than HZ ticks to spread things out. So still
get a few collisions.

> There has to be a better way...

Options:
1) Divide up the lock.
	1a) One lock per machine check bank
	1b) One lock per socket
	1c) Combo of 1a & 1b
2) Enumerate the bank sharing with a big table per CPU model number
	[No, we really don't want to do that!]
3) Allow multiple CPUs to read the same error. Add a filter to suppress duplicates in the logging path.

> Also, from my previous mail: Is CMCI disabled a valid use case or is
> this someone doing testing?

I disabled CMCI for debugging reasons. Aristeu needs to comment on
his use case.

-Tony
Aristeu Rozanski May 15, 2023, 8:20 p.m. UTC | #7
On Mon, May 15, 2023 at 08:07:10PM +0000, Luck, Tony wrote:
> I disabled CMCI for debugging reasons. Aristeu needs to comment on
> his use case.

Partner has a toggle in BIOS and reported the issue. I suspect it has to do
with a big mutual customer that wants to reduce CE storms caused CPU spikes
to a minimal, but I'll get more details.
Tony Luck May 15, 2023, 8:27 p.m. UTC | #8
> Partner has a toggle in BIOS and reported the issue. I suspect it has to do
> with a big mutual customer that wants to reduce CE storms caused CPU spikes
> to a minimal, but I'll get more details.

If CMCI storms are their problem ... perhaps this is an answer:

https://lore.kernel.org/all/20230411173841.70491-1-tony.luck@intel.com/

-Tony
Aristeu Rozanski May 15, 2023, 8:32 p.m. UTC | #9
On Mon, May 15, 2023 at 08:27:19PM +0000, Luck, Tony wrote:
> If CMCI storms are their problem ... perhaps this is an answer:
> 
> https://lore.kernel.org/all/20230411173841.70491-1-tony.luck@intel.com/

Apparently they're disabling CMCI for CE and UCNA events when OS-first is
enabled for two entire lines of products. So there might be more behind
this than I suspected. I just sent a message to their engineers to find out
what's going on.

Now on the CMCI storm mitigation, wouldn't a system that got a storm be
subject to this same issue of duplicated events since it switches to polling?
Tony Luck May 15, 2023, 8:40 p.m. UTC | #10
> Now on the CMCI storm mitigation, wouldn't a system that got a storm be
> subject to this same issue of duplicated events since it switches to polling?

Current code switches to polling. But I think that it makes use of the bank ownership
information that it derived when CMCI was originally enabled. It wouldn't surprise me
if things go wrong if you mix in CPU hotplug with a storm and offline the CPU that
"owns" a bank. Perhaps the ownership will get passed to another CPU in the same
scope. But I'm not sure.

Those new patches don't switch to polling. They just reset the threshold in MCi_CTL2
to a very big value to throttle the number of interrupts for corrected errors, while still
processing the UCNA errors that are also signaled using CMCI right away.

-Tony
Borislav Petkov May 16, 2023, 5:08 p.m. UTC | #11
On Mon, May 15, 2023 at 04:32:20PM -0400, Aristeu Rozanski wrote:
> On Mon, May 15, 2023 at 08:27:19PM +0000, Luck, Tony wrote:
> > If CMCI storms are their problem ... perhaps this is an answer:
> > 
> > https://lore.kernel.org/all/20230411173841.70491-1-tony.luck@intel.com/
> 
> Apparently they're disabling CMCI for CE and UCNA events when OS-first is
> enabled for two entire lines of products. So there might be more behind
> this than I suspected. I just sent a message to their engineers to find out
> what's going on.

Might also ask them to test Tony's patches, see whether they help.

Thx.
Aristeu Rozanski May 23, 2023, 2:15 p.m. UTC | #12
On Mon, May 15, 2023 at 08:07:10PM +0000, Luck, Tony wrote:
> I disabled CMCI for debugging reasons. Aristeu needs to comment on
> his use case.

Got an answer from them. They disable it in two lines of products
(one uses IceLake, the other uses Sapphire Rapids) and in these lines
they can use thresholding to signal UCNA without signaling any corrected
events and it won't work with CMCI enabled. They did make Intel aware of
it (maybe you heard details about it? If not I can get them in contact with
you) but it's not certain if this can be fixed or not.
Aristeu Rozanski June 4, 2023, 4:04 p.m. UTC | #13
On Tue, May 23, 2023 at 10:15:23AM -0400, Aristeu Rozanski wrote:
> On Mon, May 15, 2023 at 08:07:10PM +0000, Luck, Tony wrote:
> > I disabled CMCI for debugging reasons. Aristeu needs to comment on
> > his use case.
> 
> Got an answer from them. They disable it in two lines of products
> (one uses IceLake, the other uses Sapphire Rapids) and in these lines
> they can use thresholding to signal UCNA without signaling any corrected
> events and it won't work with CMCI enabled. They did make Intel aware of
> it (maybe you heard details about it? If not I can get them in contact with
> you) but it's not certain if this can be fixed or not.

Tony?
Tony Luck June 5, 2023, 3:33 p.m. UTC | #14
>> > I disabled CMCI for debugging reasons. Aristeu needs to comment on
>> > his use case.
>> 
>> Got an answer from them. They disable it in two lines of products
>> (one uses IceLake, the other uses Sapphire Rapids) and in these lines
>> they can use thresholding to signal UCNA without signaling any corrected
>> events and it won't work with CMCI enabled. They did make Intel aware of
>> it (maybe you heard details about it? If not I can get them in contact with
>> you) but it's not certain if this can be fixed or not.

Since mce=no_cmci is a supported mode, and end-users are making use
of it on production systems ... I think Linux needs a patch to prevent multiple
CPUs logging the same error.

I don't have any other ideas on how to do that besides Aristeu's patch to add
a spinlock to prevent simultaneous access.  This isn't a hot path, and in most
cases not contended, so I don't see much downside.

-Tony
Borislav Petkov June 5, 2023, 5:41 p.m. UTC | #15
On Mon, Jun 05, 2023 at 03:33:48PM +0000, Luck, Tony wrote:
> Since mce=no_cmci is a supported mode, and end-users are making use
> of it on production systems ... I think Linux needs a patch to prevent multiple
> CPUs logging the same error.
> 
> I don't have any other ideas

I have some:

1. One can teach mce_gen_pool_add() to scan what's already on the list
   and drop duplicate errors. Not sure if it is worth it.

or

2. Carve out that abysmal locking into a function called
   poll_cmci_disabled() or so, run this function only on CMCI machines
   and make sure the use case is properly explained in the commit
   message.

I have no clue what "they can use thresholding to signal UCNA without
signaling any corrected events and it won't work with CMCI enabled."

I thought HP use firmware first and all that magic EFI gunk to do RAS
and don't rely on OS facilities. Oh well...
Tony Luck June 5, 2023, 5:58 p.m. UTC | #16
>> I don't have any other ideas
>
> I have some:
>
> 1. One can teach mce_gen_pool_add() to scan what's already on the list
>    and drop duplicate errors. Not sure if it is worth it.

That would need to take care to only strip out truly duplicated reports, but
leave in genuine repeats of errors from the same location. I agree this path
isn't worth taking.

> or
>
> 2. Carve out that abysmal locking into a function called
>    poll_cmci_disabled() or so, run this function only on CMCI machines
>    and make sure the use case is properly explained in the commit
>    message.

How much code do you want to duplicate? machine_check_poll() is ~100
lines, most of in the racy section from:

                m.status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS));
to
                mce_wrmsrl(mca_msr_reg(i, MCA_STATUS), 0);

That looks like long term maintenance pain to keep the CMCI on/off
versions in step.

> I have no clue what "they can use thresholding to signal UCNA without
> signaling any corrected events and it won't work with CMCI enabled."

I can't parse that either. Maybe rooted in the fact that UCNA ignores the
threshold in MCi_CTL2 ... but maybe there's some missing description
of what the SMI handler is doing.

> I thought HP use firmware first and all that magic EFI gunk to do RAS
> and don't rely on OS facilities. Oh well...

HPE are big fans of firmware first for corrected errors. It's more complicated
for uncorrected ones as the OS really needs to know about those.

-Tony
Aristeu Rozanski June 5, 2023, 7:08 p.m. UTC | #17
On Mon, Jun 05, 2023 at 07:41:04PM +0200, Borislav Petkov wrote:
> I have no clue what "they can use thresholding to signal UCNA without
> signaling any corrected events and it won't work with CMCI enabled."

That's pretty much exactly what was passed to me. I'll ask who the
Intel contact was and inform Tony so they can talk internally.

--
Aristeu
Borislav Petkov June 5, 2023, 7:30 p.m. UTC | #18
On Mon, Jun 05, 2023 at 05:58:40PM +0000, Luck, Tony wrote:
> That would need to take care to only strip out truly duplicated reports, but
> leave in genuine repeats of errors from the same location.

Yeah, timestamps will make them unique.

> I agree this path isn't worth taking.

I have a feeling we might need to dedup at some point but we can do that
in luserspace.

> How much code do you want to duplicate?

The usual. See untested diff below.

> I can't parse that either. Maybe rooted in the fact that UCNA ignores the
> threshold in MCi_CTL2 ... but maybe there's some missing description
> of what the SMI handler is doing.

If you don't know what this does either, then this is going nowhere
before it is properly explained.

> HPE are big fans of firmware first for corrected errors. It's more complicated
> for uncorrected ones as the OS really needs to know about those.

Bah, the firmware is so powerful and perfect. What does it need the OS
for? </sarcasm>

---

Btw, why was that spinlock raw?

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 22dfcb2adcd7..bcdf5b52003c 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1595,6 +1595,13 @@ static unsigned long mce_adjust_timer_default(unsigned long interval)
 
 static unsigned long (*mce_adjust_timer)(unsigned long interval) = mce_adjust_timer_default;
 
+static void poll_cmci_disabled(void)
+{
+	machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
+}
+
+static void (*poll_cmci_disabled)(void) = poll_cmci_disabled_default;
+
 static void __start_timer(struct timer_list *t, unsigned long interval)
 {
 	unsigned long when = jiffies + interval;
@@ -1618,7 +1625,7 @@ static void mce_timer_fn(struct timer_list *t)
 	iv = __this_cpu_read(mce_next_interval);
 
 	if (mce_available(this_cpu_ptr(&cpu_info))) {
-		machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
+		poll_cmci_disabled();
 
 		if (mce_intel_cmci_poll()) {
 			iv = mce_adjust_timer(iv);
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 95275a5e57e0..4722a57baf1e 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -71,6 +71,8 @@ enum {
 	CMCI_STORM_SUBSIDED,
 };
 
+static DEFINE_SPINLOCK(timer_fn_lock);
+
 static atomic_t cmci_storm_on_cpus;
 
 static int cmci_supported(int *banks)
@@ -426,12 +428,22 @@ void cmci_disable_bank(int bank)
 	raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
 }
 
+static void poll_cmci_disabled_intel(void)
+{
+	raw_spin_lock(&timer_fn_lock);
+	machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
+	raw_spin_unlock(&timer_fn_lock);
+}
+
 void intel_init_cmci(void)
 {
 	int banks;
 
-	if (!cmci_supported(&banks))
+	if (!cmci_supported(&banks)) {
+		if (mca_cfg.cmci_disabled)
+			poll_cmci_disabled = poll_cmci_disabled_intel;
 		return;
+	}
 
 	mce_threshold_vector = intel_threshold_interrupt;
 	cmci_discover(banks);
Tony Luck June 5, 2023, 7:37 p.m. UTC | #19
>> How much code do you want to duplicate?
>
> The usual. See untested diff below.

That looks like it covers the "mce=cmci_off" case. What about the case where
CMCI is disabled temporarily because of a storm?

-Tony
Borislav Petkov June 5, 2023, 7:43 p.m. UTC | #20
On Mon, Jun 05, 2023 at 07:37:37PM +0000, Luck, Tony wrote:
> That looks like it covers the "mce=cmci_off" case. What about the case where
> CMCI is disabled temporarily because of a storm?

Then the decision which function to call will have to be dynamic and not
with function pointers assigned once, statically.

This was just an example about how it could be done.
Aristeu Rozanski June 5, 2023, 8:10 p.m. UTC | #21
On Mon, Jun 05, 2023 at 09:30:00PM +0200, Borislav Petkov wrote:
> Btw, why was that spinlock raw?

Didn't want RT code thinking it could preempt us but it doesn't really
matter.
Aristeu Rozanski June 5, 2023, 8:33 p.m. UTC | #22
On Mon, Jun 05, 2023 at 09:30:00PM +0200, Borislav Petkov wrote:
> I have a feeling we might need to dedup at some point but we can do that
> in luserspace.

Problem with throwing this userspace is that it won't get rid of the
duplication on the kernel log.
Borislav Petkov June 5, 2023, 8:56 p.m. UTC | #23
On Mon, Jun 05, 2023 at 04:33:15PM -0400, Aristeu Rozanski wrote:
> Problem with throwing this userspace is that it won't get rid of the
> duplication on the kernel log.

The kernel log never was and is a source to be scraped for hw errors.
The right thing is called tracepoints and rasdaemon uses them already.
Aristeu Rozanski June 5, 2023, 9:01 p.m. UTC | #24
On Mon, Jun 05, 2023 at 10:56:58PM +0200, Borislav Petkov wrote:
> On Mon, Jun 05, 2023 at 04:33:15PM -0400, Aristeu Rozanski wrote:
> > Problem with throwing this userspace is that it won't get rid of the
> > duplication on the kernel log.
> 
> The kernel log never was and is a source to be scraped for hw errors.
> The right thing is called tracepoints and rasdaemon uses them already.

Yes, but I'm talking about flooding the log with duplicates.
Borislav Petkov June 5, 2023, 9:06 p.m. UTC | #25
On Mon, Jun 05, 2023 at 05:01:08PM -0400, Aristeu Rozanski wrote:
> Yes, but I'm talking about flooding the log with duplicates.

And?

We don't *have* to log them if someone's too confused by staring at
something someone should not be staring at in the first place.

But let's see how valid that "use case" actually is first...
Tony Luck June 5, 2023, 9:29 p.m. UTC | #26
> But let's see how valid that "use case" actually is first...

I'm going to meet with the users later this week to find out what they are doing.

-Tony
Aristeu Rozanski June 5, 2023, 9:58 p.m. UTC | #27
On Mon, Jun 05, 2023 at 11:06:10PM +0200, Borislav Petkov wrote:
> On Mon, Jun 05, 2023 at 05:01:08PM -0400, Aristeu Rozanski wrote:
> > Yes, but I'm talking about flooding the log with duplicates.
> 
> And?
> 
> We don't *have* to log them if someone's too confused by staring at
> something someone should not be staring at in the first place.

It makes support people's lives worse by reducing the amount of useful
information in dmesg, when it can be avoided, in the case of duplicates.
Getting rid of the messages completely would be harmful as well, not
always customers run rasdaemon or provide the database with events, which
sometimes can be useful as hints.

> But let's see how valid that "use case" actually is first...

Of course. Tony is already aware who inside Intel talked to them and
they'll talk.

--
Aristeu
Borislav Petkov June 6, 2023, 8:25 a.m. UTC | #28
On Mon, Jun 05, 2023 at 05:58:39PM -0400, Aristeu Rozanski wrote:
> It makes support people's lives worse by reducing the amount of useful
> information in dmesg, when it can be avoided, in the case of duplicates.
> Getting rid of the messages completely would be harmful as well, not
> always customers run rasdaemon or provide the database with events, which
> sometimes can be useful as hints.

Lemme make sure you realize what you're arguing for: you want to not log
MCE duplicates in a ring buffer which can get overwritten due to a bunch
of reasons and as such is best effort only anyway.

Hell, it can get overwritten even by non-duplicate MCEs if they're just
enough many to overflow the buffer's size.

So relying on dmesg for any serious error counting or whatnot is
a exercise in futility.
Aristeu Rozanski June 6, 2023, 2 p.m. UTC | #29
On Tue, Jun 06, 2023 at 10:25:41AM +0200, Borislav Petkov wrote:
> Lemme make sure you realize what you're arguing for: you want to not log
> MCE duplicates in a ring buffer which can get overwritten due to a bunch
> of reasons and as such is best effort only anyway.
> 
> Hell, it can get overwritten even by non-duplicate MCEs if they're just
> enough many to overflow the buffer's size.
> 
> So relying on dmesg for any serious error counting or whatnot is
> a exercise in futility.

I'm not sure when I made you believe I want to do error counting or
error parsing from the kernel log.

I started saying that we need to avoid, when possible, to waste kernel
log. Duplicates are a waste. You mentioned not even logging MCEs at all
and I said that I believe it's not right, as it might be useful to be
aware of MCEs being delivered before a crash.

--
Aristeu
Borislav Petkov June 6, 2023, 2:08 p.m. UTC | #30
On Tue, Jun 06, 2023 at 10:00:11AM -0400, Aristeu Rozanski wrote:
> I started saying that we need to avoid, when possible, to waste kernel
> log. Duplicates are a waste. You mentioned not even logging MCEs at all
> and I said that I believe it's not right, as it might be useful to be
> aware of MCEs being delivered before a crash.

You don't have to summarize the whole thread - I know what we're talking
about.

And just to give you another example: if the MCE which caused that crash
gets overwritten in dmesg, your saved kernel log is useless as you want
the *first* one logged.

That's all I'm saying - the kernel log is best effort - nothing more.

Independent of that, yes, we will try not to pollute it with duplicates
once we know what the issue exactly is which makes people disable CMCI.

Thx.
Tony Luck June 9, 2023, 12:26 a.m. UTC | #31
> Independent of that, yes, we will try not to pollute it with duplicates
> once we know what the issue exactly is which makes people disable CMCI.

I talked to the vendor today. They have customers who want the OS to see
all reported corrected errors, but they also need to have their firmware keep
logs of all uncorrected errors (so their field engineers can replace the DIMMs
that are throwing the UC errors).

Intel machine check architecture has conflated processing of corrected errors
and "UCNA" (the log that comes from the memory controller when ECC says
the error is uncorrected ... this is distinct from the log that comes from the core
when the poison data is consumed and signals a machine check).

So the closest they can get to meeting both the customer requirement and
their service requirement is to have their BIOS configure to get an SMI for
the UCNA *and* tell the users to run in mce=no_cmci mode. If Linux enables
CMCI, then the h/w will also generate an SMI for every corrected error. Neither
the customer, nor the vendor wants that overhead.

Bottom line: This is a legitimate, production, use case.

-Tony
Borislav Petkov June 9, 2023, 10:17 a.m. UTC | #32
On Fri, Jun 09, 2023 at 12:26:56AM +0000, Luck, Tony wrote:
> > Independent of that, yes, we will try not to pollute it with duplicates
> > once we know what the issue exactly is which makes people disable CMCI.
> 
> I talked to the vendor today. They have customers who want the OS to see
> all reported corrected errors, but they also need to have their firmware keep
> logs of all uncorrected errors (so their field engineers can replace the DIMMs
> that are throwing the UC errors).
> 
> Intel machine check architecture has conflated processing of corrected errors
> and "UCNA" (the log that comes from the memory controller when ECC says
> the error is uncorrected ... this is distinct from the log that comes from the core
> when the poison data is consumed and signals a machine check).

I'd expect poison data consumption not to be a "UCNA" - i.e., "No
Action" required :)

> So the closest they can get to meeting both the customer requirement and
> their service requirement is to have their BIOS configure to get an SMI for
> the UCNA *and* tell the users to run in mce=no_cmci mode. If Linux enables
> CMCI, then the h/w will also generate an SMI for every corrected error. Neither
> the customer, nor the vendor wants that overhead.
> 
> Bottom line: This is a legitimate, production, use case.

Ok, that makes more sense.

Thanks.
Yazen Ghannam June 9, 2023, 3 p.m. UTC | #33
On 6/9/23 6:17 AM, Borislav Petkov wrote:
> On Fri, Jun 09, 2023 at 12:26:56AM +0000, Luck, Tony wrote:
>>> Independent of that, yes, we will try not to pollute it with duplicates
>>> once we know what the issue exactly is which makes people disable CMCI.
>>
>> I talked to the vendor today. They have customers who want the OS to see
>> all reported corrected errors, but they also need to have their firmware keep
>> logs of all uncorrected errors (so their field engineers can replace the DIMMs
>> that are throwing the UC errors).
>>
>> Intel machine check architecture has conflated processing of corrected errors
>> and "UCNA" (the log that comes from the memory controller when ECC says
>> the error is uncorrected ... this is distinct from the log that comes from the core
>> when the poison data is consumed and signals a machine check).
> 
> I'd expect poison data consumption not to be a "UCNA" - i.e., "No
> Action" required :)
>

So "UCNA" is like the AMD "Deferred" severity it seems. How is this
different from "Action Optional"? I've been equating DFR and AO.

>> So the closest they can get to meeting both the customer requirement and
>> their service requirement is to have their BIOS configure to get an SMI for
>> the UCNA *and* tell the users to run in mce=no_cmci mode. If Linux enables
>> CMCI, then the h/w will also generate an SMI for every corrected error. Neither
>> the customer, nor the vendor wants that overhead.
>>
>> Bottom line: This is a legitimate, production, use case.
> 
> Ok, that makes more sense.
>

Please do restrict this change to Intel, or at least !AMD, systems. AMD
MCA implemenation guarantees each MCA bank can only be accessed by a
single logical CPU, so there is no contention.

Thanks,
Yazen
Tony Luck June 9, 2023, 3:24 p.m. UTC | #34
> So "UCNA" is like the AMD "Deferred" severity it seems. How is this
> different from "Action Optional"? I've been equating DFR and AO.

Categories of uncorrected errors memory errors on Intel

1) "UCNA" ... these are logged by memory controllers when ECC says that a memory read cannot
supply correct data. If CMCI is enabled, signaled with CMCI. Note that these will occur on prefetch
or speculative reads as well as "regular" reads. The data might never be consumed.

2) "SRAO". This is now legacy. Pre-Icelake systems log these for uncorrected errors found by
the patrol scrubber, and for evictions of poison from L3 cache (if that poison was due to an ECC
failure in the cache itself, not for poison created elsewhere and currently resident in the cache).
Signaled with a broadcast machine check. Icelake and newer systems use UCNA for these.

3) "SRAR". Attempt to consume poison (either data read or instruction fetch). Signaled with
machine check. Pre-Skylake this was broadcast. Skylake and newer have an opt-in mechanism
to request #MC delivery to just the logical CPU trying to consume (Linux always opts-in).


UCNA = Uncorrected No Action required. But Linux does take action to try to offline the page.

SRAO = Software Recoverable Action Optional. As with UCNA Linux tries to offline the page.

SRAR = Software Recoverable Action Required. Linux will replace a clean page with a new copy
if it can (think read-only text pages mapped from ELF executable). If not it sends SIGBUS to the
application. Some SRAR in the kernel are recoverable ... see the copy_mc*() functions.

-Tony
Aristeu Rozanski June 9, 2023, 3:59 p.m. UTC | #35
Hi Borislav,

On Tue, Jun 06, 2023 at 04:08:48PM +0200, Borislav Petkov wrote:
> Independent of that, yes, we will try not to pollute it with duplicates
> once we know what the issue exactly is which makes people disable CMCI.

how about this (untested, with possibly a short comment on
intel_cmci_poll_lock()):

--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1618,7 +1618,9 @@ static void mce_timer_fn(struct timer_list *t)
 	iv = __this_cpu_read(mce_next_interval);
 
 	if (mce_available(this_cpu_ptr(&cpu_info))) {
+		bool locked = intel_cmci_poll_lock();
 		machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
+		intel_cmci_poll_unlock(locked);
 
 		if (mce_intel_cmci_poll()) {
 			iv = mce_adjust_timer(iv);
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 95275a5e57e0..c688c088f5cf 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -73,13 +73,10 @@ enum {
 
 static atomic_t cmci_storm_on_cpus;
 
-static int cmci_supported(int *banks)
+static int cmci_supported_hw(int *banks)
 {
 	u64 cap;
 
-	if (mca_cfg.cmci_disabled || mca_cfg.ignore_ce)
-		return 0;
-
 	/*
 	 * Vendor check is not strictly needed, but the initial
 	 * initialization is vendor keyed and this
@@ -96,6 +93,14 @@ static int cmci_supported(int *banks)
 	return !!(cap & MCG_CMCI_P);
 }
 
+static int cmci_supported(int *banks)
+{
+	if (mca_cfg.cmci_disabled || mca_cfg.ignore_ce)
+		return 0;
+
+	return cmci_supported_hw(banks);
+}
+
 static bool lmce_supported(void)
 {
 	u64 tmp;
@@ -519,3 +524,25 @@ bool intel_filter_mce(struct mce *m)
 
 	return false;
 }
+
+static DEFINE_SPINLOCK(cmci_poll_lock);
+bool intel_cmci_poll_lock(void)
+{
+	int banks;
+
+	if (!cmci_supported_hw(&banks))
+		return false;
+
+	spin_lock(&cmci_poll_lock);
+
+	return true;
+}
+
+void intel_cmci_poll_unlock(bool locked)
+{
+	if (!locked)
+		return;
+
+	spin_unlock(&cmci_poll_lock);
+}
+
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index d2412ce2d312..25bcc4a13e8c 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -49,6 +49,8 @@ void intel_init_cmci(void);
 void intel_init_lmce(void);
 void intel_clear_lmce(void);
 bool intel_filter_mce(struct mce *m);
+bool intel_cmci_poll_lock(void);
+void intel_cmci_poll_unlock(bool locked);
 #else
 # define cmci_intel_adjust_timer mce_adjust_timer_default
 static inline bool mce_intel_cmci_poll(void) { return false; }
@@ -58,6 +60,8 @@ static inline void intel_init_cmci(void) { }
 static inline void intel_init_lmce(void) { }
 static inline void intel_clear_lmce(void) { }
 static inline bool intel_filter_mce(struct mce *m) { return false; }
+static inline bool intel_cmci_poll_lock(void) { return false; }
+static inline void intel_cmci_poll_unlock(bool locked) { }
 #endif
 
 void mce_timer_kick(unsigned long interval);
Yazen Ghannam June 9, 2023, 4 p.m. UTC | #36
On 6/9/23 11:24 AM, Luck, Tony wrote:
>> So "UCNA" is like the AMD "Deferred" severity it seems. How is this
>> different from "Action Optional"? I've been equating DFR and AO.
>

Thanks Tony.

> Categories of uncorrected errors memory errors on Intel
> 
> 1) "UCNA" ... these are logged by memory controllers when ECC says that a memory read cannot
> supply correct data. If CMCI is enabled, signaled with CMCI. Note that these will occur on prefetch
> or speculative reads as well as "regular" reads. The data might never be consumed.
>

Yes, this is like AMD.

Key differences:
	* Logged using "Deferred" severity. However, deferred errors
	  aren't always from the memory controller. So there still needs
	  to be an error code check in addition to severity.
	* Signalled with a Deferred error APIC interrupt. This way UC
	  errors can be signalled independently of CEs.

> 2) "SRAO". This is now legacy. Pre-Icelake systems log these for uncorrected errors found by
> the patrol scrubber, and for evictions of poison from L3 cache (if that poison was due to an ECC
> failure in the cache itself, not for poison created elsewhere and currently resident in the cache).
> Signaled with a broadcast machine check. Icelake and newer systems use UCNA for these.
>

Yes, this mostly fails within the Deferred/UCNA case for AMD also.

> 3) "SRAR". Attempt to consume poison (either data read or instruction fetch). Signaled with
> machine check. Pre-Skylake this was broadcast. Skylake and newer have an opt-in mechanism
> to request #MC delivery to just the logical CPU trying to consume (Linux always opts-in).
> 
> 
> UCNA = Uncorrected No Action required. But Linux does take action to try to offline the page.
>

That's right. So we'll use this for the Deferred memory error case. But
there will need to be updates for these to be actionable on AMD systems.

> SRAO = Software Recoverable Action Optional. As with UCNA Linux tries to offline the page.
>

Okay, so ignore these going forward.

> SRAR = Software Recoverable Action Required. Linux will replace a clean page with a new copy
> if it can (think read-only text pages mapped from ELF executable). If not it sends SIGBUS to the
> application. Some SRAR in the kernel are recoverable ... see the copy_mc*() functions.
>

Yep, this mostly works. There's still an AMD IF Unit quirk that needs to
be handled. And the kernel recovery cases needs to be tested.

Thanks again!

-Yazen
diff mbox series

Patch

--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1597,6 +1597,7 @@  static unsigned long check_interval = INITIAL_CHECK_INTERVAL;
 
 static DEFINE_PER_CPU(unsigned long, mce_next_interval); /* in jiffies */
 static DEFINE_PER_CPU(struct timer_list, mce_timer);
+static DEFINE_RAW_SPINLOCK(timer_fn_lock);
 
 static unsigned long mce_adjust_timer_default(unsigned long interval)
 {
@@ -1628,7 +1629,9 @@  static void mce_timer_fn(struct timer_list *t)
 	iv = __this_cpu_read(mce_next_interval);
 
 	if (mce_available(this_cpu_ptr(&cpu_info))) {
+		raw_spin_lock(&timer_fn_lock);
 		machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
+		raw_spin_unlock(&timer_fn_lock);
 
 		if (mce_intel_cmci_poll()) {
 			iv = mce_adjust_timer(iv);