Message ID: 20230515143225.GC4090740@cathedrallabs.org (mailing list archive)
State: New, archived
Series: mce: prevent concurrent polling of MCE events
On Mon, May 15, 2023 at 10:32:25AM -0400, Aristeu Rozanski wrote:
> Error injection in modern HP machines with CMCI disabled will cause the
> injected MCE to be found only by polling. Because these newer machines have
> a big number of CPUs per package, it is a lot more likely for multiple CPUs
> to poll the IMC registers (which are shared within the same package) at the
> same time,

So the MCE code already has some notion of shared banks and tries to pay
attention to it.

I think it'll be cleaner if machine_check_poll() paid attention to that and
polled only on the smallest CPU number which shares the bank, instead of
doing an unconditional wholesale spinlock grab which is not really needed...

Thx.
> So the MCE code already has some notion of shared banks and tries to pay
> attention to it.

The MCE code only knows sharing scope if CMCI is enabled. The algorithm to
determine ownership of a shared bank is "first to find the bank is the owner"
as determined by successful setting of bit 30 in IA32_MCi_CTL2. There isn't
any other practical way to determine the scope of bank sharing.

In Aristeu's case with the HP systems CMCI is disabled. I saw the same on
a system here when booted with "mce=no_cmci". Twelve CPUs (out of 144) were
polling on the same jiffy and reported the same error.

-Tony
On Mon, May 15, 2023 at 05:18:06PM +0000, Luck, Tony wrote:
> The MCE code only knows sharing scope if CMCI is enabled. The algorithm to
> determine ownership of a shared bank is "first to find the bank is the owner"
> as determined by successful setting of bit 30 in IA32_MCi_CTL2. There
> isn't any other practical way to determine scope of bank sharing.

I'm not talking about this ownership - I'm talking about how many logical
cores get to see the same error info because the bank is replicated across
all of them in *hardware*.

> In Aristeu's case with the HP systems CMCI is disabled. I saw the same
> on a system here when booted with "mce=no_cmci". Twelve CPUs (out
> of 144) were polling on the same jiffy and reported the same error.

Is CMCI disabled a valid use case or is this someone doing testing?

Thx.
> I'm not talking about this ownership - I'm talking about how many
> logical cores get to see the same error info because the bank is
> replicated across all of them in *hardware*.

The hardware doesn't enumerate that banks are shared, or which logical CPUs
share each bank. The sharing scope can, and has, changed from one generation
to another. E.g. banks 0 and 1 were core scoped from Nehalem until Cascade
Lake. But Icelake and subsequent (so far) CPUs made them thread scoped.

-Tony
On Mon, May 15, 2023 at 07:08:01PM +0000, Luck, Tony wrote:
> > I'm not talking about this ownership - I'm talking about how many
> > logical cores get to see the same error info because the bank is
> > replicated across all of them in *hardware*.
>
> The hardware doesn't enumerate that banks are shared, or which
> logical CPUs share each bank. The sharing scope can, and has,
> changed from one generation to another. E.g. Banks 0 and 1 were
> core scoped from Nehalem until Cascade Lake. But Icelake and
> subsequent (so far) CPUs made them thread scoped.

So this unconditional serialization is just gross. This is a per-cpu timer
grabbing a global lock. It is going to make unrelated threads all wait on
a single lock while polling, for nothing.

On a huge box with gazillion cores and when the timers fire just at the
right time, they're all going to hit that lock real hard.

There has to be a better way...

Also, from my previous mail: Is CMCI disabled a valid use case or is this
someone doing testing?
> So this unconditional serialization is just gross. This is a per-cpu
> timer grabbing a global lock. It is going to make unrelated threads all
> wait on a single lock while polling, for nothing.
>
> On a huge box with gazillion cores and when the timers fire just at the
> right time, they're all going to hit that lock real hard.

Linux does try to avoid all the CPUs taking the polling interrupt at the
same time. See my earlier e-mail:

https://lore.kernel.org/all/ZF18kdWCWKqFc23p@agluck-desk3.sc.intel.com/

but there are often more CPUs than HZ ticks to spread things out, so we
still get a few collisions.

> There has to be a better way...

Options:

1) Divide up the lock.
   1a) One lock per machine check bank
   1b) One lock per socket
   1c) Combo of 1a & 1b
2) Enumerate the bank sharing with a big table per CPU model number
   [No, we really don't want to do that!]
3) Allow multiple CPUs to read the same error. Add a filter to suppress
   duplicates in the logging path.

> Also, from my previous mail: Is CMCI disabled a valid use case or is
> this someone doing testing?

I disabled CMCI for debugging reasons. Aristeu needs to comment on his
use case.

-Tony
On Mon, May 15, 2023 at 08:07:10PM +0000, Luck, Tony wrote:
> I disabled CMCI for debugging reasons. Aristeu needs to comment on
> his use case.

A partner has a toggle in the BIOS and reported the issue. I suspect it has
to do with a big mutual customer that wants to reduce the CPU spikes caused
by CE storms to a minimum, but I'll get more details.
> A partner has a toggle in the BIOS and reported the issue. I suspect it has
> to do with a big mutual customer that wants to reduce the CPU spikes caused
> by CE storms to a minimum, but I'll get more details.

If CMCI storms are their problem ... perhaps this is an answer:

https://lore.kernel.org/all/20230411173841.70491-1-tony.luck@intel.com/

-Tony
On Mon, May 15, 2023 at 08:27:19PM +0000, Luck, Tony wrote:
> If CMCI storms are their problem ... perhaps this is an answer:
>
> https://lore.kernel.org/all/20230411173841.70491-1-tony.luck@intel.com/

Apparently they're disabling CMCI for CE and UCNA events when OS-first is
enabled for two entire lines of products. So there might be more behind this
than I suspected. I just sent a message to their engineers to find out what's
going on.

Now on the CMCI storm mitigation, wouldn't a system that got a storm be
subject to this same issue of duplicated events, since it switches to
polling?
> Now on the CMCI storm mitigation, wouldn't a system that got a storm be
> subject to this same issue of duplicated events since it switches to polling?

Current code switches to polling. But I think that it makes use of the bank
ownership information that it derived when CMCI was originally enabled.

It wouldn't surprise me if things go wrong if you mix in CPU hotplug with
a storm and offline the CPU that "owns" a bank. Perhaps the ownership will
get passed to another CPU in the same scope. But I'm not sure.

Those new patches don't switch to polling. They just reset the threshold in
MCi_CTL2 to a very big value to throttle the number of interrupts for
corrected errors, while still processing the UCNA errors that are also
signaled using CMCI right away.

-Tony
On Mon, May 15, 2023 at 04:32:20PM -0400, Aristeu Rozanski wrote:
> On Mon, May 15, 2023 at 08:27:19PM +0000, Luck, Tony wrote:
> > If CMCI storms are their problem ... perhaps this is an answer:
> >
> > https://lore.kernel.org/all/20230411173841.70491-1-tony.luck@intel.com/
>
> Apparently they're disabling CMCI for CE and UCNA events when OS-first is
> enabled for two entire lines of products. So there might be more behind
> this than I suspected. I just sent a message to their engineers to find out
> what's going on.

Might also ask them to test Tony's patches, see whether they help.

Thx.
On Mon, May 15, 2023 at 08:07:10PM +0000, Luck, Tony wrote:
> I disabled CMCI for debugging reasons. Aristeu needs to comment on
> his use case.

Got an answer from them. They disable it in two lines of products (one uses
IceLake, the other uses Sapphire Rapids) and in these lines they can use
thresholding to signal UCNA without signaling any corrected events, and it
won't work with CMCI enabled. They did make Intel aware of it (maybe you
heard details about it? If not I can get them in contact with you) but it's
not certain if this can be fixed or not.
On Tue, May 23, 2023 at 10:15:23AM -0400, Aristeu Rozanski wrote:
> On Mon, May 15, 2023 at 08:07:10PM +0000, Luck, Tony wrote:
> > I disabled CMCI for debugging reasons. Aristeu needs to comment on
> > his use case.
>
> Got an answer from them. They disable it in two lines of products
> (one uses IceLake, the other uses Sapphire Rapids) and in these lines
> they can use thresholding to signal UCNA without signaling any corrected
> events and it won't work with CMCI enabled. They did make Intel aware of
> it (maybe you heard details about it? If not I can get them in contact with
> you) but it's not certain if this can be fixed or not.

Tony?
>> > I disabled CMCI for debugging reasons. Aristeu needs to comment on
>> > his use case.
>>
>> Got an answer from them. They disable it in two lines of products
>> (one uses IceLake, the other uses Sapphire Rapids) and in these lines
>> they can use thresholding to signal UCNA without signaling any corrected
>> events and it won't work with CMCI enabled. They did make Intel aware of
>> it (maybe you heard details about it? If not I can get them in contact with
>> you) but it's not certain if this can be fixed or not.

Since mce=no_cmci is a supported mode, and end-users are making use of it
on production systems ... I think Linux needs a patch to prevent multiple
CPUs logging the same error.

I don't have any other ideas on how to do that besides Aristeu's patch to
add a spinlock to prevent simultaneous access. This isn't a hot path, and
in most cases not contended, so I don't see much downside.

-Tony
On Mon, Jun 05, 2023 at 03:33:48PM +0000, Luck, Tony wrote:
> Since mce=no_cmci is a supported mode, and end-users are making use
> of it on production systems ... I think Linux needs a patch to prevent multiple
> CPUs logging the same error.
>
> I don't have any other ideas

I have some:

1. One can teach mce_gen_pool_add() to scan what's already on the list
   and drop duplicate errors. Not sure if it is worth it.

or

2. Carve out that abysmal locking into a function called
   poll_cmci_disabled() or so, run this function only on CMCI machines
   and make sure the use case is properly explained in the commit
   message.

I have no clue what "they can use thresholding to signal UCNA without
signaling any corrected events and it won't work with CMCI enabled" means.

I thought HP use firmware first and all that magic EFI gunk to do RAS and
don't rely on OS facilities. Oh well...
>> I don't have any other ideas
>
> I have some:
>
> 1. One can teach mce_gen_pool_add() to scan what's already on the list
>    and drop duplicate errors. Not sure if it is worth it.

That would need to take care to only strip out truly duplicated reports, but
leave in genuine repeats of errors from the same location. I agree this path
isn't worth taking.

> or
>
> 2. Carve out that abysmal locking into a function called
>    poll_cmci_disabled() or so, run this function only on CMCI machines
>    and make sure the use case is properly explained in the commit
>    message.

How much code do you want to duplicate? machine_check_poll() is ~100 lines,
most of it in the racy section from:

	m.status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS));

to

	mce_wrmsrl(mca_msr_reg(i, MCA_STATUS), 0);

That looks like a long term maintenance pain to keep the CMCI on/off
versions in step.

> I have no clue what "they can use thresholding to signal UCNA without
> signaling any corrected events and it won't work with CMCI enabled."

I can't parse that either. Maybe rooted in the fact that UCNA ignores the
threshold in MCi_CTL2 ... but maybe there's some missing description of what
the SMI handler is doing.

> I thought HP use firmware first and all that magic EFI gunk to do RAS
> and don't rely on OS facilities. Oh well...

HPE are big fans of firmware first for corrected errors. It's more
complicated for uncorrected ones as the OS really needs to know about those.

-Tony
On Mon, Jun 05, 2023 at 07:41:04PM +0200, Borislav Petkov wrote:
> I have no clue what "they can use thresholding to signal UCNA without
> signaling any corrected events and it won't work with CMCI enabled."

That's pretty much exactly what was passed to me. I'll ask who the Intel
contact was and inform Tony so they can talk internally.

-- 
Aristeu
On Mon, Jun 05, 2023 at 05:58:40PM +0000, Luck, Tony wrote:
> That would need to take care to only strip out truly duplicated reports, but
> leave in genuine repeats of errors from the same location.

Yeah, timestamps will make them unique.

> I agree this path isn't worth taking.

I have a feeling we might need to dedup at some point but we can do that in
luserspace.

> How much code do you want to duplicate?

The usual. See untested diff below.

> I can't parse that either. Maybe rooted in the fact that UCNA ignores the
> threshold in MCi_CTL2 ... but maybe there's some missing description
> of what the SMI handler is doing.

If you don't know what this does either, then this is going nowhere before
it is properly explained.

> HPE are big fans of firmware first for corrected errors. It's more complicated
> for uncorrected ones as the OS really needs to know about those.

Bah, the firmware is so powerful and perfect. What does it need the OS for?
</sarcasm>

---

Btw, why was that spinlock raw?

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 22dfcb2adcd7..bcdf5b52003c 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1595,6 +1595,13 @@ static unsigned long mce_adjust_timer_default(unsigned long interval)
 
 static unsigned long (*mce_adjust_timer)(unsigned long interval) = mce_adjust_timer_default;
 
+static void poll_cmci_disabled_default(void)
+{
+	machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
+}
+
+static void (*poll_cmci_disabled)(void) = poll_cmci_disabled_default;
+
 static void __start_timer(struct timer_list *t, unsigned long interval)
 {
 	unsigned long when = jiffies + interval;
@@ -1618,7 +1625,7 @@ static void mce_timer_fn(struct timer_list *t)
 	iv = __this_cpu_read(mce_next_interval);
 
 	if (mce_available(this_cpu_ptr(&cpu_info))) {
-		machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
+		poll_cmci_disabled();
 
 		if (mce_intel_cmci_poll()) {
 			iv = mce_adjust_timer(iv);
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 95275a5e57e0..4722a57baf1e 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -71,6 +71,8 @@ enum {
 	CMCI_STORM_SUBSIDED,
 };
 
+static DEFINE_RAW_SPINLOCK(timer_fn_lock);
+
 static atomic_t cmci_storm_on_cpus;
 
 static int cmci_supported(int *banks)
@@ -426,12 +428,22 @@ void cmci_disable_bank(int bank)
 	raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
 }
 
+static void poll_cmci_disabled_intel(void)
+{
+	raw_spin_lock(&timer_fn_lock);
+	machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
+	raw_spin_unlock(&timer_fn_lock);
+}
+
 void intel_init_cmci(void)
 {
 	int banks;
 
-	if (!cmci_supported(&banks))
+	if (!cmci_supported(&banks)) {
+		if (mca_cfg.cmci_disabled)
+			poll_cmci_disabled = poll_cmci_disabled_intel;
 		return;
+	}
 
 	mce_threshold_vector = intel_threshold_interrupt;
 	cmci_discover(banks);
>> How much code do you want to duplicate?
>
> The usual. See untested diff below.

That looks like it covers the "mce=no_cmci" case. What about the case where
CMCI is disabled temporarily because of a storm?

-Tony
On Mon, Jun 05, 2023 at 07:37:37PM +0000, Luck, Tony wrote:
> That looks like it covers the "mce=no_cmci" case. What about the case where
> CMCI is disabled temporarily because of a storm?

Then the decision which function to call will have to be dynamic and not
with function pointers assigned once, statically. This was just an example
of how it could be done.
On Mon, Jun 05, 2023 at 09:30:00PM +0200, Borislav Petkov wrote:
> Btw, why was that spinlock raw?
Didn't want RT code thinking it could preempt us but it doesn't really
matter.
On Mon, Jun 05, 2023 at 09:30:00PM +0200, Borislav Petkov wrote:
> I have a feeling we might need to dedup at some point but we can do that
> in luserspace.

Problem with throwing this to userspace is that it won't get rid of the
duplication in the kernel log.
On Mon, Jun 05, 2023 at 04:33:15PM -0400, Aristeu Rozanski wrote:
> Problem with throwing this to userspace is that it won't get rid of the
> duplication in the kernel log.

The kernel log never was, and is not, a source to be scraped for hw errors.
The right thing is called tracepoints and rasdaemon uses them already.
On Mon, Jun 05, 2023 at 10:56:58PM +0200, Borislav Petkov wrote:
> On Mon, Jun 05, 2023 at 04:33:15PM -0400, Aristeu Rozanski wrote:
> > Problem with throwing this to userspace is that it won't get rid of the
> > duplication in the kernel log.
>
> The kernel log never was, and is not, a source to be scraped for hw errors.
> The right thing is called tracepoints and rasdaemon uses them already.

Yes, but I'm talking about flooding the log with duplicates.
On Mon, Jun 05, 2023 at 05:01:08PM -0400, Aristeu Rozanski wrote:
> Yes, but I'm talking about flooding the log with duplicates.
And?
We don't *have* to log them if someone's too confused by staring at
something someone should not be staring at in the first place.
But let's see how valid that "use case" actually is first...
> But let's see how valid that "use case" actually is first...
I'm going to meet with the users later this week to find out what they are doing.
-Tony
On Mon, Jun 05, 2023 at 11:06:10PM +0200, Borislav Petkov wrote:
> On Mon, Jun 05, 2023 at 05:01:08PM -0400, Aristeu Rozanski wrote:
> > Yes, but I'm talking about flooding the log with duplicates.
>
> And?
>
> We don't *have* to log them if someone's too confused by staring at
> something someone should not be staring at in the first place.

It makes support people's lives worse by reducing the amount of useful
information in dmesg when it can be avoided, as in the case of duplicates.
Getting rid of the messages completely would be harmful as well: customers
don't always run rasdaemon or provide the database with events, and the
messages can sometimes be useful as hints.

> But let's see how valid that "use case" actually is first...

Of course. Tony is already aware of who inside Intel talked to them and
they'll talk.

-- 
Aristeu
On Mon, Jun 05, 2023 at 05:58:39PM -0400, Aristeu Rozanski wrote:
> It makes support people's lives worse by reducing the amount of useful
> information in dmesg when it can be avoided, as in the case of duplicates.
> Getting rid of the messages completely would be harmful as well: customers
> don't always run rasdaemon or provide the database with events, and the
> messages can sometimes be useful as hints.

Lemme make sure you realize what you're arguing for: you want to not log
MCE duplicates in a ring buffer which can get overwritten due to a bunch
of reasons and as such is best effort only anyway.

Hell, it can get overwritten even by non-duplicate MCEs if there are just
enough of them to overflow the buffer's size.

So relying on dmesg for any serious error counting or whatnot is an
exercise in futility.
On Tue, Jun 06, 2023 at 10:25:41AM +0200, Borislav Petkov wrote:
> Lemme make sure you realize what you're arguing for: you want to not log
> MCE duplicates in a ring buffer which can get overwritten due to a bunch
> of reasons and as such is best effort only anyway.
>
> Hell, it can get overwritten even by non-duplicate MCEs if there are just
> enough of them to overflow the buffer's size.
>
> So relying on dmesg for any serious error counting or whatnot is an
> exercise in futility.

I'm not sure when I made you believe I want to do error counting or error
parsing from the kernel log.

I started by saying that we need to avoid, when possible, wasting kernel
log space. Duplicates are a waste. You mentioned not even logging MCEs at
all and I said that I believe that's not right, as it might be useful to
be aware of MCEs being delivered before a crash.

-- 
Aristeu
On Tue, Jun 06, 2023 at 10:00:11AM -0400, Aristeu Rozanski wrote:
> I started by saying that we need to avoid, when possible, wasting kernel
> log space. Duplicates are a waste. You mentioned not even logging MCEs at
> all and I said that I believe that's not right, as it might be useful to
> be aware of MCEs being delivered before a crash.

You don't have to summarize the whole thread - I know what we're talking
about.

And just to give you another example: if the MCE which caused that crash
gets overwritten in dmesg, your saved kernel log is useless as you want the
*first* one logged.

That's all I'm saying - the kernel log is best effort - nothing more.

Independent of that, yes, we will try not to pollute it with duplicates
once we know what the issue exactly is which makes people disable CMCI.

Thx.
> Independent of that, yes, we will try not to pollute it with duplicates
> once we know what the issue exactly is which makes people disable CMCI.

I talked to the vendor today. They have customers who want the OS to see
all reported corrected errors, but they also need to have their firmware
keep logs of all uncorrected errors (so their field engineers can replace
the DIMMs that are throwing the UC errors).

Intel machine check architecture has conflated processing of corrected
errors and "UCNA" (the log that comes from the memory controller when ECC
says the error is uncorrected ... this is distinct from the log that comes
from the core when the poison data is consumed and signals a machine check).

So the closest they can get to meeting both the customer requirement and
their service requirement is to have their BIOS configured to get an SMI
for the UCNA *and* tell the users to run in mce=no_cmci mode. If Linux
enables CMCI, then the h/w will also generate an SMI for every corrected
error. Neither the customer nor the vendor wants that overhead.

Bottom line: this is a legitimate, production, use case.

-Tony
On Fri, Jun 09, 2023 at 12:26:56AM +0000, Luck, Tony wrote:
> > Independent of that, yes, we will try not to pollute it with duplicates
> > once we know what the issue exactly is which makes people disable CMCI.
>
> I talked to the vendor today. They have customers who want the OS to see
> all reported corrected errors, but they also need to have their firmware keep
> logs of all uncorrected errors (so their field engineers can replace the DIMMs
> that are throwing the UC errors).
>
> Intel machine check architecture has conflated processing of corrected errors
> and "UCNA" (the log that comes from the memory controller when ECC says
> the error is uncorrected ... this is distinct from the log that comes from the core
> when the poison data is consumed and signals a machine check).

I'd expect poison data consumption not to be a "UCNA" - i.e., "No Action"
required :)

> So the closest they can get to meeting both the customer requirement and
> their service requirement is to have their BIOS configured to get an SMI for
> the UCNA *and* tell the users to run in mce=no_cmci mode. If Linux enables
> CMCI, then the h/w will also generate an SMI for every corrected error. Neither
> the customer nor the vendor wants that overhead.
>
> Bottom line: this is a legitimate, production, use case.

Ok, that makes more sense. Thanks.
On 6/9/23 6:17 AM, Borislav Petkov wrote:
> On Fri, Jun 09, 2023 at 12:26:56AM +0000, Luck, Tony wrote:
>>> Independent of that, yes, we will try not to pollute it with duplicates
>>> once we know what the issue exactly is which makes people disable CMCI.
>>
>> I talked to the vendor today. They have customers who want the OS to see
>> all reported corrected errors, but they also need to have their firmware keep
>> logs of all uncorrected errors (so their field engineers can replace the DIMMs
>> that are throwing the UC errors).
>>
>> Intel machine check architecture has conflated processing of corrected errors
>> and "UCNA" (the log that comes from the memory controller when ECC says
>> the error is uncorrected ... this is distinct from the log that comes from the core
>> when the poison data is consumed and signals a machine check).
>
> I'd expect poison data consumption not to be a "UCNA" - i.e., "No
> Action" required :)

So "UCNA" is like the AMD "Deferred" severity, it seems. How is this
different from "Action Optional"? I've been equating DFR and AO.

>> So the closest they can get to meeting both the customer requirement and
>> their service requirement is to have their BIOS configured to get an SMI for
>> the UCNA *and* tell the users to run in mce=no_cmci mode. If Linux enables
>> CMCI, then the h/w will also generate an SMI for every corrected error. Neither
>> the customer nor the vendor wants that overhead.
>>
>> Bottom line: this is a legitimate, production, use case.
>
> Ok, that makes more sense.

Please do restrict this change to Intel, or at least !AMD, systems. The AMD
MCA implementation guarantees each MCA bank can only be accessed by a single
logical CPU, so there is no contention.

Thanks,
Yazen
> So "UCNA" is like the AMD "Deferred" severity it seems. How is this
> different from "Action Optional"? I've been equating DFR and AO.

Categories of uncorrected memory errors on Intel:

1) "UCNA" ... these are logged by memory controllers when ECC says that
   a memory read cannot supply correct data. If CMCI is enabled, signaled
   with CMCI. Note that these will occur on prefetch or speculative reads
   as well as "regular" reads. The data might never be consumed.

2) "SRAO". This is now legacy. Pre-Icelake systems log these for
   uncorrected errors found by the patrol scrubber, and for evictions of
   poison from L3 cache (if that poison was due to an ECC failure in the
   cache itself, not for poison created elsewhere and currently resident
   in the cache). Signaled with a broadcast machine check. Icelake and
   newer systems use UCNA for these.

3) "SRAR". Attempt to consume poison (either data read or instruction
   fetch). Signaled with machine check. Pre-Skylake this was broadcast.
   Skylake and newer have an opt-in mechanism to request #MC delivery to
   just the logical CPU trying to consume (Linux always opts in).

UCNA = Uncorrected No Action required. But Linux does take action to try
to offline the page.

SRAO = Software Recoverable Action Optional. As with UCNA, Linux tries to
offline the page.

SRAR = Software Recoverable Action Required. Linux will replace a clean
page with a new copy if it can (think read-only text pages mapped from an
ELF executable). If not, it sends SIGBUS to the application. Some SRAR in
the kernel are recoverable ... see the copy_mc*() functions.

-Tony
Hi Borislav,

On Tue, Jun 06, 2023 at 04:08:48PM +0200, Borislav Petkov wrote:
> Independent of that, yes, we will try not to pollute it with duplicates
> once we know what the issue exactly is which makes people disable CMCI.

How about this (untested, with possibly a short comment on
intel_cmci_poll_lock()):

--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1618,7 +1618,9 @@ static void mce_timer_fn(struct timer_list *t)
 	iv = __this_cpu_read(mce_next_interval);
 
 	if (mce_available(this_cpu_ptr(&cpu_info))) {
+		bool locked = intel_cmci_poll_lock();
 		machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
+		intel_cmci_poll_unlock(locked);
 
 		if (mce_intel_cmci_poll()) {
 			iv = mce_adjust_timer(iv);
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 95275a5e57e0..c688c088f5cf 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -73,13 +73,10 @@ enum {
 
 static atomic_t cmci_storm_on_cpus;
 
-static int cmci_supported(int *banks)
+static int cmci_supported_hw(int *banks)
 {
 	u64 cap;
 
-	if (mca_cfg.cmci_disabled || mca_cfg.ignore_ce)
-		return 0;
-
 	/*
 	 * Vendor check is not strictly needed, but the initial
 	 * initialization is vendor keyed and this
@@ -96,6 +93,14 @@ static int cmci_supported(int *banks)
 	return !!(cap & MCG_CMCI_P);
 }
 
+static int cmci_supported(int *banks)
+{
+	if (mca_cfg.cmci_disabled || mca_cfg.ignore_ce)
+		return 0;
+
+	return cmci_supported_hw(banks);
+}
+
 static bool lmce_supported(void)
 {
 	u64 tmp;
@@ -519,3 +524,25 @@ bool intel_filter_mce(struct mce *m)
 
 	return false;
 }
+
+static DEFINE_SPINLOCK(cmci_poll_lock);
+bool intel_cmci_poll_lock(void)
+{
+	int banks;
+
+	if (!cmci_supported_hw(&banks))
+		return false;
+
+	spin_lock(&cmci_poll_lock);
+
+	return true;
+}
+
+void intel_cmci_poll_unlock(bool locked)
+{
+	if (!locked)
+		return;
+
+	spin_unlock(&cmci_poll_lock);
+}
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index d2412ce2d312..25bcc4a13e8c 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -49,6 +49,8 @@ void intel_init_cmci(void);
 void intel_init_lmce(void);
 void intel_clear_lmce(void);
 bool intel_filter_mce(struct mce *m);
+bool intel_cmci_poll_lock(void);
+void intel_cmci_poll_unlock(bool locked);
 #else
 # define cmci_intel_adjust_timer mce_adjust_timer_default
 static inline bool mce_intel_cmci_poll(void) { return false; }
@@ -58,6 +60,8 @@ static inline void intel_init_cmci(void) { }
 static inline void intel_init_lmce(void) { }
 static inline void intel_clear_lmce(void) { }
 static inline bool intel_filter_mce(struct mce *m) { return false; }
+static inline bool intel_cmci_poll_lock(void) { return false; }
+static inline void intel_cmci_poll_unlock(bool locked) { }
 #endif
 
 void mce_timer_kick(unsigned long interval);
On 6/9/23 11:24 AM, Luck, Tony wrote:
>> So "UCNA" is like the AMD "Deferred" severity it seems. How is this
>> different from "Action Optional"? I've been equating DFR and AO.

Thanks Tony.

> Categories of uncorrected memory errors on Intel:
>
> 1) "UCNA" ... these are logged by memory controllers when ECC says that
> a memory read cannot supply correct data. If CMCI is enabled, signaled
> with CMCI. Note that these will occur on prefetch or speculative reads
> as well as "regular" reads. The data might never be consumed.

Yes, this is like AMD. Key differences:

* Logged using "Deferred" severity. However, deferred errors aren't always
  from the memory controller, so there still needs to be an error code
  check in addition to severity.
* Signalled with a Deferred error APIC interrupt. This way UC errors can
  be signalled independently of CEs.

> 2) "SRAO". This is now legacy. Pre-Icelake systems log these for
> uncorrected errors found by the patrol scrubber, and for evictions of
> poison from L3 cache (if that poison was due to an ECC failure in the
> cache itself, not for poison created elsewhere and currently resident
> in the cache). Signaled with a broadcast machine check. Icelake and
> newer systems use UCNA for these.

Yes, this mostly falls within the Deferred/UCNA case for AMD also.

> 3) "SRAR". Attempt to consume poison (either data read or instruction
> fetch). Signaled with machine check. Pre-Skylake this was broadcast.
> Skylake and newer have an opt-in mechanism to request #MC delivery to
> just the logical CPU trying to consume (Linux always opts in).
>
> UCNA = Uncorrected No Action required. But Linux does take action to
> try to offline the page.

That's right. So we'll use this for the Deferred memory error case. But
there will need to be updates for these to be actionable on AMD systems.

> SRAO = Software Recoverable Action Optional. As with UCNA, Linux tries
> to offline the page.

Okay, so ignore these going forward.

> SRAR = Software Recoverable Action Required. Linux will replace a clean
> page with a new copy if it can (think read-only text pages mapped from
> an ELF executable). If not, it sends SIGBUS to the application. Some
> SRAR in the kernel are recoverable ... see the copy_mc*() functions.

Yep, this mostly works. There's still an AMD IF Unit quirk that needs to
be handled. And the kernel recovery cases need to be tested.

Thanks again!
-Yazen
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1597,6 +1597,7 @@ static unsigned long check_interval = INITIAL_CHECK_INTERVAL;
 
 static DEFINE_PER_CPU(unsigned long, mce_next_interval); /* in jiffies */
 static DEFINE_PER_CPU(struct timer_list, mce_timer);
+static DEFINE_RAW_SPINLOCK(timer_fn_lock);
 
 static unsigned long mce_adjust_timer_default(unsigned long interval)
 {
@@ -1628,7 +1629,9 @@ static void mce_timer_fn(struct timer_list *t)
 	iv = __this_cpu_read(mce_next_interval);
 
 	if (mce_available(this_cpu_ptr(&cpu_info))) {
+		raw_spin_lock(&timer_fn_lock);
 		machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
+		raw_spin_unlock(&timer_fn_lock);
 
 		if (mce_intel_cmci_poll()) {
 			iv = mce_adjust_timer(iv);