
x86/mce: Increase the size of the MCE pool from 2 to 8 pages

Message ID 20231011163320.79732-1-sironi@amazon.de (mailing list archive)
State New, archived
Series x86/mce: Increase the size of the MCE pool from 2 to 8 pages

Commit Message

Sironi, Filippo Oct. 11, 2023, 4:33 p.m. UTC
On some of our large servers and some of our most sorry servers ( :) ),
we're seeing the kernel report the warning from mce_gen_pool_add(): "MCE
records pool full!". Let's increase the amount of memory used to store
MCE records from 2 to 8 pages, to prevent this from happening and to be
able to collect useful information.

Signed-off-by: Filippo Sironi <sironi@amazon.de>
---
 arch/x86/kernel/cpu/mce/genpool.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Dave Hansen Oct. 11, 2023, 5:32 p.m. UTC | #1
On 10/11/23 09:33, Filippo Sironi wrote:
> On some of our large servers and some of our most sorry servers ( 
Sironi, Filippo Oct. 12, 2023, 11:46 a.m. UTC | #2
On 10/11/23 19:33, Dave Hansen wrote: 
> On 10/11/23 09:33, Filippo Sironi wrote:
> > On some of our large servers and some of our most sorry servers ( 
Borislav Petkov Oct. 12, 2023, 11:52 a.m. UTC | #3
On Thu, Oct 12, 2023 at 11:46:13AM +0000, Sironi, Filippo wrote:
> I'm happy to make this a compile time configuration where the default is
> still 2 pages, to avoid changing the status quo.

Why don't you keep this patch in your kernels, simply, as this seems to
be an issue only for you currently?
Dave Hansen Oct. 12, 2023, 3:49 p.m. UTC | #4
On 10/12/23 04:46, Sironi, Filippo wrote:
> There's correlation across the errors that we're seeing; indeed,
> we're looking at the same row being responsible for multiple CPUs
> tripping and running into #MC. I still don't like the complete lack of
> visibility; it's not uncommon in a large fleet to take a server out of
> production, replace a DIMM, and shortly after take it out of
> production again to replace another DIMM just because some of the
> errors weren't properly logged.

So you had two nearly simultaneous DIMM failures.  The first failed,
filled up the buffer and then the second failed, but there was no room.
The second failed *SO* soon after the first that there was no
opportunity to empty the buffer between.

Right?

How do you know that storing 8 pages of records will catch this case as
opposed to storing 2?

>> Is there any way that the size of the pool can be more automatically
>> determined? Is the likelihood of a bunch of errors proportional to the
>> number of CPUs or amount of RAM or some other aspect of the hardware?
>>
>> Could the pool be emptied more aggressively so that it does not fill up?

You didn't really address the additional questions I posed there.

I'll add one more: how many of the messages are duplicates or
*effectively* duplicates?  Or is it hard to determine, at the time the
entries are being made, that they are duplicates?

It _should_ also be fairly easy to enlarge the buffer on demand, say, if
it got half full.  What's the time scale over which the buffer filled
up?  Did a single #MC fill it up?

I really think we need to understand what the problem is and have _some_
confidence that the proposed solution will fix that, even if we're just
talking about a new config option.
Yazen Ghannam Oct. 16, 2023, 2:14 p.m. UTC | #5
On 10/12/23 11:49 AM, Dave Hansen wrote:
> On 10/12/23 04:46, Sironi, Filippo wrote:
>> There's correlation across the errors that we're seeing; indeed,
>> we're looking at the same row being responsible for multiple CPUs
>> tripping and running into #MC. I still don't like the complete lack of
>> visibility; it's not uncommon in a large fleet to take a server out of
>> production, replace a DIMM, and shortly after take it out of
>> production again to replace another DIMM just because some of the
>> errors weren't properly logged.
> 
> So you had two nearly simultaneous DIMM failures.  The first failed,
> filled up the buffer and then the second failed, but there was no room.
> The second failed *SO* soon after the first that there was no
> opportunity to empty the buffer between.
> 
> Right?
> 
> How do you know that storing 8 pages of records will catch this case as
> opposed to storing 2?
> 
>>> Is there any way that the size of the pool can be more automatically
>>> determined? Is the likelihood of a bunch of errors proportional to the
>>> number of CPUs or amount of RAM or some other aspect of the hardware?
>>>
>>> Could the pool be emptied more aggressively so that it does not fill up?
> 
> You didn't really address the additional questions I posed there.
> 
> I'll add one more: how many of the messages are duplicates or
> *effectively* duplicates?  Or is it hard to determine, at the time the
> entries are being made, that they are duplicates?
> 
> It _should_ also be fairly easy to enlarge the buffer on demand, say, if
> it got half full.  What's the time scale over which the buffer filled
> up?  Did a single #MC fill it up?
> 
> I really think we need to understand what the problem is and have _some_
> confidence that the proposed solution will fix that, even if we're just
> talking about a new config option.

I've seen a similar issue, and it's not just related to memory errors.
In my experience it was MCA errors from a variety of hardware blocks.
For example, a bad link internal to an SoC could spew MCA errors
regardless of the amount of RAM or the number of CPUs. The same thing is
possible for a bad cache, etc.

These were during pre-production testing, and the easy workaround is to
increase the MCE genpool size at build time.

I don't think this needs to be the default though.

How about this to start?

1) Keep the current config size for boot time.
2) Add a kernel parameter and/or sysfs file to allow users to request
additional genpool capacity.
3) Use gen_pool_add(), or whichever, to add the capacity based on user
input.

Maybe this can be expanded later to be automatic. But I think it simpler
to start with explicit user input.

Thanks,
Yazen
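
As a rough illustration of steps 2) and 3) in Yazen's list above, the
sketch below adds a boot parameter that requests extra genpool capacity
and a helper that grows the pool with gen_pool_add() from process
context. The parameter name ("mce_genpool_extend="), the variables and
the helper are hypothetical; only gen_pool_add(), alloc_pages_exact(),
kstrtouint() and __setup() are existing kernel interfaces, and
mce_evt_pool is the pool already defined in genpool.c.

/* Hypothetical sketch only -- none of these names exist upstream. */
static unsigned int mce_genpool_extra_pages;

static int __init setup_mce_genpool_extend(char *str)
{
        if (kstrtouint(str, 0, &mce_genpool_extra_pages))
                return 0;
        return 1;
}
__setup("mce_genpool_extend=", setup_mce_genpool_extend);

/* Call from process context, after mce_evt_pool has been created. */
static int mce_gen_pool_extend(void)
{
        size_t size = mce_genpool_extra_pages * PAGE_SIZE;
        void *buf;

        if (!size)
                return 0;

        buf = alloc_pages_exact(size, GFP_KERNEL);
        if (!buf)
                return -ENOMEM;

        /* Hand the new buffer to the existing MCE records pool. */
        return gen_pool_add(mce_evt_pool, (unsigned long)buf, size, -1);
}
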
Dave Hansen Oct. 16, 2023, 2:24 p.m. UTC | #6
On 10/16/23 07:14, Yazen Ghannam wrote:
> 1) Keep the current config size for boot time.
> 2) Add a kernel parameter and/or sysfs file to allow users to request
> additional genpool capacity.
> 3) Use gen_pool_add(), or whichever, to add the capacity based on user
> input.
> 
> Maybe this can be expanded later to be automatic. But I think it
> simpler to start with explicit user input.

I guarantee virtually nobody will ever use an explicit kernel interface
to bump the size up.  It'll be the same exact folks that recompile their
kernels.

An automatic resizing one doesn't have to be fancy and only has to
expand once:

static bool expanded = false;

...

	if (full && !expanded) {
		expand();
		expanded = true;
	}

It might be a _wee_ bit worse than that because you might have to queue
some work outside of #MC context but seriously we're talking 10-ish
lines of code.  It'd probably be even smaller than doing it when poked
by userspace and wouldn't involve new ABI.
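
To make that concrete while keeping the allocation out of #MC context,
one rough sketch (all new names here are hypothetical; MCE_POOLSZ,
mce_evt_pool, gen_pool_add(), alloc_pages_exact() and free_pages_exact()
are existing kernel identifiers) is to only set a flag from the #MC path
and do the actual expansion later from process context, e.g. from the
path that already drains the pool:

static atomic_t mce_pool_expand_req;
static bool mce_pool_expanded;

/* #MC context: called when allocating a record from the pool fails. */
static void mce_gen_pool_request_expand(void)
{
        /* NMI-safe: no locks taken, no memory allocated here. */
        atomic_set(&mce_pool_expand_req, 1);
}

/* Process context, e.g. before draining records from the pool. */
static void mce_gen_pool_maybe_expand(void)
{
        void *buf;

        if (mce_pool_expanded || !atomic_read(&mce_pool_expand_req))
                return;

        mce_pool_expanded = true;

        buf = alloc_pages_exact(MCE_POOLSZ, GFP_KERNEL);
        if (!buf)
                return;

        /* gen_pool_add() returns 0 on success; free the buffer otherwise. */
        if (gen_pool_add(mce_evt_pool, (unsigned long)buf, MCE_POOLSZ, -1))
                free_pages_exact(buf, MCE_POOLSZ);
}
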
Borislav Petkov Oct. 16, 2023, 2:40 p.m. UTC | #7
On Mon, Oct 16, 2023 at 07:24:34AM -0700, Dave Hansen wrote:
> On 10/16/23 07:14, Yazen Ghannam wrote:
> > 1) Keep the current config size for boot time.
> > 2) Add a kernel parameter and/or sysfs file to allow users to request
> > additional genpool capacity.
> > 3) Use gen_pool_add(), or whichever, to add the capacity based on user
> > input.
> > 
> > Maybe this can be expanded later to be automatic. But I think it
> > simpler to start with explicit user input.
> 
> I guarantee virtually nobody will ever use an explicit kernel interface
> to bump the size up.  It'll be the same exact folks that recompile their
> kernels.
> 
> An automatic resizing one doesn't have to be fancy and only has to
> expand once:

Can we slow down here pls?

You expand once, and then that stuck-bit error increases its rate of
reported MCEs and all of a sudden the raised size is overflowed again.

Or you add logic that thresholds duplicate errors, similar to the CMCI
storm logic we have.

*If* that is the case at all.

So can we analyze first what type of errors are reported and why they're
overflowing the pool before we do ad-hoc hackery?

Thx.
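
As one possible shape for the thresholding idea, a duplicate check could
run before a new record is copied into the pool; mce_gen_pool_add()
would then drop, or merely count, the record when it returns true. The
helper and the notion of "same error" (bank/status/addr/misc) below are
hypothetical; mce_event_llist, struct mce_evt_llist and
llist_for_each_entry() are existing kernel identifiers.

static bool mce_is_duplicate(struct mce *m)
{
        struct mce_evt_llist *node;
        struct mce *old;

        /* Walk the records already queued for processing. */
        llist_for_each_entry(node, mce_event_llist.first, llnode) {
                old = &node->mce;
                if (old->bank == m->bank && old->status == m->status &&
                    old->addr == m->addr && old->misc == m->misc)
                        return true;
        }

        return false;
}
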
Yazen Ghannam Oct. 16, 2023, 2:47 p.m. UTC | #8
On 10/16/23 10:24 AM, Dave Hansen wrote:
> On 10/16/23 07:14, Yazen Ghannam wrote:
>> 1) Keep the current config size for boot time.
>> 2) Add a kernel parameter and/or sysfs file to allow users to request
>> additional genpool capacity.
>> 3) Use gen_pool_add(), or whichever, to add the capacity based on user
>> input.
>> 
>> Maybe this can be expanded later to be automatic. But I think it
>> simpler to start with explicit user input.
> 
> I guarantee virtually nobody will ever use an explicit kernel interface
> to bump the size up.  It'll be the same exact folks that recompile their
> kernels.

Right, I agree. My intent was to help avoid a recompile if, for example,
the configured size was not enough. Meaning, even if a fleet runs a custom
kernel with an 8-page pool, it may happen that more is needed. But "may
happen" isn't a strong reason to make a change here. :)

> 
> An automatic resizing one doesn't have to be fancy and only has to
> expand once:
>

Why once?

> static bool expanded = false;
> 
> ...
> 
> 	if (full && !expanded) {
> 		expand();
> 		expanded = true;
> 	}
> 
> It might be a _wee_ bit worse than that because you might have to queue
> some work outside of #MC context but seriously we're talking 10-ish
> lines of code.  It'd probably be even smaller than doing it when poked
> by userspace and wouldn't involve new ABI.

Okay, I'm with you here.

Thanks,
Yazen
Luck, Tony Oct. 16, 2023, 4:14 p.m. UTC | #9
> > An automatic resizing one doesn't have to be fancy and only has to
> > expand once:
> >
>
> Why once?
>
> > static bool expanded = false;
> >
> > ...
> >
> >     if (full && !expanded) {
> >             expand();
> >             expanded = true;
> >     }
> >
> > It might be a _wee_ bit worse than that because you might have to queue
> > some work outside of #MC context but seriously we're talking 10-ish
> > lines of code.  It'd probably be even smaller than doing it when poked
> > by userspace and wouldn't involve new ABI.
>
> Okay, I'm with you here.

I'm not. The reason that pool exists is so that an error record (struct mce)
can be copied into it during machine check context and atomically added
to the mce_event_llist. See mce_gen_pool_add().

I don't see how your "expand()" function is going to safely allocate additional
pages to the pool while executing in the machine check handler.

Just make it one of:

1) CONFIG option
2) Boot command line option
3) Dynamic, based on num_online_cpus()

-Tony
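
For illustration, option 3) could look roughly like the sketch below,
which sizes the pool from the CPU count at init time instead of using
the fixed MCE_POOLSZ. The constants (MCE_MIN_RECORDS,
MCE_RECORDS_PER_CPU) and the function's exact shape are hypothetical;
gen_pool_create(), gen_pool_add(), gen_pool_destroy(), order_base_2(),
kcalloc() and num_online_cpus() are existing kernel APIs, and
mce_evt_pool / struct mce_evt_llist are the existing genpool.c / MCE
structures.

#define MCE_MIN_RECORDS         80      /* roughly today's two pages */
#define MCE_RECORDS_PER_CPU     4       /* hypothetical scaling factor */

static int mce_gen_pool_create_sized(void)
{
        struct gen_pool *gpool;
        int records, order;
        void *buf;

        records = max_t(int, MCE_MIN_RECORDS,
                        MCE_RECORDS_PER_CPU * num_online_cpus());

        /* One pool chunk per record, rounded up to a power of two. */
        order = order_base_2(sizeof(struct mce_evt_llist));
        gpool = gen_pool_create(order, -1);
        if (!gpool)
                return -ENOMEM;

        buf = kcalloc(records, 1 << order, GFP_KERNEL);
        if (!buf) {
                gen_pool_destroy(gpool);
                return -ENOMEM;
        }

        if (gen_pool_add(gpool, (unsigned long)buf, records << order, -1)) {
                kfree(buf);
                gen_pool_destroy(gpool);
                return -ENOMEM;
        }

        mce_evt_pool = gpool;
        return 0;
}
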

Patch

diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
index fbe8b61c3413..870dcb7011cd 100644
--- a/arch/x86/kernel/cpu/mce/genpool.c
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -17,9 +17,9 @@ 
  *
  * This memory pool is only to be used to save MCE records in MCE context.
  * MCE events are rare, so a fixed size memory pool should be enough. Use
- * 2 pages to save MCE events for now (~80 MCE records at most).
+ * 8 pages to save MCE events for now (~320 MCE records at most).
  */
-#define MCE_POOLSZ	(2 * PAGE_SIZE)
+#define MCE_POOLSZ	(8 * PAGE_SIZE)
 
 static struct gen_pool *mce_evt_pool;
 static LLIST_HEAD(mce_event_llist);