
[v2,2/5] x86/mce: dump error msg from severities

Message ID 20250217063335.22257-3-xueshuai@linux.alibaba.com (mailing list archive)
State New
Series mm/hwpoison: Fix regressions in memory failure handling

Commit Message

Shuai Xue Feb. 17, 2025, 6:33 a.m. UTC
The message in severities is useful for identifying the type of MCE that
has occurred; dump it if it is valid.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 arch/x86/kernel/cpu/mce/core.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Borislav Petkov Feb. 28, 2025, 12:37 p.m. UTC | #1
On Mon, Feb 17, 2025 at 02:33:32PM +0800, Shuai Xue wrote:
> The message in severities is useful for identifying the type of MCE that
> has occurred; dump it if it is valid.

Needs more explanation as to what "useful" means. We already log and report
MCEs in a gazillion ways.
Shuai Xue March 1, 2025, 6:16 a.m. UTC | #2
On 2025/2/28 20:37, Borislav Petkov wrote:
> On Mon, Feb 17, 2025 at 02:33:32PM +0800, Shuai Xue wrote:
>> The message in severities is useful for identifying the type of MCE that
>> has occurred; dump it if it is valid.
> 
> Needs more explanation as to what "useful" means. We already log and report
> MCEs in a gazillion ways.
> 

You are right.

While MCE information is indeed decoded by the EDAC and APEI drivers,
the decoding is limited to the interpretation of hardware registers
and lacks the contextual details of the error.
  
For instance, it does not specify whether the error occurred in the
context of IN_KERNEL or IN_KERNEL_RECOV, which are crucial for
understanding the error's circumstances.
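
For reference, the context classification is made in the severity code, roughly
like this (a simplified paraphrase of error_context() in
arch/x86/kernel/cpu/mce/severity.c; the exact checks differ across kernel
versions and are part of what this series touches):

    static enum context error_context(struct mce *m, struct pt_regs *regs)
    {
            if ((m->cs & 3) == 3)
                    return IN_USER;

            switch (ex_get_fixup_type(m->ip)) {
            case EX_TYPE_UACCESS:           /* faulting insn has a uaccess fixup */
                    if (!is_copy_from_user(regs))
                            return IN_KERNEL;
                    m->kflags |= MCE_IN_KERNEL_COPYIN;
                    fallthrough;
            case EX_TYPE_FAULT_MCE_SAFE:    /* insn explicitly marked MCE-safe */
                    m->kflags |= MCE_IN_KERNEL_RECOV;
                    return IN_KERNEL_RECOV;
            default:
                    return IN_KERNEL;
            }
    }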

For the regression cases (copy from user) in Patch 3, an error message

     "mce: Action required: data load in error recoverable area of kernel"

will be added if this patch is applied.

I could add more explanation in the next version if you have no objection.

Thanks.
Shuai
Borislav Petkov March 1, 2025, 11:10 a.m. UTC | #3
On Sat, Mar 01, 2025 at 02:16:12PM +0800, Shuai Xue wrote:
> For instance, it does not specify whether the error occurred in the
> context of IN_KERNEL or IN_KERNEL_RECOV, which are crucial for
> understanding the error's circumstances.

1. Crucial for whom? For you? Or for users?

You need to explain how this error message is going to be used. Because simply
issuing such a message causes a lot of panicked people calling a lot of admins
to figure out why their machine is broken. Because they see "mce" and think
"hw broken, need to replace it immediately."

This is one of the reasons we did the cec.c thing - just to save people from
panicking unnecessarily and causing expensive and useless maintenance calls.

2. This message goes to dmesg which means something needs to parse it, besides
   a human. An AI?

3. Dmesg is a ring buffer which gets overwritten and this message is
   eventually lost

There's a reason why MCEs get logged with the notifiers and through
a tracepoint - so that agents can act upon them properly.
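
As a minimal illustration (assuming tracefs is mounted at /sys/kernel/tracing),
an agent can enable the mce_record event and stream the structured records
directly; rasdaemon does the equivalent through libtraceevent:

    /* Sketch only: enable the mce/mce_record tracepoint and stream records. */
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[4096];
            ssize_t n;
            int fd;

            fd = open("/sys/kernel/tracing/events/mce/mce_record/enable", O_WRONLY);
            if (fd < 0)
                    return 1;
            write(fd, "1", 1);
            close(fd);

            /* trace_pipe blocks and yields records as they arrive */
            fd = open("/sys/kernel/tracing/trace_pipe", O_RDONLY);
            if (fd < 0)
                    return 1;
            while ((n = read(fd, buf, sizeof(buf))) > 0)
                    write(STDOUT_FILENO, buf, n);
            return 0;
    }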

And we have had this discussion for years now - I'm sorry that you're late to
the party.

> For the regression cases (copy from user) in Patch 3, an error message
> 
>     "mce: Action required: data load in error recoverable area of kernel"

See above.

Besides, this message is completely useless as it has no concrete info about
the error and what is being done about it.

> I could add more explanation in the next version if you have no objection.

All of the above are objections.

Please go into git history and read why we're avoiding dumping useless
messages instead of proposing silly patches.
Shuai Xue March 1, 2025, 2:03 p.m. UTC | #4
On 2025/3/1 19:10, Borislav Petkov wrote:
> On Sat, Mar 01, 2025 at 02:16:12PM +0800, Shuai Xue wrote:
>> For instance, it does not specify whether the error occurred in the
>> context of IN_KERNEL or IN_KERNEL_RECOV, which are crucial for
>> understanding the error's circumstances.
> 
> 1. Crucial for whom? For you? Or for users?
> 
> You need to explain how this error message is going to be used. Because simply
> issuing such a message causes a lot of panicked people calling a lot of admins
> to figure out why their machine is broken. Because they see "mce" and think
> "hw broken, need to replace it immediately."
> 
> This is one of the reasons we did the cec.c thing - just to save people from
> panicking unnecessarily and causing expensive and useless maintenance calls.


For me, and for cloud providers which maintain millions of servers.

(By the way, CentOS/Red Hat build their kernels without CONFIG_RAS_CEC set because
it breaks EDAC decoding. We do not use CEC in production at all, for the same
reason.)

> 
> 2. This message goes to dmesg which means something needs to parse it, besides
>     a human. An AI?

Yes, we collect all kernel messages from the hosts, parse the logs, and predict
panics with AI tools. The more details we collect, the better the AI model
performs.

> 
> 3. Dmesg is a ring buffer which gets overwritten and this message is
>     eventually lost
> 
> There's a reason why MCEs get logged with the notifiers and through
> a tracepoint - so that agents can act upon them properly.
> 
> And we have had this discussion for years now - I'm sorry that you're late to
> the party.

Agreed, the tracepoint is a more elegant way. However, it does not include the
error context, just some hardware registers.

> 
>> For the regression cases (copy from user) in Patch 3, an error message
>>
>>      "mce: Action required: data load in error recoverable area of kernel"
> 
> See above.
> 
> Besides, this message is completely useless as it has no concrete info about
> the error and what is being done about it.

I don't think so,

"Action required" means MCI_UC_AR
"data load" means MCACOD_DATA
"recoverable area of kernel" means KERNEL_RECOV

It is more readable and concrete than "Uncorrected hardware memory error", e.g.
the message in kill_me_maybe():

     "mce: Uncorrected hardware memory error in user-access at 3b116c400"

> 
>> I could add more explanation in the next version if you have no objection.
> 
> All of the above are objections.
> 
> Please go into git history and read why we're avoiding dumping useless
> messages instead of proposing silly patches.
> 

Anyway, I respect the maintainer's opinion.

Thanks
Shuai
Borislav Petkov March 1, 2025, 6:47 p.m. UTC | #5
On Sat, Mar 01, 2025 at 10:03:13PM +0800, Shuai Xue wrote:
> (By the way, CentOS/Red Hat build their kernels without CONFIG_RAS_CEC set because
> it breaks EDAC decoding. We do not use CEC in production at all, for the same
> reason.)

It doesn't "break" error decoding - it collects every correctable DRAM error
and puts it in a "leaky" bucket of sorts. And when a certain error address
generates too many errors, it memory_failure()s the page and poisons it.

You do not use it in production because you want to see every error, collect
it, massage it and perhaps decide when DIMMs go bad and you can replace
them... or whatever you do.

All the others who enable it can sleep properly, without getting
unnecessarily upset about a correctable error.

> Yes, we collect all kernel messages from the hosts, parse the logs, and predict
> panics with AI tools. The more details we collect, the better the AI model
> performs.

LOL.

We go to the great effort of adding an MCE tracepoint which gives a *structured*
error record, show an example of how to use it in rasdaemon, and you go and do
the crazy hard and, at the same time, silly thing of parsing dmesg?!??!

This is priceless. Oh boy.

> Agreed, the tracepoint is a more elegant way. However, it does not include the
> error context, just some hardware registers.

The error context is in the behavior of the hw. If the error is fatal, you
won't see it - the machine will panic or do something else to prevent error
propagation. It definitely won't run any software anymore.

If you see the error getting logged, it means it is not fatal enough to kill
the machine.

> > Besides, this message is completely useless as it has no concrete info about
> > the error and what is being done about it.
> 
> I don't think so,

I think so and you're not reading my mail.

>     "mce: Uncorrected hardware memory error in user-access at 3b116c400"

Ask yourself: what can you do when you see a message like that?

Exactly *nothing* because there's not nearly enough information to recover
from it or log it or whatever. That error message is *totally useless* and
you're upsetting your users unnecessarily and even if they report it to you,
you can't help them.
Shuai Xue March 2, 2025, 7:14 a.m. UTC | #6
On 2025/3/2 02:47, Borislav Petkov wrote:
> On Sat, Mar 01, 2025 at 10:03:13PM +0800, Shuai Xue wrote:
>> (By the way, CentOS/Red Hat build their kernels without CONFIG_RAS_CEC set because
>> it breaks EDAC decoding. We do not use CEC in production at all, for the same
>> reason.)
> 
> It doesn't "break" error decoding - it collects every correctable DRAM error
> and puts it in a "leaky" bucket of sorts. And when a certain error address
> generates too many errors, it memory_failure()s the page and poisons it.
> 
> You do not use it in production because you want to see every error, collect
> it, massage it and perhaps decide when DIMMs go bad and you can replace
> them... or whatever you do.
> 
> All the others who enable it can sleep properly, without getting
> unnecessarily upset about a correctable error.

Yes, we want to see every CE and use the CE patterns (e.g. correctable
error bits)[1][2] to predict whether a row fault is prone to UEs or not.
And we are not upset by CEs, because they have been corrected by hardware :)

[1]https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fault-aware-prediction-guide.pdf
[2]https://arxiv.org/html/2312.02855v2

> 
>> Yes, we collect all kernel messages from the hosts, parse the logs, and predict
>> panics with AI tools. The more details we collect, the better the AI model
>> performs.
> 
> LOL.
> 
> We go to the great effort of adding an MCE tracepoint which gives a *structured*
> error record, show an example of how to use it in rasdaemon, and you go and do
> the crazy hard and, at the same time, silly thing of parsing dmesg?!??!
> 
> This is priceless. Oh boy.
> 
>> Agreed, the tracepoint is a more elegant way. However, it does not include the
>> error context, just some hardware registers.
> 
> The error context is in the behavior of the hw. If the error is fatal, you
> won't see it - the machine will panic or do something else to prevent error
> propagation. It definitely won't run any software anymore.
> 
> If you see the error getting logged, it means it is not fatal enough to kill
> the machine.

Agreed.

> 
>>> Besides, this message is completely useless as it has no concrete info about
>>> the error and what is being done about it.
>>
>> I don't think so,
> 
> I think so and you're not reading my mail.
> 
>>      "mce: Uncorrected hardware memory error in user-access at 3b116c400"

It is the current message in kill_me_maybe(), not added by me.

> 
> Ask yourself: what can you do when you see a message like that?
> 
> Exactly *nothing* because there's not nearly enough information to recover
> from it or log it or whatever. That error message is *totally useless* and
> you're upsetting your users unnecessarily and even if they report it to you,
> you can't help them.
> 

I believe we are approaching this issue from different perspectives.
As a cloud service provider, I need to address the following points:

1. I must be able to explain to end users why the MCE has occurred.
2. It is important to determine whether there are any kernel bugs that could
    compromise the overall stability of the cloud platform.
3. We need to identify and implement potential improvements.

"mce: Uncorrected hardware memory error in user-access at 3b116c400"

tells me *nothing*, but

"mce: Action required: data load in error recoverable area of kernel"

does help.


Thanks for your time.
Shuai
Borislav Petkov March 2, 2025, 7:37 a.m. UTC | #7
On Sun, Mar 02, 2025 at 03:14:52PM +0800, Shuai Xue wrote:
> > >      "mce: Uncorrected hardware memory error in user-access at 3b116c400"
> 
> It is the current message in kill_me_maybe(), not added by me.

Doesn't change the fact that it is not really helpful when it comes to logging
all errors properly.

  [ Properly means using a structured log format with the tracepoint and not
    dumping it into dmesg. ]

And figuring out what hw is failing so that it can be replaced. No one has
come up with a real need for making it better, more useful.

You're coming with what I think is such a need and I'm trying to explain to
you what needs to be done. But you want to feed your AI with dmesg and solve
it this way.

If you wanna do it right, we can talk. Otherwise, have fun.

> 3. We need to identify and implement potential improvements.
> 
> "mce: Uncorrected hardware memory error in user-access at 3b116c400"
> 
> tells me *nothing*, but
> 
> "mce: Action required: data load in error recoverable area of kernel"
> 
> does help.

I don't think you've read what I wrote but that's ok. If you think it helps,
you can keep it in your kernels.
Shuai Xue March 2, 2025, 9:13 a.m. UTC | #8
On 2025/3/2 15:37, Borislav Petkov wrote:
> On Sun, Mar 02, 2025 at 03:14:52PM +0800, Shuai Xue wrote:
>>>>       "mce: Uncorrected hardware memory error in user-access at 3b116c400"
>>
>> It is the current message in kill_me_maybe(), not added by me.
> 
> Doesn't change the fact that it is not really helpful when it comes to logging
> all errors properly.
> 
>    [ Properly means using a structured log format with the tracepoint and not
>      dumping it into dmesg. ]
> 
> And figuring out what hw is failing so that it can be replaced. No one has
> come up with a real need for making it better, more useful.
> 
> You're coming with what I think is such a need and I'm trying to explain to
> you what needs to be done. But you want to feed your AI with dmesg and solve
> it this way.
> 
> If you wanna do it right, we can talk. Otherwise, have fun.

I see. So I am just curious: why do we define `msg` in `severities`?

I prefer the structured log format with the tracepoint, and we do use it in
production, but it lacks the process context.

The AMD folks added error messages for panic errors[1] to help debug cases
which the EDAC driver is not able to decode.

For non-fatal errors, is it reasonable to assume that all users are using
tracepoint-based tools like rasdaemon?

[1]https://lore.kernel.org/all/20220405183212.354606-1-carlos.bilbao@amd.com/

> 
>> 3. We need to identify and implement potential improvements.
>>
>> "mce: Uncorrected hardware memory error in user-access at 3b116c400"
>>
>> tells me *nothing*, but
>>
>> "mce: Action required: data load in error recoverable area of kernel"
>>
>> does help.
> 
> I don't think you've read what I wrote but that's ok. If you think it helps,
> you can keep it in your kernels.
> 

Fine, I could drop patches 1 and 2 in the next version.

Thanks.
Shuai
Luck, Tony March 3, 2025, 4:49 p.m. UTC | #9
> The error context is in the behavior of the hw. If the error is fatal, you
> won't see it - the machine will panic or do something else to prevent error
> propagation. It definitely won't run any software anymore.
>
> If you see the error getting logged, it means it is not fatal enough to kill
> the machine.

One place in the fatal case where I would like to see more information is the

  "Action required: data load in error *UN*recoverable area of kernel"

[emphasis on the "UN" added].

case.  We have a few places where the kernel does recover. And most places
we crash. Our code for the recoverable cases is fragile. Most of this series is
about repairing regressions where we used to recover from places where kernel
is doing get_user() or copy_from_user() which can be recovered if those places
get an error return and the kernel kills the process instead of crashing.

A long time ago I posted some patches to include a stack trace for this type
of crash. It didn't make it into the kernel, and I got distracted by other things.

If we had that, it would have been easier to diagnose this regression (Shuai
Xue would have seen crashes with a stack trace pointing to code that used
to recover in older kernels). Folks with big clusters would also be able to
point out other places where the kernel crashes often enough that additional
EXTABLE recovery paths would be worth investigating.

So:

1) We need to fix the regressions. That just needs new commit messages
for these patches that explain the issue better.

2) I'd like to see a patch for a stack trace for the unrecoverable case.

3) I don't see much value in a message that reports the recoverable case.

Yazen: At one point I think you said you were looking at adding additional
decorations to the return value from mce_severity() to indicate actions
needed for recoverable errors (kill the process, offline the page) rather
than have do_machine_check() figure it out by looking at various fields
in the "struct mce". Did that go anywhere? Those extra details might be
interesting in the tracepoint.

-Tony
Yazen Ghannam March 3, 2025, 6:08 p.m. UTC | #10
On Mon, Mar 03, 2025 at 04:49:25PM +0000, Luck, Tony wrote:
> > The error context is in the behavior of the hw. If the error is fatal, you
> > won't see it - the machine will panic or do something else to prevent error
> > propagation. It definitely won't run any software anymore.
> >
> > If you see the error getting logged, it means it is not fatal enough to kill
> > the machine.
> 
> One place in the fatal case where I would like to see more information is the
> 
>   "Action required: data load in error *UN*recoverable area of kernel"
> 
> [emphasis on the "UN" added].
> 
> case.  We have a few places where the kernel does recover. And most places
> we crash. Our code for the recoverable cases is fragile. Most of this series is
> about repairing regressions where we used to recover from places where kernel
> is doing get_user() or copy_from_user() which can be recovered if those places
> get an error return and the kernel kills the process instead of crashing.
> 
> A long time ago I posted some patches to include a stack trace for this type
> of crash. It didn't make it into the kernel, and I got distracted by other things.
> 
> If we had that, it would have been easier to diagnose this regression (Shuai
> Xue would have seen crashes with a stack trace pointing to code that used
> to recover in older kernels). Folks with big clusters would also be able to
> point out other places where the kernel crashes often enough that additional
> EXTABLE recovery paths would be worth investigating.
> 
> So:
> 
> 1) We need to fix the regressions. That just needs new commit messages
> for these patches that explain the issue better.
> 
> 2) I'd like to see a patch for a stack trace for the unrecoverable case.
> 
> 3) I don't see much value in a message that reports the recoverable case.
> 
> Yazen: At one point I think you said you were looking at adding additional
> decorations to the return value from mce_severity() to indicate actions
> needed for recoverable errors (kill the process, offline the page) rather
> than have do_machine_check() figure it out by looking at various fields
> in the "struct mce". Did that go anywhere? Those extra details might be
> interesting in the tracepoint.
> 

Hi Tony,

Yes, I have a patch here:
https://github.com/AMDESE/linux/commit/cf0b8a97240abf0fbd98a91cd8deb262f827721b

Branch:
https://github.com/AMDESE/linux/commits/wip-mca/

This work is at the tail-end of a lot of other refactoring. But it can
be prioritized if there's interest. Most of the dependencies have
already been merged.

Thanks,
Yazen
Shuai Xue March 5, 2025, 1:50 a.m. UTC | #11
On 2025/3/4 00:49, Luck, Tony wrote:
>> The error context is in the behavior of the hw. If the error is fatal, you
>> won't see it - the machine will panic or do something else to prevent error
>> propagation. It definitely won't run any software anymore.
>>
>> If you see the error getting logged, it means it is not fatal enough to kill
>> the machine.
> 
> One place in the fatal case where I would like to see more information is the
> 
>    "Action required: data load in error *UN*recoverable area of kernel"
> 
> [emphasis on the "UN" added].

Do you mean this one?

     MCESEV(
         PANIC, "Data load in unrecoverable area of kernel",
         SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
         KERNEL
        ),


> 
> case.  We have a few places where the kernel does recover. And most places
> we crash. Our code for the recoverable cases is fragile. Most of this series is
> about repairing regressions where we used to recover from places where kernel
> is doing get_user() or copy_from_user() which can be recovered if those places
> get an error return and the kernel kills the process instead of crashing.

I can’t agree with you more.


> A long time ago I posted some patches to include a stack trace for this type
> of crash. It didn't make it into the kernel, and I got distracted by other things.
> 
> If we had that, it would have been easier to diagnose this regression (Shuai
> Xue would have seen crashes with a stack trace pointing to code that used
> to recover in older kernels). Folks with big clusters would also be able to
> point out other places where the kernel crashes often enough that additional
> EXTABLE recovery paths would be worth investigating.

Agreed, a stack trace will be helpful for debugging unrecoverable cases.
The current panic message is below:

[ 1879.726794] mce: [Hardware Error]: CPU 178: Machine Check Exception: f Bank 1: bd80000000100134
[ 1879.726798] mce: [Hardware Error]: RIP 10:<ffffffff981d7af3> {futex_wait_setup+0x83/0xf0}
[ 1879.726807] mce: [Hardware Error]: TSC 49a1e6001c1 ADDR 80f7ada400 MISC 86 PPIN fc6b80e0ba9d616
[ 1879.726809] mce: [Hardware Error]: PROCESSOR 0:806f4 TIME 1741091252 SOCKET 1 APIC c5 microcode 2b000571
[ 1879.726811] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 1879.726813] mce: [Hardware Error]: Machine check events logged
[ 1879.727166] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
[ 1879.727168] Kernel panic - not syncing: Fatal local machine check


It only provides a RIP, and I spent a lot of time figuring out the root cause of
why get_user() and copy_from_user() fail in the upstream kernel.

> 
> So:
> 
> 1) We need to fix the regressions. That just needs new commit messages
> for these patches that explain the issue better.

I will polish the commit messages.

> 
> 2) I'd like to see a patch for a stack trace for the unrecoverable case.

Could you provide any reference link to your previous patch?

> 
> 3) I don't see much value in a message that reports the recoverable case.
> 

Got it.

Thanks
Shuai
Luck, Tony March 5, 2025, 4:16 p.m. UTC | #12
>> 2) I'd like to see a patch for a stack trace for the unrecoverable case.
>
> Could you provide any reference link to your previous patch?

https://lore.kernel.org/all/20220922195136.54575-1-tony.luck@intel.com/

-Tony
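
(The linked patch is about printing a call stack for the fatal in-kernel case;
conceptually it is a small addition along the lines of the sketch below, which
is illustrative only - the helper choice and the call site are assumptions, not
taken from that patch.)

    /*
     * Illustrative sketch: print the interrupted kernel context before the
     * "Fatal local machine check" panic so the log carries a backtrace in
     * addition to the RIP.
     */
    static void mce_show_kernel_context(struct pt_regs *regs)
    {
            if (regs && !user_mode(regs))
                    show_regs(regs);        /* registers plus a stack trace */
    }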
Luck, Tony March 5, 2025, 10:33 p.m. UTC | #13
>>> 2) I'd like to see a patch for a stack trace for the unrecoverable case.
>>
>> Could you provide any reference link to your previous patch?
>
> https://lore.kernel.org/all/20220922195136.54575-1-tony.luck@intel.com/

Yazen has some cleanups to the severity code under consideration
here:

https://github.com/AMDESE/linux/commit/cf0b8a97240abf0fbd98a91cd8deb262f827721b


I wonder if a slightly different approach with the "mce_action" that Yazen
defines being a bitmask of options, instead of an enum, would be possible?

If that happened, then adding a "dump the call stack to the console" option
would fit neatly into the scheme.

-Tony
Yazen Ghannam March 6, 2025, 3:58 p.m. UTC | #14
On Wed, Mar 05, 2025 at 10:33:04PM +0000, Luck, Tony wrote:
> >>> 2) I'd like to see a patch for a stack trace for the unrecoverable case.
> >>
> >> Could you provide any reference link to your previous patch?
> >
> > https://lore.kernel.org/all/20220922195136.54575-1-tony.luck@intel.com/
> 
> Yazen has some cleanups to the severity code under consideration
> here:
> 
> https://github.com/AMDESE/linux/commit/cf0b8a97240abf0fbd98a91cd8deb262f827721b
> 
> 
> I wonder if a slightly different approach with the "mce_action" that Yazen
> defines being a bitmask of options, instead of an enum, would be possible?
> 
> If that happened, then adding a "dump the call stack to the console" option
> would fit neatly into the scheme.
> 

Right, I was thinking something similar. Basically, replace the single
action (enum) with multiple possible actions (flags/bitmask). And we can
have a mask for a "class" of actions for general cases.
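
A hypothetical sketch of what such a bitmask could look like (the names are
illustrative only, not taken from the branch above):

    /* Possible action flags returned by mce_severity(); purely illustrative. */
    #define MCE_ACTION_KILL_PROC    BIT(0)  /* kill the affected process */
    #define MCE_ACTION_OFFLINE_PAGE BIT(1)  /* poison and offline the page */
    #define MCE_ACTION_DUMP_STACK   BIT(2)  /* dump the call stack to the console */
    #define MCE_ACTION_PANIC        BIT(3)  /* not recoverable, panic */

    /* A "class" mask covering the usual recoverable kernel case: */
    #define MCE_ACTIONS_KERNEL_RECOV (MCE_ACTION_KILL_PROC | MCE_ACTION_OFFLINE_PAGE)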

Thanks,
Yazen

Patch

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index f2e730d4acc5..c1b945dbccdf 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1456,6 +1456,8 @@  static void queue_task_work(struct mce_hw_err *err, char *msg, void (*func)(stru
 	if (count > 1)
 		return;
 
+	if (msg)
+		pr_err("%s\n", msg);
 	task_work_add(current, &current->mce_kill_me, TWA_RESUME);
 }