x86/mce/amd: init mce severity to handle deferred memory failure

Message ID 20230425121829.61755-1-xueshuai@linux.alibaba.com (mailing list archive)
State New, archived
Series x86/mce/amd: init mce severity to handle deferred memory failure

Commit Message

Shuai Xue April 25, 2023, 12:18 p.m. UTC
When a deferred UE error is detected, e.g. by the background patrol scrubber,
it is handled in the APIC interrupt handler amd_deferred_error_interrupt().
The handler collects the MCA banks, initializes the mce struct and processes
it by notifying the registered MCE decode chain.

The uc_decode_notifier, one of the notifiers on the MCE decode chain, handles
memory failures, but only for MCE_AO_SEVERITY and MCE_DEFERRED_SEVERITY.
However, the APIC interrupt handler does not initialize the mce severity, so
it stays at 0 (MCE_NO_SEVERITY) and the notifier skips the error.

To handle the deferred memory failure case, initialize the mce severity when
logging MCA banks.
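
The effect of the missing initialization can be sketched with a small
userspace model. This is not the kernel code: the struct, the enum, and the
helpers model_severity() and notifier_would_handle() are hypothetical names
made up for illustration, and the severity rule is reduced to the deferred
case only.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified stand-ins for the kernel severity levels. Only 0 and 1 are
 * taken from the commit message and logs; the AO value is arbitrary here.
 */
enum model_severity_level {
	MODEL_NO_SEVERITY = 0,
	MODEL_DEFERRED_SEVERITY = 1,
	MODEL_AO_SEVERITY,
};

/* Bit 44 of MCA_STATUS: the "Deferred" flag shown in the decoded log. */
#define MODEL_STATUS_DEFERRED (1ULL << 44)

struct model_mce {
	uint64_t status;
	int severity;	/* stays 0 unless something fills it in */
};

/* Reduced severity rule: only the deferred case is modelled. */
static int model_severity(const struct model_mce *m)
{
	if (m->status & MODEL_STATUS_DEFERRED)
		return MODEL_DEFERRED_SEVERITY;
	return MODEL_NO_SEVERITY;
}

/* Filter as described above: only AO and deferred severities pass. */
static bool notifier_would_handle(const struct model_mce *m)
{
	return m->severity == MODEL_AO_SEVERITY ||
	       m->severity == MODEL_DEFERRED_SEVERITY;
}
```

With the status value from the logs (0x942031000400011b) and severity left
at 0, notifier_would_handle() returns false; after assigning the result of
model_severity() it returns true, which is the behaviour change the patch
is after.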

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>

---
Steps to reproduce:

step 1: inject a patrol scrub error with ras-tools
#einj_mem_uc patrol

step 2: check dmesg, no memory failure log
#dmesg -c
[51295.686806] mce: [Hardware Error]: Machine check events logged
[51295.693566] mce->status: 0x942031000400011b
[51295.698248] mce->misc: 0x00000000
[51295.701952] mce->severity: 0x00000000	# Manually added printk  
[51295.726640] [Hardware Error]: Deferred error, no action required.
[51295.733448] [Hardware Error]: CPU:65 (19:11:1) MC21_STATUS[-|-|-|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0x942031000400011b
[51295.733452] [Hardware Error]: Error Addr: 0x0000000006350a00
[51295.733453] [Hardware Error]: PPIN: 0x02b69e294c148024
[51295.733453] [Hardware Error]: IPID: 0x0000109600250f00, Syndrome: 0x9a4a00000b800000
[51295.733455] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[51295.733463] mce: umc_normaddr_to_sysaddr: Invalid DramBaseAddress range: 0x0.
[51295.733471] EDAC MC0: 1 UE Cannot decode normalized address on mc#0csrow#0channel#2 (csrow:0 channel:2 page:0x0 offset:0x0 grain:64)
[51295.733471] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
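
The flags in the MC21_STATUS line can be cross-checked against the raw
status value 0x942031000400011b. A minimal sketch follows; the ST_* macro
names and status_has() are hypothetical, and the bit positions are my
reading of the AMD MCA_STATUS layout, so treat them as assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed MCA_STATUS bit positions for the flags seen in the log. */
#define ST_VAL      (1ULL << 63)	/* entry is valid */
#define ST_UC       (1ULL << 61)	/* uncorrected, not deferred */
#define ST_ADDRV    (1ULL << 58)	/* MCA_ADDR holds an address */
#define ST_PCC      (1ULL << 57)	/* processor context corrupt */
#define ST_SYNDV    (1ULL << 53)	/* syndrome register is valid */
#define ST_UECC     (1ULL << 45)	/* uncorrectable ECC error */
#define ST_DEFERRED (1ULL << 44)	/* deferred error */
#define ST_SCRUB    (1ULL << 40)	/* found by the scrubber */

/* Return true when the given flag is set in the status word. */
static bool status_has(uint64_t status, uint64_t flag)
{
	return (status & flag) != 0;
}
```

For the logged value, AddrV, SyndV, UECC, Deferred and Scrub are set while
UC and PCC are clear, which matches the decoded line above.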

After this fix:

[  514.966892] mce: [Hardware Error]: Machine check events logged
[  514.966912] mce->status: 0x942031000400011b
[  514.978093] mce->misc: 0x00000000
[  514.981796] mce->severity: 0x00000001
[  514.985885] <uc_decode_notifier> pre_handler: p->addr = 0x00000000e09e69e4, ip = ffffffff8104b955, flags = 0x282
[  514.997253] <uc_decode_notifier> post_handler: p->addr = 0x00000000e09e69e4, flags = 0x282
[  515.006501] Memory failure: 0x5dc2: recovery action for free buddy page: Recovered
[  515.015188] [Hardware Error]: Deferred error, no action required.
[  515.022006] [Hardware Error]: CPU:67 (19:11:1) MC21_STATUS[-|-|-|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0x942031000400011b
[  515.034440] [Hardware Error]: Error Addr: 0x0000000005dc2a00
[  515.034442] [Hardware Error]: PPIN: 0x02b69e294c148024
[  515.034443] [Hardware Error]: IPID: 0x0000109600650f00, Syndrome: 0x9a4a00000b800008
[  515.034445] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[  515.034453] umc_normaddr_to_sysaddr: Invalid DramBaseAddress range: 0x0.
[  515.034458] EDAC MC1: 1 UE Cannot decode normalized address on mc#1csrow#0channel#6 (csrow:0 channel:6 page:0x0 offset:0x0 grain:64)
[  515.034461] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

Note: memory_failure() handles the wrong physical address because
umc_normaddr_to_sysaddr() fails. I have not figured out why it fails.
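
The page number in the "Memory failure: 0x5dc2" line can be reproduced from
the logged "Error Addr: 0x0000000005dc2a00" with a plain page shift, which
suggests the untranslated address was used as if it were a system physical
address. A minimal sketch, assuming 4 KiB pages; MODEL_PAGE_SHIFT and
model_addr_to_pfn() are hypothetical names, and the kernel may additionally
mask the address before shifting.

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Derive a page frame number from an address with a plain shift. */
static uint64_t model_addr_to_pfn(uint64_t addr)
{
	return addr >> MODEL_PAGE_SHIFT;
}
```

Dropping the low 12 bits of 0x5dc2a00 removes the trailing "a00" and leaves
0x5dc2, the page reported as recovered.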
---
 arch/x86/kernel/cpu/mce/amd.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Yazen Ghannam May 9, 2023, 2:25 p.m. UTC | #1
On 4/25/23 8:18 AM, Shuai Xue wrote:
> When a deferred UE error is detected, e.g. by the background patrol scrubber,
> it is handled in the APIC interrupt handler amd_deferred_error_interrupt().
> The handler collects the MCA banks, initializes the mce struct and processes
> it by notifying the registered MCE decode chain.
> 
> The uc_decode_notifier, one of the notifiers on the MCE decode chain, handles
> memory failures, but only for MCE_AO_SEVERITY and MCE_DEFERRED_SEVERITY.
> However, the APIC interrupt handler does not initialize the mce severity, so
> it stays at 0 (MCE_NO_SEVERITY) and the notifier skips the error.
> 
> To handle the deferred memory failure case, initialize the mce severity when
> logging MCA banks.
> 
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>

Hi Shuai Xue,

I think this patch is fair to do. But it won't have the intended effect
in practice.

The value in MCA_ADDR for DRAM ECC errors will be a memory controller
"normalized address". This is not a system physical address that the OS
can use to take action.

The mce_usable_address() function needs to be updated to handle this.
I'll send a patchset this week to do so. Afterwards, the
uc_decode_notifier will not attempt to handle these errors.

Thanks,
Yazen
Shuai Xue May 10, 2023, 2:17 a.m. UTC | #2
On 2023/5/9 22:25, Yazen Ghannam wrote:
> On 4/25/23 8:18 AM, Shuai Xue wrote:
>> When a deferred UE error is detected, e.g. by the background patrol scrubber,
>> it is handled in the APIC interrupt handler amd_deferred_error_interrupt().
>> The handler collects the MCA banks, initializes the mce struct and processes
>> it by notifying the registered MCE decode chain.
>>
>> The uc_decode_notifier, one of the notifiers on the MCE decode chain, handles
>> memory failures, but only for MCE_AO_SEVERITY and MCE_DEFERRED_SEVERITY.
>> However, the APIC interrupt handler does not initialize the mce severity, so
>> it stays at 0 (MCE_NO_SEVERITY) and the notifier skips the error.
>>
>> To handle the deferred memory failure case, initialize the mce severity when
>> logging MCA banks.
>>
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>>
> 
> Hi Shuai Xue,
> 
> I think this patch is fair to do. But it won't have the intended effect
> in practice.
> 
> The value in MCA_ADDR for DRAM ECC errors will be a memory controller
> "normalized address". This is not a system physical address that the OS
> can use to take action.
> 
> The mce_usable_address() function needs to be updated to handle this.
> I'll send a patchset this week to do so. Afterwards, the
> uc_decode_notifier will not attempt to handle these errors.

From the experience of other platforms (e.g. ARM64 RAS and Intel MCA),
uc_decode_notifier should handle these errors to hard offline the corrupted
page. If the corrupted page is a free buddy page, we can isolate it and avoid
using the page in the future.

In my test case, the error is detected by the patrol scrubber in the memory
controller. The scrubber may lack a system address space perspective and only
report a "normalized address". But we can decode the "normalized address" to a
system address with EDAC (umc_normaddr_to_sysaddr), right?

(I am not very familiar with AMD RAS, please correct me if I am wrong)

> 
> Thanks,
> Yazen

Thank you.

Best Regards,
Shuai
Yazen Ghannam May 10, 2023, 1:59 p.m. UTC | #3
On 5/9/23 10:17 PM, Shuai Xue wrote:
> 
> 
> On 2023/5/9 22:25, Yazen Ghannam wrote:
>> On 4/25/23 8:18 AM, Shuai Xue wrote:
>>> When a deferred UE error is detected, e.g. by the background patrol scrubber,
>>> it is handled in the APIC interrupt handler amd_deferred_error_interrupt().
>>> The handler collects the MCA banks, initializes the mce struct and processes
>>> it by notifying the registered MCE decode chain.
>>>
>>> The uc_decode_notifier, one of the notifiers on the MCE decode chain, handles
>>> memory failures, but only for MCE_AO_SEVERITY and MCE_DEFERRED_SEVERITY.
>>> However, the APIC interrupt handler does not initialize the mce severity, so
>>> it stays at 0 (MCE_NO_SEVERITY) and the notifier skips the error.
>>>
>>> To handle the deferred memory failure case, initialize the mce severity when
>>> logging MCA banks.
>>>
>>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>>>
>>
>> Hi Shuai Xue,
>>
>> I think this patch is fair to do. But it won't have the intended effect
>> in practice.
>>
>> The value in MCA_ADDR for DRAM ECC errors will be a memory controller
>> "normalized address". This is not a system physical address that the OS
>> can use to take action.
>>
>> The mce_usable_address() function needs to be updated to handle this.
>> I'll send a patchset this week to do so. Afterwards, the
>> uc_decode_notifier will not attempt to handle these errors.
> 
> From the experience of other platforms (e.g. ARM64 RAS and Intel MCA),
> uc_decode_notifier should handle these errors to hard offline the corrupted
> page. If the corrupted page is a free buddy page, we can isolate it and avoid
> using the page in the future.
> 
> In my test case, the error is detected by the patrol scrubber in the memory
> controller. The scrubber may lack a system address space perspective and only
> report a "normalized address". But we can decode the "normalized address" to a
> system address with EDAC (umc_normaddr_to_sysaddr), right?
> 
> (I am not very familiar with AMD RAS, please correct me if I am wrong)
>

Yes, that's correct.

The address translation requires some updates that are still in-review.
Afterwards, we can investigate ways to use the translated address. It
may require some rework in the MCE notifier chain or, more simply,
calling memory_failure() from the EDAC module itself.

Thanks,
Yazen
Ruidong Tian April 18, 2024, 8:42 a.m. UTC | #4
AMD ATL has been merged upstream. Can we merge this patch to process
deferred errors with memory_failure()?

Thanks,
Ruidong
Yazen Ghannam April 18, 2024, 1:23 p.m. UTC | #5
On 4/18/24 04:42, Ruidong Tian wrote:
> 
> AMD ATL has been merged upstream. Can we merge this patch to process
> deferred errors with memory_failure()?
>

Hi Ruidong,

Thanks for the follow up.

This patch is made redundant by the following patch in review.
https://lore.kernel.org/linux-edac/20240404151359.47970-11-yazen.ghannam@amd.com/

Also, this is still not sufficient. The address translation still needs
to be invoked in order for memory_failure() to have a valid system
physical address.

Please see the following work-in-progress patch.
https://github.com/AMDESE/linux/commit/6ddd8e90d08edb4a2730ccd02981baef4645bb43

Thanks,
Yazen

Patch

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 23c5072fbbb7..b5e1a27b0881 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -734,6 +734,7 @@  static void __log_error(unsigned int bank, u64 status, u64 addr, u64 misc)
 	m.misc   = misc;
 	m.bank   = bank;
 	m.tsc	 = rdtsc();
+	m.severity = mce_severity(&m, NULL, NULL, false);
 
 	if (m.status & MCI_STATUS_ADDRV) {
 		m.addr = addr;