x86/mce/amd: init mce severity to handle deferred memory failure

When a deferred UE error is detected, e.g by background patrol scruber, it
will be handled in APIC interrupt handler amd_deferred_error_interrupt().
The handler will collect MCA banks, init mce struct and process it by
nofitying the registered MCE decode chain.

The uc_decode_notifier, one of MCE decode chain, will process memory
failure but only limit to MCE_AO_SEVERITY and MCE_DEFERRED_SEVERITY.
However, APIC interrupt handler does not init mce severity and the
uninitialized severity is 0 (MCE_NO_SEVERITY).

To handle the deferred memory failure case, init mce severity when logging
MCA banks.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>

---
Steps to reproduce:

step 1: inject a patrol scrub error by ras-tools
#einj_mem_uc patrol

step 2: check dmesg, no memory failure log
#dmesg -c
[51295.686806] mce: [Hardware Error]: Machine check events logged
[51295.693566] mce->status: 0x942031000400011b
[51295.698248] mce->misc: 0x00000000
[51295.701952] mce->severity: 0x00000000	# Manually added printk  
[51295.726640] [Hardware Error]: Deferred error, no action required.
[51295.733448] [Hardware Error]: CPU:65 (19:11:1) MC21_STATUS[-|-|-|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0x942031000400011b
[51295.733452] [Hardware Error]: Error Addr: 0x0000000006350a00
[51295.733453] [Hardware Error]: PPIN: 0x02b69e294c148024
[51295.733453] [Hardware Error]: IPID: 0x0000109600250f00, Syndrome: 0x9a4a00000b800000
[51295.733455] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[51295.733463] mce: umc_normaddr_to_sysaddr: Invalid DramBaseAddress range: 0x0.
[51295.733471] EDAC MC0: 1 UE Cannot decode normalized address on mc#0csrow#0channel#2 (csrow:0 channel:2 page:0x0 offset:0x0 grain:64)
[51295.733471] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

After this fix:

[  514.966892] mce: [Hardware Error]: Machine check events logged
[  514.966912] mce->status: 0x942031000400011b
[  514.978093] mce->misc: 0x00000000
[  514.981796] mce->severity: 0x00000001
[  514.985885] <uc_decode_notifier> pre_handler: p->addr = 0x00000000e09e69e4, ip = ffffffff8104b955, flags = 0x282
[  514.997253] <uc_decode_notifier> post_handler: p->addr = 0x00000000e09e69e4, flags = 0x282
[  515.006501] Memory failure: 0x5dc2: recovery action for free buddy page: Recovered
[  515.015188] [Hardware Error]: Deferred error, no action required.
[  515.022006] [Hardware Error]: CPU:67 (19:11:1) MC21_STATUS[-|-|-|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0x942031000400011b
[  515.034440] [Hardware Error]: Error Addr: 0x0000000005dc2a00
[  515.034442] [Hardware Error]: PPIN: 0x02b69e294c148024
[  515.034443] [Hardware Error]: IPID: 0x0000109600650f00, Syndrome: 0x9a4a00000b800008
[  515.034445] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[  515.034453] umc_normaddr_to_sysaddr: Invalid DramBaseAddress range: 0x0.
[  515.034458] EDAC MC1: 1 UE Cannot decode normalized address on mc#1csrow#0channel#6 (csrow:0 channel:6 page:0x0 offset:0x0 grain:64)
[  515.034461] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

Note, the memory_failure handles wrong physical address because
umc_normaddr_to_sysaddr fails. I don't figure out why it fails.
---
 arch/x86/kernel/cpu/mce/amd.c | 1 +
 1 file changed, 1 insertion(+)

Message ID	20230425121829.61755-1-xueshuai@linux.alibaba.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-edac-owner@vger.kernel.org> From: Shuai Xue <xueshuai@linux.alibaba.com> To: bp@alien8.de, tony.luck@intel.com Cc: tglx@linutronix.de, mingo@redhat.com, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, baolin.wang@linux.alibaba.com, linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH] x86/mce/amd: init mce severity to handle deferred memory failure Date: Tue, 25 Apr 2023 20:18:29 +0800 Message-Id: <20230425121829.61755-1-xueshuai@linux.alibaba.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	x86/mce/amd: init mce severity to handle deferred memory failure \| expand x86/mce/amd: init mce severity to handle deferred memory failure

x86/mce/amd: init mce severity to handle deferred memory failure

Commit Message

Comments

Patch