Message ID | 20250213-wip-mca-updates-v2-16-3636547fe05f@amd.com (mailing list archive)
---|---
State | New
Series | AMD MCA interrupts rework
> From: Yazen Ghannam <yazen.ghannam@amd.com>
> Sent: Friday, February 14, 2025 12:46 AM
> To: x86@kernel.org; Luck, Tony <tony.luck@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-edac@vger.kernel.org;
>     Smita.KoralahalliChannabasappa@amd.com; Yazen Ghannam <yazen.ghannam@amd.com>
> Subject: [PATCH v2 16/16] x86/mce: Handle AMD threshold interrupt storms
>
> From: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
>
> Extend the logic of handling CMCI storms to AMD threshold interrupts.
>
> Rely on an approach similar to Intel's CMCI to mitigate storms per CPU
> and per bank. But, unlike CMCI, do not adjust thresholds to reduce the
> interrupt rate during a storm. Instead, disable the interrupt on the
> corresponding CPU and bank. Re-enable the interrupt once enough
> consecutive polls of the bank show no corrected errors (30, as
> programmed by Intel).
>
> Turning off the threshold interrupt is a better solution on AMD
> systems, as other error severities will still be handled even if the
> threshold interrupts are disabled.
>
> [Tony: Small tweak because mce_handle_storm() isn't a pointer now]
> [Yazen: Rebase and simplify]
>
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>

LGTM.

Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
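[Editor's note] As background for the "30 consecutive polls" criterion mentioned above, here is a minimal sketch of the storm-end check, assuming the common storm-tracking code keeps a per-CPU, per-bank history bitmap in which each poll shifts in one bit (1 = that poll saw a corrected error). The constant and helper names below are illustrative and are not introduced by this patch:

	#include <linux/bits.h>
	#include <linux/types.h>

	/* Storm is considered over after this many consecutive error-free polls. */
	#define STORM_END_POLL_THRESHOLD	30

	static bool storm_has_subsided(u64 poll_history)
	{
		/*
		 * If none of the most recent 30 polls of this bank found a
		 * corrected error, the threshold interrupt can be turned back
		 * on, i.e. the vendor handler is called with on == true.
		 */
		return !(poll_history & GENMASK_ULL(STORM_END_POLL_THRESHOLD - 1, 0));
	}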
diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 404e0c38f9d8..a2d02f0c2153 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -1218,3 +1218,21 @@ void mce_threshold_create_device(unsigned int cpu)
 	mce_threshold_vector = amd_threshold_interrupt;
 	return;
 }
+
+void mce_amd_handle_storm(unsigned int bank, bool on)
+{
+	struct threshold_bank **thr_banks = this_cpu_read(threshold_banks);
+	struct threshold_block *block, *tmp;
+	struct thresh_restart tr;
+
+	if (!thr_banks || !thr_banks[bank])
+		return;
+
+	memset(&tr, 0, sizeof(tr));
+
+	list_for_each_entry_safe(block, tmp, &thr_banks[bank]->miscj, miscj) {
+		tr.b = block;
+		tr.b->interrupt_enable = on;
+		threshold_restart_bank(&tr);
+	}
+}
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index fe519acfafcf..9d771db2bcae 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -266,6 +266,7 @@ void mce_prep_record_per_cpu(unsigned int cpu, struct mce *m);
 
 #ifdef CONFIG_X86_MCE_AMD
 void mce_threshold_create_device(unsigned int cpu);
+void mce_amd_handle_storm(unsigned int bank, bool on);
 extern bool amd_filter_mce(struct mce *m);
 bool amd_mce_usable_address(struct mce *m);
 void amd_reset_thr_limit(unsigned int bank);
@@ -297,6 +298,7 @@ static __always_inline void smca_extract_err_addr(struct mce *m)
 void mce_smca_cpu_init(void);
 #else
 static inline void mce_threshold_create_device(unsigned int cpu)	{ }
+static inline void mce_amd_handle_storm(unsigned int bank, bool on)	{ }
 static inline bool amd_filter_mce(struct mce *m) { return false; }
 static inline bool amd_mce_usable_address(struct mce *m) { return false; }
 static inline void amd_reset_thr_limit(unsigned int bank)	{ }
diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
index f4a007616468..45144598ec74 100644
--- a/arch/x86/kernel/cpu/mce/threshold.c
+++ b/arch/x86/kernel/cpu/mce/threshold.c
@@ -63,6 +63,9 @@ static void mce_handle_storm(unsigned int bank, bool on)
 	case X86_VENDOR_INTEL:
 		mce_intel_handle_storm(bank, on);
 		break;
+	case X86_VENDOR_AMD:
+		mce_amd_handle_storm(bank, on);
+		break;
 	}
 }
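[Editor's note] The new X86_VENDOR_AMD case is not called directly; it is reached from the common storm-tracking code when a storm begins or subsides. A hypothetical sketch of that transition, consistent with the interrupt_enable = on assignment in the diff (the helper below is illustrative only and not part of this patch):

	/* Hypothetical caller: not part of this patch. */
	static void example_storm_transition(unsigned int bank, bool storm_active)
	{
		/*
		 * Storm detected  -> on == false: mce_amd_handle_storm() clears
		 *                    interrupt_enable on every block in the bank.
		 * Storm subsided  -> on == true: the blocks are re-enabled via
		 *                    threshold_restart_bank().
		 */
		mce_handle_storm(bank, !storm_active);
	}

Unlike the Intel CMCI path, which mitigates a storm by adjusting thresholds to reduce the interrupt rate, this approach simply turns the threshold interrupt off for the affected bank, so other error severities continue to be handled while the storm lasts.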