[v2,0/2] New CMCI storm mitigation for Intel CPUs

Message ID 20220315181509.351704-1-tony.luck@intel.com (mailing list archive)

Message

Tony Luck March 15, 2022, 6:15 p.m. UTC
Two-part motivation:

1) Disabling CMCI globally is an overly big hammer

2) Intel signals some UNCORRECTED errors using CMCI (yes, turns
out that was a poorly chosen name given the later evolution of
the architecture). Since we don't want to miss those, the proposed
storm code just bumps the threshold to (almost) maximum to mitigate,
but not eliminate the storm. Note that the threshold only applies
to corrected errors.

Patch 1 deletes the parts of the old storm code that are no
longer needed.

Patch 2 adds the new per-bank mitigation.
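
For readers skimming the thread, the per-bank idea can be sketched in
userspace C as below. This is a minimal illustration under stated
assumptions, not the code from patch 2: every name (track_bank,
STORM_BEGIN_COUNT, STORM_END_QUIET, THRESHOLD_STORM, ...) and every
numeric value is hypothetical, and the hardware threshold register is
replaced by a plain struct field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All names and values here are hypothetical, chosen for illustration. */
#define NUM_BANKS          4
#define STORM_BEGIN_COUNT  5      /* errors in recent history that start a storm */
#define STORM_END_QUIET    30     /* consecutive quiet polls that end a storm */
#define THRESHOLD_NORMAL   1
#define THRESHOLD_STORM    32749  /* assumed "almost maximum" of a 15-bit field */

struct bank_state {
	uint64_t history;    /* bit i set: an error was seen i polls ago */
	bool     in_storm;
	uint32_t threshold;  /* stand-in for the per-bank hardware threshold */
};

struct bank_state banks[NUM_BANKS];

/* Put every bank back into the non-storm state. */
void reset_banks(void)
{
	for (unsigned int i = 0; i < NUM_BANKS; i++) {
		banks[i].history = 0;
		banks[i].in_storm = false;
		banks[i].threshold = THRESHOLD_NORMAL;
	}
}

/* Count the set bits in v. */
static int count_bits(uint64_t v)
{
	int n = 0;

	while (v) {
		v &= v - 1;
		n++;
	}
	return n;
}

/* Record one poll result for one bank and update only that bank. */
void track_bank(unsigned int bank, bool error_seen)
{
	struct bank_state *b;

	assert(bank < NUM_BANKS);
	b = &banks[bank];
	b->history = (b->history << 1) | (error_seen ? 1u : 0u);

	if (b->in_storm) {
		/* End the storm after STORM_END_QUIET polls without an error. */
		if ((b->history & ((1ULL << STORM_END_QUIET) - 1)) == 0) {
			b->in_storm = false;
			b->history = 0;
			b->threshold = THRESHOLD_NORMAL;
		}
	} else if (error_seen && count_bits(b->history) >= STORM_BEGIN_COUNT) {
		/* Raise the threshold instead of disabling the interrupt. */
		b->in_storm = true;
		b->threshold = THRESHOLD_STORM;
	}
}
```

The point of the sketch is that a noisy bank raises only its own
threshold, so the interrupt stays enabled everywhere and signals that
the threshold does not cover can still arrive.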

Smita: Unless Boris finds some more stuff for me to fix, this
version will be a better starting point to merge with your changes.

Changes since v1 (based on feedback from Boris)

- Spelling fixes in commit message
- Many more comments explaining what is going on
- Change name of function that does tracking
- Change names for #defines for storm BEGIN/END
- #define for high threshold in decimal, not hex

Tony Luck (2):
  x86/mce: Remove old CMCI storm mitigation code
  x86/mce: Add per-bank CMCI storm mitigation

 arch/x86/kernel/cpu/mce/core.c     |  46 +++---
 arch/x86/kernel/cpu/mce/intel.c    | 241 ++++++++++++++---------------
 arch/x86/kernel/cpu/mce/internal.h |  10 +-
 3 files changed, 141 insertions(+), 156 deletions(-)


base-commit: ffb217a13a2eaf6d5bd974fc83036a53ca69f1e2

Comments

Borislav Petkov March 15, 2022, 6:34 p.m. UTC | #1
On Tue, Mar 15, 2022 at 11:15:07AM -0700, Tony Luck wrote:
> Smita: Unless Boris finds some more stuff for me to fix, this
> version will be a better starting point to merge with your changes.

Right, I'm wondering if AMD can use the same scheme so that abstracting
out the hw-specific accesses (MSR writes, etc) would be enough...

Koralahalli Channabasappa, Smita March 15, 2022, 9:46 p.m. UTC | #2
On 3/15/22 1:34 PM, Borislav Petkov wrote:

> On Tue, Mar 15, 2022 at 11:15:07AM -0700, Tony Luck wrote:
>> Smita: Unless Boris finds some more stuff for me to fix, this
>> version will be a better starting point to merge with your changes.
> Right, I'm wondering if AMD can use the same scheme so that abstracting
> out the hw-specific accesses (MSR writes, etc) would be enough...

Thanks Tony.

Agreed. Most of this would apply to AMD's threshold interrupts too.

Will come up with a merged patch and move the storm handling to
mce/core.c and just keep the hw-specific accesses separate for
Intel and AMD in their respective files.
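
One possible shape for that split, sketched in userspace C. This is
only an illustration of the idea, not a proposed interface: the
storm_hw_ops table, the function names, and the threshold values are
all hypothetical, and the MSR writes are simulated with arrays.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BANKS 8

/* Simulated per-vendor threshold registers (stand-ins for MSR writes). */
uint32_t fake_intel_regs[MAX_BANKS];
uint32_t fake_amd_regs[MAX_BANKS];

/* Hypothetical vendor hook table; not an existing kernel interface. */
struct storm_hw_ops {
	void (*write_threshold)(unsigned int bank, uint32_t value);
	uint32_t normal;  /* threshold outside a storm (placeholder value) */
	uint32_t storm;   /* raised threshold during a storm (placeholder value) */
};

/* Vendor-specific part: only the register access differs. */
static void intel_write_threshold(unsigned int bank, uint32_t value)
{
	fake_intel_regs[bank] = value;
}

static void amd_write_threshold(unsigned int bank, uint32_t value)
{
	fake_amd_regs[bank] = value;
}

const struct storm_hw_ops intel_storm_ops = {
	.write_threshold = intel_write_threshold,
	.normal = 1,
	.storm = 32749,
};

const struct storm_hw_ops amd_storm_ops = {
	.write_threshold = amd_write_threshold,
	.normal = 1,
	.storm = 4095,
};

/* Vendor-neutral part: the storm logic calls through the hooks. */
void storm_begin(const struct storm_hw_ops *ops, unsigned int bank)
{
	assert(bank < MAX_BANKS);
	ops->write_threshold(bank, ops->storm);
}

void storm_end(const struct storm_hw_ops *ops, unsigned int bank)
{
	assert(bank < MAX_BANKS);
	ops->write_threshold(bank, ops->normal);
}
```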

Thanks
Smita.