[v3,0/5] ARM Error Source Table V2 Support

Message ID 20250115084228.107573-1-tianruidong@linux.alibaba.com (mailing list archive)

Ruidong Tian Jan. 15, 2025, 8:42 a.m. UTC
AEST provides a mechanism for hardware to notify the kernel directly of
RAS errors through interrupts, so that the kernel handles them itself.
This is also known as kernel-first mode.

AEST Advantages
===============

1. AEST uses EL1 interrupts to report CE/DE, making it more lightweight
   than GHES (the firmware-first solution on Arm).
2. Because AEST is lightweight, the system can report every CE, and user
   applications can use this information for memory error prediction.

AEST Driver Architecture
========================

The AEST driver consists of three components:
  - AEST device: handles interrupts and manages AEST nodes and records.
  - AEST node: corresponds to a RAS node in hardware [1].
  - AEST record: a set of RAS registers.

They are organized as follows:

 ┌──────────────────────────────────────────────────┐
 │             AEST Driver Device Management        │
 │┌─────────────┐    ┌──────────┐     ┌───────────┐ │
 ││ AEST Device ├─┬─►│AEST Node ├──┬─►│AEST Record│ │
 │└─────────────┘ │  └──────────┘  │  └───────────┘ │
 │                │       .        │  ┌───────────┐ │
 │                │       .        ├─►│AEST Record│ │
 │                │       .        │  └───────────┘ │
 │                │  ┌──────────┐  │        .       │
 │                ├─►│AEST Node │  │        .       │
 │                │  └──────────┘  │        .       │
 │                │                │  ┌───────────┐ │
 │                │  ┌──────────┐  └─►│AEST Record│ │
 │                └─►│AEST Node │     └───────────┘ │
 │                   └──────────┘                   │
 └──────────────────────────────────────────────────┘
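The hierarchy in the diagram can be pictured as nested containers. The
following is only a userspace sketch of that shape: every name prefixed
with sketch_ is hypothetical, and the real definitions in
drivers/ras/aest/aest.h will differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one AEST record (a RAS register set). */
struct sketch_record {
	uint64_t status;	/* placeholder for the record's status register */
};

/* Hypothetical stand-in for one AEST node, which owns several records. */
struct sketch_node {
	struct sketch_record *records;
	size_t nr_records;
};

/* Hypothetical stand-in for the AEST device, which owns several nodes. */
struct sketch_device {
	struct sketch_node *nodes;
	size_t nr_nodes;
};

/* Walk device -> node and sum the records reachable from one device. */
static size_t sketch_count_records(const struct sketch_device *dev)
{
	size_t total = 0;

	assert(dev != NULL);
	for (size_t i = 0; i < dev->nr_nodes; i++)
		total += dev->nodes[i].nr_records;
	return total;
}
```

A device with one node of two records and another node of three records
would yield five from sketch_count_records().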


AEST Interrupt Handling
=======================

When an AEST interrupt occurs:
1. The AEST device traverses all AEST nodes to locate the errored records.
2. Each node can contain two types of records:
      - report record: the node locates errored records through the
                       bitmap in the ERRGSR register.
      - poll record: the node must poll every record to check whether
                     it has an error.
3. Process the record:
      - if the error is corrected, reset the CE threshold and print it.
      - if the error is deferred, dump the registers and call
        memory_failure().
      - if the error is uncorrected, panic. In practice a UE usually
        raises an exception rather than an interrupt.
4. Decode the record: the AEST driver notifies other drivers, such as
   EDAC, to decode the RAS registers.
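Steps 2 and 3 for report records can be sketched in plain C as below.
This is an illustration, not the driver code: the sketch_ names, the
flag bit positions, and the rule that the most severe error wins are all
assumptions made for the example, and the real status register layout is
defined by the Arm RAS specification.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcomes, ordered from least to most severe. */
enum sketch_action {
	SKETCH_NONE,		/* record holds no valid error */
	SKETCH_LOG_CE,		/* corrected: reset CE threshold and log */
	SKETCH_MEMORY_FAILURE,	/* deferred: dump registers, isolate memory */
	SKETCH_PANIC,		/* uncorrected */
};

/* Illustrative flag positions only; NOT the architectural layout. */
#define SKETCH_ST_VALID	(UINT64_C(1) << 0)
#define SKETCH_ST_CE	(UINT64_C(1) << 1)
#define SKETCH_ST_DE	(UINT64_C(1) << 2)
#define SKETCH_ST_UE	(UINT64_C(1) << 3)

/* Map one record's status flags to the action described in step 3. */
static enum sketch_action sketch_classify(uint64_t status)
{
	if (!(status & SKETCH_ST_VALID))
		return SKETCH_NONE;
	if (status & SKETCH_ST_UE)
		return SKETCH_PANIC;
	if (status & SKETCH_ST_DE)
		return SKETCH_MEMORY_FAILURE;
	if (status & SKETCH_ST_CE)
		return SKETCH_LOG_CE;
	return SKETCH_NONE;
}

/*
 * Walk a group-status bitmap in which bit n set means record n reported
 * an error. Records whose bit is clear are skipped without being read.
 * Returns the most severe action found.
 */
static enum sketch_action sketch_scan_reported(uint64_t errgsr,
					       const uint64_t *status,
					       unsigned int nr_records)
{
	enum sketch_action worst = SKETCH_NONE;

	assert(nr_records <= 64);
	for (unsigned int n = 0; n < nr_records; n++) {
		if (!(errgsr & (UINT64_C(1) << n)))
			continue;
		enum sketch_action a = sketch_classify(status[n]);
		if (a > worst)
			worst = a;
	}
	return worst;
}
```

A poll record would follow the same classification, except that every
record is read instead of only those flagged in the bitmap.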

AEST Error Injection
====================

The AEST driver provides an error injection interface (software simulation
rather than real hardware errors); details can be seen in patch0003.

Address Translation
===================

As described in 2.2 of [0], the error address reported by an AEST record may
be a "node-specific Logical Address" (LA) rather than the "System Physical
Address" (SPA) used by the kernel, so the driver needs to translate the LA to
an SPA. This is similar to AMD ATL [2], so patch0004 introduces a common
function for both AMD and ARM.
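One way to picture a translation hook shared between platforms is a
registered callback that the common code calls when it needs an SPA. The
sketch below is a userspace illustration under that assumption; the
sketch_ names and the fixed-offset backend are hypothetical and are not
the interface added by patch0004.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: translate a logical address to an SPA. */
typedef int (*sketch_translate_fn)(uint64_t la, uint64_t *spa);

/* The single backend currently registered, or NULL if there is none. */
static sketch_translate_fn sketch_translator;

/* A platform backend installs its translation routine here. */
static void sketch_register_translator(sketch_translate_fn fn)
{
	sketch_translator = fn;
}

/* Common entry point: returns 0 on success, -1 if no backend exists. */
static int sketch_la_to_spa(uint64_t la, uint64_t *spa)
{
	assert(spa != NULL);
	if (!sketch_translator)
		return -1;
	return sketch_translator(la, spa);
}

/* Toy backend for the example: the SPA is the LA plus a fixed base. */
static int sketch_offset_backend(uint64_t la, uint64_t *spa)
{
	*spa = la + UINT64_C(0x80000000);
	return 0;
}
```

With this shape, callers only use sketch_la_to_spa() and do not need to
know which platform backend, if any, is registered.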

I have tested this series on THead Yitian710 SOC.

Future work:
1. Add CE storm mitigation.
2. Support AEST vendor node.

This series is based on Tyler Baicar's patches [1], for which no v2 has
been sent to the mailing list yet. Changes from the original patches:
1. Add a genpool to collect all AEST errors, and log them in a workqueue
   rather than in irq context.
2. Use a single aest_proc function for both the system register interface
   and the MMIO interface.
3. Restructure some structures and functions to make them clearer.
4. Address all review comments from the mailing list thread on Tyler
   Baicar's patches.

Changes from V2:
https://lore.kernel.org/all/20240321025317.114621-1-tianruidong@linux.alibaba.com/
1. Tomohiro Misono
    - Dump registers before panic.
2. Baolin Wang & Shuai Xue: address all comments.
3. Support AEST V2.

Changes from V1:
https://lore.kernel.org/all/20240304111517.33001-1-tianruidong@linux.alibaba.com/
1. Marc Zyngier
  - Use readq/writeq_relaxed instead of readq/writeq for MMIO address.
  - Add sync for system register operation.
  - Use irq_is_percpu_devid() helper to identify a per-CPU interrupt.
  - Other fixes.
2. Set RAS CE threshold in AEST driver.
3. Enable RAS interrupt explicitly in driver.
4. UER and UEO trigger memory_failure() rather than panic.

[0]: https://developer.arm.com/documentation/den0085/0101/
[1]: https://lore.kernel.org/all/20211124170708.3874-1-baicar@os.amperecomputing.com/
[2]: https://lore.kernel.org/all/20240123041401.79812-2-yazen.ghannam@amd.com/

Ruidong Tian (5):
  ACPI/RAS/AEST: Initial AEST driver
  RAS/AEST: Introduce AEST driver sysfs interface
  RAS/AEST: Introduce AEST inject interface to test AEST driver
  RAS/ATL: Unified ATL interface for ARM64 and AMD
  trace, ras: add ARM RAS extension trace event

 Documentation/ABI/testing/debugfs-aest |  115 +++
 MAINTAINERS                            |   11 +
 arch/arm64/include/asm/ras.h           |   95 +++
 drivers/acpi/arm64/Kconfig             |   11 +
 drivers/acpi/arm64/Makefile            |    1 +
 drivers/acpi/arm64/aest.c              |  340 ++++++++
 drivers/acpi/arm64/init.c              |    2 +
 drivers/acpi/arm64/init.h              |    1 +
 drivers/edac/amd64_edac.c              |    2 +-
 drivers/ras/Kconfig                    |    1 +
 drivers/ras/Makefile                   |    1 +
 drivers/ras/aest/Kconfig               |   17 +
 drivers/ras/aest/Makefile              |    7 +
 drivers/ras/aest/aest-core.c           | 1017 ++++++++++++++++++++++++
 drivers/ras/aest/aest-inject.c         |  151 ++++
 drivers/ras/aest/aest-sysfs.c          |  230 ++++++
 drivers/ras/aest/aest.h                |  338 ++++++++
 drivers/ras/amd/atl/core.c             |    4 +-
 drivers/ras/amd/atl/internal.h         |    2 +-
 drivers/ras/amd/atl/umc.c              |    3 +-
 drivers/ras/ras.c                      |   27 +-
 include/linux/acpi_aest.h              |   68 ++
 include/linux/cpuhotplug.h             |    1 +
 include/linux/ras.h                    |   17 +-
 include/ras/ras_event.h                |   71 ++
 25 files changed, 2510 insertions(+), 23 deletions(-)
 create mode 100644 Documentation/ABI/testing/debugfs-aest
 create mode 100644 arch/arm64/include/asm/ras.h
 create mode 100644 drivers/acpi/arm64/aest.c
 create mode 100644 drivers/ras/aest/Kconfig
 create mode 100644 drivers/ras/aest/Makefile
 create mode 100644 drivers/ras/aest/aest-core.c
 create mode 100644 drivers/ras/aest/aest-inject.c
 create mode 100644 drivers/ras/aest/aest-sysfs.c
 create mode 100644 drivers/ras/aest/aest.h
 create mode 100644 include/linux/acpi_aest.h