[RFC,v1,0/2] Userspace Can Control Memory Failure Recovery

Message ID 20240924043924.3562257-1-jiaqiyan@google.com (mailing list archive)

Jiaqi Yan Sept. 24, 2024, 4:39 a.m. UTC
Introduction and Motivation
===========================

Recently, userspace gained control over how the kernel handles memory with
corrected memory errors [1]. This RFC proposes extending userspace's control
to how the kernel deals with uncorrectable memory errors, so that userspace
can control all aspects of memory failure recovery (MFR).

Why does userspace need to play a role in MFR? There are two major use cases,
both from the cloud provider's perspective.

The first use case is 1G HugeTLB, which provides critical optimization for
Virtual Machines (VM) where database-centric and data-intensive workloads have
requirements of both large memory size (hundreds of GB or several TB [2]),
and high performance of address mapping. These VMs usually also require high
availability, so tolerating and recovering from inevitable uncorrectable
memory errors is usually provided by host RAS features for long VM uptime
(SLA is 99.95% Monthly Uptime [3]). Due to the 1GB granularity, once a single
byte of memory in a hugepage is corrupted in hardware, the kernel discards the
whole 1G hugepage from the HugeTLB system: not only the corrupted bytes but
also the healthy portion. In the cloud environment this is a great loss of
memory to the VM, and it puts the VM in a dilemma: although the host is able
to keep serving the VM, the VM itself struggles to continue its data-intensive
workload after the unnecessary loss of ~1G of data. On the other hand, simply
terminating the VM greatly reduces its uptime, given how frequently
uncorrectable memory errors occur. There was an RFC [4] that used high
granularity mapping [5] to recover from HugeTLB memory failures more
efficiently. However, it faded away along with the upstream proposal for high
granularity mapping.

The second use case comes from the discussion of MFR for huge VM_PFNMAP [6],
which aims to greatly improve TLB hits for PCI MMIO BARs and kernel-unmanaged
host primary memory. These are most relevant for VMs that run Machine Learning
(ML) workloads, which also require reliable VM uptime (see [7] for details).
There is no official MFR support for VM_PFNMAP yet, but [8] has been proposed.
It unmaps non-struct-page memory without considering huge VM_PFNMAP. Peter,
Jason and I discussed what the MFR behavior for huge VM_PFNMAP should be [9]:
if the driver originally VM_PFNMAP-ed with PUD, it must first zap the PUD,
and then intercept future page faults to either install PTE/PMD for clean PFNs,
or return VM_FAULT_HWPOISON for poisoned PFNs. Zapping the PUD leaves a huge
hole in the EPT or stage-2 (S2) page table, causing a lot of EPT or S2
violations that need to be fixed up by the device driver. VM performance will
degrade noticeably, not only while the hole is being refilled but also
afterwards, because the EPT or S2 page table is by then fragmented.

For both use cases, HARD_OFFLINE behavior in MFR arguably does more harm than
good to the VM. For the first case, if we simply leave the 1G HugeTLB hugepage
mapped, VM access to the clean PFNs within the poisoned 1G region still works
well; we just need to keep sending SIGBUS to userspace when it re-accesses
poisoned PFNs, to stop corrupted data from propagating. For the second case,
if the PUD is not zapped, there is no need for the driver to intercept page
faults to clean memory on HBM or EGM. In addition, in both cases there is no
EPT or S2 violation, so there is no performance cost for accessing clean guest
pages already mapped in EPT and S2.

Keeping or discarding a large chunk of memory wrt memory error therefore
sounds like something that userspace should be able to control. The virtual
machine monitor can choose the better behavior for its managed VM.

Background and Terminology
==========================

First I want to set the scope of the userspace control in the process of
kernel's memory error containment and recovery, which is drawn in [10]:
1. Interpret Hardware Fault: respond immediately to decode the exception
   generated by the hardware / firmware. On X86 the kernel needs to interpret
   machine check exceptions; on ARM the kernel needs to interpret the
   Common Platform Error Records (CPER).
2. Classify Error: the kernel classifies memory errors, white boxes in [10].
3. Result: different memory errors end up with different results, gray boxes
   in [10].

The scope of the control exposed to userspace is in Memory Failure Recovery
(highlighted in red in [10]), the only part that is relevant to userspace
and needs improvement. Userspace policy defines what userspace wants kernel
to do to the memory page in this stage. If specified by userspace, kernel
performs actions conforming to the policy; otherwise the behavior is exactly
what is in the kernel today. Once the kernel has finished the recovery actions
that it must execute itself, if any, it engages the relevant userspace
threads, which must be signaled to prevent data corruption and must
participate in MFR.

Depending on whether the memory error is uncorrectable or corrected, a memory
page is referred to as
- Error page/hugepage/folio: the raw/huge/(either raw or huge) page
  consisting of the physical memory poisoned by platform's RAS.
- Corrected page/hugepage/folio: the raw/huge/(either raw or huge) page
  consisting of the physical memory corrected by platform's RAS.

We propose two design options for the uAPI of the MFR policy.

Option 1. Global MFR Policy
===========================

The global MFR policy depicted in [11] is "global" in the sense that
1. It covers all system-level memory managed by the kernel, regardless
   of the underlying memory type.
2. It applies to all userspace threads, regardless of whether the physical
   memory is currently backing any VMA (it may be free memory) or which VMAs
   it is backing.
3. It applies to PCI(e) device memory (e.g. HBM on GPU) as well, on the
   condition that their device driver deliberately wants to follow the
   kernel's MFR, instead of being entirely taken care of by device driver
   (e.g. drivers/nvdimm/pmem.c).

Drawn in blue, the kernel initializes the MFR policy with the default policy
to ensure two things:
- A policy is always available during memory failure recovery.
- The default MFR policy is compatible with the existing MFR behavior in the
  kernel today.

MFR Config, which can be any binary that runs in userspace with root privilege,
configures or modifies the MFR policy offline, i.e. outside of MFR execution.
MFR Config is independent from and does not interact with stakeholder threads
at all. MFR Config establishes the source-of-truth policy that is enacted when
memory errors occur.

The userspace API to set/get the policy is via sysctl:
- /proc/sys/vm/enable_soft_offline: whether to SOFT_OFFLINE corrected folio
- /proc/sys/vm/enable_hard_offline: whether to HARD_OFFLINE error folio

SOFT_OFFLINE and HARD_OFFLINE are current upstream behavior and will be used
as default policy when initializing global MFR policy.
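
As a rough illustration of how an MFR Config binary might drive these knobs,
here is a minimal C sketch. enable_soft_offline exists upstream, while
enable_hard_offline is only proposed by this series; the helper names
mfr_sysctl_read() and mfr_sysctl_write() are made up for this example and are
not part of any library.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Read an integer sysctl value from `path`.
 * Returns 0 on success, a negative errno value on failure.
 */
int mfr_sysctl_read(const char *path, int *value)
{
    assert(path != NULL && value != NULL);

    FILE *f = fopen(path, "r");
    if (!f)
        return -errno;

    int parsed = (fscanf(f, "%d", value) == 1);
    fclose(f);
    return parsed ? 0 : -EINVAL;
}

/*
 * Write an integer sysctl value to `path` (needs root for /proc/sys/vm).
 * Returns 0 on success, a negative errno value on failure.
 */
int mfr_sysctl_write(const char *path, int value)
{
    assert(path != NULL);

    FILE *f = fopen(path, "w");
    if (!f)
        return -errno;

    int written = (fprintf(f, "%d\n", value) > 0);
    /* fclose() flushes, so a rejected write can surface here. */
    if (fclose(f) != 0)
        written = 0;
    return written ? 0 : -EIO;
}
```

With root privilege, mfr_sysctl_write("/proc/sys/vm/enable_hard_offline", 0)
would ask the kernel to skip HARD_OFFLINE, assuming the proposed sysctl is
present on the running kernel.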

One important thing to point out is what happens when
/proc/sys/vm/enable_hard_offline = 0: the kernel does NOT set the HWPoison flag
in the struct page or struct folio. This has implications, because the kernel
no longer enforces isolation of the poisoned page (which used to be done by
setting the HWPoison flag) to prevent both userspace and the kernel from
consuming the memory error and causing a hardware fault again:
1. Userspace already has sufficient capability to prevent itself from
   consuming memory error and causing hardware fault: with the poisoned
   virtual address sent in SIGBUS, it can ask the kernel to remap the poisoned
   page with data loss, or simply abort the memory load operation. That being
   said, there needs to be a mechanism to detect and forcibly kill a malicious
   userspace thread if it keeps ignoring SIGBUS and generates hardware faults
   repeatedly.
2. Kernel won't be able to forbid the reuse of the free error pages in future
   memory allocations. If an error page is allocated to the kernel, when the
   kernel consumes the allocated error page, a kernel panic is most likely to
   happen. For userspace, it is now not guaranteed that newly allocated memory
   is free of memory errors.

Option 2. Per-VMA MFR
=====================

This design provides a policy for every struct vm_area_struct (VMA). A VMA
describes a virtual memory area (a virtual address interval of contiguous
memory that all shares the same characteristics), and its granularity is per
VM-area and per task. One major use of a VMA is to associate a virtual memory
area of a process with a special rule for the page-fault handlers. Since the
recovery action is mainly about the page fault (PF), i.e. how the PF handler
behaves wrt a corrected or error page, attaching the MFR policy to the VMA
sounds like a natural fit.

The interface for a userspace thread to set the MFR policy for one of its own
virtual memory areas is

  int madvise(void *vaddr, size_t length, int MADV_MFR_XXX)

or, if we want some "MFR master" to be able to assign policy to other threads
of interest:

  int process_madvise(int pidfd, const struct iovec iovec[.n], size_t n,
                      int MADV_MFR_XXX, unsigned int flags)

where MADV_MFR_XXX is:
- MADV_MFR_HARD_OFFLINE: HARD_OFFLINE error folio.
- MADV_MFR_SOFT_OFFLINE: SOFT_OFFLINE corrected folio.
- MADV_MFR_KEEP_ERROR_MAPPED: keep error folio mapped to preserve the ability
  for userspace to re-access the error folio.
- MADV_MFR_KEEP_CORRECTED_MAPPED: keep corrected folio mapped to preserve the
  ability for userspace to re-access the corrected folio.

MADV_MFR_KEEP_ERROR_MAPPED is the inverse of MADV_MFR_HARD_OFFLINE.
MADV_MFR_KEEP_CORRECTED_MAPPED is the inverse of MADV_MFR_SOFT_OFFLINE.
{MADV_MFR_HARD_OFFLINE, MADV_MFR_KEEP_ERROR_MAPPED} are independent of
{MADV_MFR_SOFT_OFFLINE, MADV_MFR_KEEP_CORRECTED_MAPPED}.
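
To make the madvise() form of the interface more concrete, here is a minimal
userspace sketch in C. The MADV_MFR_* numbers below are placeholders invented
for this example: this RFC does not assign values and no kernel header defines
them. The helpers mfr_set_policy() and mfr_demo() are likewise hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * HYPOTHETICAL advice numbers. They are placeholders chosen only so that
 * this sketch compiles; a real uAPI would define its own values.
 */
#define MADV_MFR_HARD_OFFLINE          0x7f000001
#define MADV_MFR_SOFT_OFFLINE          0x7f000002
#define MADV_MFR_KEEP_ERROR_MAPPED     0x7f000003
#define MADV_MFR_KEEP_CORRECTED_MAPPED 0x7f000004

/* Apply one MFR advice to [addr, addr + len). Returns 0 or -errno. */
int mfr_set_policy(void *addr, size_t len, int advice)
{
    assert(addr != NULL && len > 0);

    if (madvise(addr, len, advice) == 0)
        return 0;
    return -errno;
}

/*
 * Map one anonymous page and ask to keep error folios mapped in it.
 * A real VMM would apply the advice to its guest memory mapping instead.
 */
int mfr_demo(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -errno;

    int ret = mfr_set_policy(p, len, MADV_MFR_KEEP_ERROR_MAPPED);
    munmap(p, len);
    return ret;
}
```

On a kernel without this proposal, madvise() should reject the unknown advice
with EINVAL, so mfr_demo() is expected to return -EINVAL there.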

The exact behaviors of SOFT_OFFLINE and HARD_OFFLINE are specific to the page
types. In general what will happen is
- Some pages, including pages not impacted by memory error, can be unmapped
  from userspace. PF handler sends SIGBUS to userspace for every unmapped page.
- Some pages, including pages not impacted by memory error, can be migrated
  to pages somewhere else.
- Transparent hugepage will be split to raw pages.
- HugeTLB hugepage will be dissolved to raw pages.
- PUD/PMD/PTE installed for (huge) VM_PFNMAP will be removed.

The default MFR policy is (MADV_MFR_SOFT_OFFLINE | MADV_MFR_HARD_OFFLINE) for
every VMA at its creation time, to be consistent with the current kernel’s
MFR behavior.
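
One way to picture the relationships between the four advices and the default
is as two independent policy bits per VMA. The following C model is only an
illustration under that assumption; the bit layout, the enum and
mfr_apply_advice() are invented here and do not describe an implementation.

```c
#include <assert.h>

/* Model only: one bit per offline behavior; a cleared bit means "keep mapped". */
#define MFR_POLICY_SOFT_OFFLINE (1u << 0)
#define MFR_POLICY_HARD_OFFLINE (1u << 1)
/* Default at VMA creation, matching the current kernel behavior. */
#define MFR_POLICY_DEFAULT \
    (MFR_POLICY_SOFT_OFFLINE | MFR_POLICY_HARD_OFFLINE)

/* Stand-ins for the proposed MADV_MFR_XXX advices. */
enum mfr_advice {
    MFR_ADV_HARD_OFFLINE,
    MFR_ADV_SOFT_OFFLINE,
    MFR_ADV_KEEP_ERROR_MAPPED,
    MFR_ADV_KEEP_CORRECTED_MAPPED,
};

/* Return the VMA policy that results from applying one advice. */
unsigned int mfr_apply_advice(unsigned int policy, enum mfr_advice advice)
{
    assert((policy & ~MFR_POLICY_DEFAULT) == 0);

    switch (advice) {
    case MFR_ADV_HARD_OFFLINE:
        return policy | MFR_POLICY_HARD_OFFLINE;
    case MFR_ADV_KEEP_ERROR_MAPPED:      /* inverse of HARD_OFFLINE */
        return policy & ~MFR_POLICY_HARD_OFFLINE;
    case MFR_ADV_SOFT_OFFLINE:
        return policy | MFR_POLICY_SOFT_OFFLINE;
    case MFR_ADV_KEEP_CORRECTED_MAPPED:  /* inverse of SOFT_OFFLINE */
        return policy & ~MFR_POLICY_SOFT_OFFLINE;
    }
    return policy;
}
```

In this model, applying MFR_ADV_KEEP_ERROR_MAPPED to the default policy clears
only the hard-offline bit and leaves the soft-offline bit untouched, which is
the independence described above.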

Unmap and Page Fault Behavior with Per-VMA Policy
=================================================

Before describing the kernel's behavior with the proposed per-VMA MFR policy,
one thing that is not changed by the per-VMA MFR policy but is well worth
pointing out: regardless of the MFR policy configured on the VMA, the kernel
sets the HWPoison flag on the struct page or struct folio.

New MFR and page fault behavior with per-VMA policy is illustrated in [12],
and here is a walkthrough with some pseudocode.

If the PF handler hasn't yet seen that the error page is HWPoison, a load from
corrupted physical memory kicks off the memory error containment process
depicted in [10], and for the scope we care about (corrected + recoverable
uncorrected memory errors) the kernel gets into the MFR box:

page = pfn_to_page(pfn)
pgoff = page_to_offset(page)
SetPageHWPoison(page)                               // step 6 in [12]
for thread in collect_procs_mapped(page):
    for vma in vma_interval_tree(thread, pgoff):
        if vma's MFR policy == MADV_MFR_OFFLINE:    // step 7 in [12]
            try_to_unmap(vma, page)                 // step 8 in [12]

        vaddr = page_address_in_vma(vma, page)
        kill_proc(thread, vaddr)                    // step 9 in [12]

There are two possible outcomes for the owning thread after the kernel sends
SIGBUS:
- The thread is terminated, so no further access to the error page is possible.
- The thread performs some recovery action and keeps running. In this case,
  the thread is able to attempt to re-access the poisoned memory (step 1 in
  [12], "re-access HWPOISON").

When all threads that share the HWPoison-flagged folio exit, this problematic
page will be isolated and prevented from future allocation. However, before
that happens, there could still be a surviving thread or a sharing thread that
attempts to access the HWPoison-flagged folio. For these threads, the PF
handler needs to handle the access to the HWPoison-flagged folio, either
because the folio is already unmapped somehow (e.g. by Memory Failure
Recovery), or because the folio is not yet mapped into the thread:

if PageHWPoison(page):					// step 2 in [12]
    if vma's MFR policy != MADV_MFR_KEEP_ERROR_MAPPED:  // step 3 in [12]
        kill_proc(current, vaddr)			// step 4 in [12]
        return VM_FAULT_HWPOISON

resolve the PF successfully
install VA => PFN in thread's page table
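
The decision in the pseudocode above can also be written as a small
self-contained C model, which is handy for reasoning about the cases. This is
a userspace toy, not kernel code: struct mfr_vma_model, the result enum and
mfr_model_fault() are invented for illustration, and the sigbus_sent output
merely stands in for kill_proc().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy result codes standing in for the page fault handler's return value. */
enum mfr_fault_result {
    MFR_FAULT_RESOLVED,  /* mapping installed, access proceeds */
    MFR_FAULT_HWPOISON,  /* stands in for VM_FAULT_HWPOISON */
};

/* Toy VMA carrying only the bit of MFR policy this decision needs. */
struct mfr_vma_model {
    bool keep_error_mapped;  /* models MADV_MFR_KEEP_ERROR_MAPPED */
};

/*
 * Model of steps 2 to 4 in [12]: a fault on an HWPoison-flagged page is
 * refused with SIGBUS unless the VMA asked to keep error folios mapped.
 */
enum mfr_fault_result mfr_model_fault(bool page_hwpoison,
                                      const struct mfr_vma_model *vma,
                                      bool *sigbus_sent)
{
    assert(vma != NULL && sigbus_sent != NULL);

    *sigbus_sent = false;
    if (page_hwpoison && !vma->keep_error_mapped) {
        *sigbus_sent = true;  /* stands in for kill_proc(current, vaddr) */
        return MFR_FAULT_HWPOISON;
    }
    /* Otherwise resolve the fault and install VA => PFN. */
    return MFR_FAULT_RESOLVED;
}
```

Note that a resolved fault on a kept folio says nothing about whether the
subsequent load hits a clean or a corrupted raw page.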

If the page fault is resolved and mapping installed to page table successfully,
or there isn't a PF at all as the mapping is still present in thread's page
table, there are two possible outcomes to the memory load to a folio that
contains raw error page(s):
- Sad case: The underlying physical memory mapped into vaddr has an
  uncorrectable memory error. Memory error consumption kicks off the memory
  error containment process depicted in [10].
- Happy case: The underlying physical memory mapped into vaddr is clean,
  i.e. the thread is accessing a healthy raw page in the compound hugepage.
  Thread gets the clean data it asked for.

The worst case is that a thread repeatedly hits the sad case. It spreads
sadness to the entire system in the manner of denial of service. What can we
do to mitigate this Loop of Sadness? Forcibly unmapping the poisoned folio
won't help; we must forcibly terminate the thread with SIGKILL.

When different VMAs backed by the same chunk of memory, or physical address
range (PAR), want conflicting MFR policies, how should the conflict be
resolved? [13] illustrates an example of this. VMA1 wants
MADV_MFR_KEEP_ERROR_MAPPED but VMA2 wants MADV_MFR_HARD_OFFLINE. When MFR
deals with the error hugepage in PAR2, after the hugepage is kept in VMA1 but
unmapped from VMA2 at the same time, should it dissolve the hugepage or leave
it as it is? I think we can treat MADV_MFR_XXX as a non-persistent property of
the VMA, like MADV_FREE or MADV_DONTNEED. After all, with the VMA established,
a thread can already modify the underlying physical memory content, so why
can't it change the MFR policy on it? A simple resolution to the conflicting
MADV_MFR_XXX for PAR2 is that the policy is established by the thread that
mapped it and last wrote it.

Pros and Cons
=============

Part of option 1 (/proc/sys/vm/enable_soft_offline) has already been merged
upstream. It already provides userspace the control necessary for corrected
memory errors.
From uncorrected memory error's perspective, the per-VMA MFR option and the
global MFR option have their own pros and cons:
- /proc/sys/vm/enable_hard_offline looks like a natural extension of
  /proc/sys/vm/enable_soft_offline, but the implications of not setting the
  HWPoison flag introduce risks to everyone in the system. Per-VMA policy
  can also introduce repeated hardware fault risk, but the risk is contained
  in the VMA / thread and can be cleaned up by the kernel when the life of
  the VMA / thread ends.
- /proc/sys/vm/enable_hard_offline is easy to implement, especially with the
  recent change from Jane [14]. I attached the patch for enable_hard_offline.
  On the other hand, per-VMA policy requires more code, like code to change
  page fault handler and to handle conflicting MFR policies.
- Per-VMA MFR provides userspace much better flexibility than global MFR.
- Global MFR policy requires high privilege, either root or dedicated user
  group. On the other hand, per-VMA policy naturally can be done by any
  userspace thread who owns the VMA.

No matter how the MFR policy is designed, userspace will eventually be
notified of poisoned memory, via SIGBUS BUS_MCEERR_AR or BUS_MCEERR_AO.
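
For reference, this is roughly how a userspace thread can tell the two
notifications apart today. The sketch uses the existing sigaction() interface
with SA_SIGINFO; the names mce_classify(), mce_install_sigbus_handler() and
the last_mce_* variables are invented for this example, and the recovery
policy in the handler (exit on action-required) is a deliberately simplistic
assumption, not what a VMM would do.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

enum mce_kind { MCE_NONE, MCE_ACTION_REQUIRED, MCE_ACTION_OPTIONAL };

/* Map a SIGBUS si_code to the kind of memory error notification. */
enum mce_kind mce_classify(int si_code)
{
    switch (si_code) {
    case BUS_MCEERR_AR:  /* the thread consumed the poisoned memory */
        return MCE_ACTION_REQUIRED;
    case BUS_MCEERR_AO:  /* error found asynchronously, not yet consumed */
        return MCE_ACTION_OPTIONAL;
    default:
        return MCE_NONE;
    }
}

/* Last notification seen, for the main program to inspect. */
volatile sig_atomic_t last_mce_kind = MCE_NONE;
void *volatile last_mce_addr;

static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;

    enum mce_kind kind = mce_classify(info->si_code);
    last_mce_kind = kind;
    last_mce_addr = info->si_addr;  /* poisoned virtual address */

    /*
     * Returning from an action-required (or unrelated) SIGBUS would just
     * re-run the faulting access, so this sketch gives up instead.
     */
    if (kind != MCE_ACTION_OPTIONAL)
        _exit(1);
}

/* Install the handler. Returns 0 on success, -1 on failure. */
int mce_install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

A VMM would typically replace the _exit() with its own recovery, for example
forwarding the error to the guest that owns the poisoned address.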

So far I personally prefer the global MFR policy, but I am open to feedback on
both options, or to new ideas.

[1] https://lwn.net/Articles/978732
[2] https://cloud.google.com/compute/docs/memory-optimized-machines#m2_machine_types
[3] https://cloud.google.com/compute/sla
[4] https://lore.kernel.org/lkml/20230428004139.2899856-1-jiaqiyan@google.com
[5] https://lwn.net/Articles/912017
[6] https://lore.kernel.org/linux-mm/20240828234958.GE3773488@nvidia.com/T/#m34d054d967a72ad8a7c8120c19447b415fd12179
[7] The example for MMIO BAR is Nvidia's GB200. In passthrough mode it supports
VM access to nearly half of its 196GB HBM per card [7.1]. The example for kernel
unmanaged host primary memory is Nvidia's extended GPU memory (EGM) [7.2],
so that ~400GB LPDDR5 DIMMs per socket on the host can not only back VM memory,
but are also accessible by GPU at high speed. Both HBM and EGM are exposed to
VM via VM_PFNMAP under the hood, and MFR for both HBM and EGM is important
because ML workloads require long VM uptime.
[7.1] https://www.nvidia.com/en-us/data-center/gb200-nvl72/?ncid=pa-srch-goog-739865&_bt=709953060161&_bk=nvidia%20blackwell%20tensor%20core%20gpus&_bm=p&_bn=g&_bg=169122792888&gad_source=1&gclid=Cj0KCQjwz7C2BhDkARIsAA_SZKbHWgnjAA_0Ve8niwtx9FooW-bgzehdRkDnoke-zIKafDaVu9d75eEaAjc_EALw_wcB
[7.2] https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/#extended_gpu_memory
[8] https://lore.kernel.org/lkml/20231123003513.24292-2-ankita@nvidia.com/#t
[9] https://lore.kernel.org/linux-mm/20240828234958.GE3773488@nvidia.com/T/#m413a61acaf1fc60e65ee7968ab0ae3093f7b1ea3
[10] https://docs.google.com/drawings/d/1Dmx2sxUGyRWdA1-5-HVko6IpsFL6PYAYL0ZL8T8AhY4
[11] https://docs.google.com/drawings/d/1E4m5Zy6_JFLmsacM3Z8FU6LLxLiTPMxvbmf4gzZhN6c
[12] https://docs.google.com/drawings/d/1hEe2BuEEJAlnqE4cjiZc-eBLjrkUk4BwOPDyL7TDClw
[13] https://docs.google.com/drawings/d/1u4er__Bziwn7itijOwghXhfu-JrXMDhnfFVu62BTzr0
[14] https://lore.kernel.org/all/20240524215306.2705454-2-jane.chu@oracle.com/T/#mbd530effd89d50eef7e9dd9375b900e7e34803c1

Jiaqi Yan (2):
  mm/memory-failure: introduce global MFR policy
  docs: mm: add enable_hard_offline sysctl

 Documentation/admin-guide/sysctl/vm.rst | 92 +++++++++++++++++++++++++
 mm/memory-failure.c                     | 33 +++++++++
 2 files changed, 125 insertions(+)