[RFC,v1,0/3] Userspace MFR Policy via memfd

Message ID: 20250118231549.1652825-1-jiaqiyan@google.com

Jiaqi Yan Jan. 18, 2025, 11:15 p.m. UTC
Background and Motivation
=========================

To recap [1] in short: in Cloud, HugeTLB and huge VM_PFNMAP serve
capacity- and performance-critical guest memory, but the current memory
failure recovery (MFR) behavior for both is not ideal:
* Once a byte of memory in a hugepage is hardware corrupted, the kernel
  discards the whole hugepage, not only the corrupted bytes but also the
  healthy portion, from the HugeTLB system, causing a great loss of
  memory to the VM. We use 1GB HugeTLB for the vast majority of guest
  memory in GCE.
* After MFR zaps the PUD (assuming the memory mapping is huge VM_PFNMAP
  [2]), there will be a huge hole in the EPT or stage-2 (S2) page table,
  causing a lot of EPT or S2 violations that need to be fixed up by the
  device driver or core MM, and a fragmented EPT or S2 after fixup.
  There will be noticeable VM performance degradation (see “MemCycler
  Benchmarking”).

Therefore keeping or discarding a large chunk of contiguous memory
mapped to userspace (particularly to serve guest memory) due to
uncorrected memory error (UE, recoverable is implied) should be
controlled by userspace, e.g. virtual machine monitor (VMM) in Cloud.

In the MM upstream alignment meeting [3], we were able to align with
folks from the Linux MM upstream community on “Why Control Needed” and
“What to Control”. However, neither of the two proposed approaches for
“How to Control” was well accepted, and we think it is worth pursuing
the memfd-based userspace MFR idea brought up by Jason Gunthorpe.

MemCycler Benchmarking
======================

To follow up on the question from Dave Hansen, “If one motivation for
this is guest performance, then it would be great to have some data to
back that up, even if it is worst-case data”, we ran MemCycler in a
guest and compared its performance when there is an extremely large
number of memory errors.

The MemCycler benchmark cycles through memory with multiple threads. On
each iteration, the thread reads the current value, validates it, and
writes a counter value. The benchmark continuously outputs rates
indicating the speed at which it is reading and writing 64-bit integers,
and aggregates the reads and writes of the multiple threads across
multiple iterations into a single rate (unit: 64-bit per microsecond).

MemCycler is running inside a VM with 80 vCPUs and 640 GB guest memory.
The hardware platform hosting the VM is using Intel Emerald Rapids CPUs
(in total 120 physical cores) and 1.5 T DDR5 memory. MemCycler allocates
memory with 2M transparent hugepage in the guest. Our in-house VMM backs
the guest memory with 2M transparent hugepage on the host. The final
aggregate rate after 60 runtime is 17,204.69 and referred to as the
baseline case.

In the experimental case, all the setups are identical to the baseline
case, however 25% of the guest memory is split from THP to 4K pages due
to the memory failure recovery triggered by MADV_HWPOISON. I made some
minor changes in the kernel so that the MADV_HWPOISON-ed pages are
unpoisoned, and afterwards the in-guest MemCycler is still able to read
and write its data. The final aggregate rate is 16,355.11, which is
decreased by 5.06% compared to the baseline case. When 5% of the guest
memory is split after MADV_HWPOISON, the final aggregate rate is
16,999.14, a drop of 1.20% compared to the baseline case.

Design
======

A userspace process creates a memfd to get a file that lives in RAM, has
volatile backing storage, and whose backing memory has anonymous
semantics. Userspace can then modify, truncate, memory-map the file, and
so on.
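
As a reminder of what that looks like with today's API, here is a minimal
sketch that creates a memfd, sizes it, maps it, and reads back through
the file descriptor what was written through the mapping. The helper
name memfd_roundtrip and the "demo" file name are illustrative only and
not part of this series.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create an anonymous RAM-backed file of `len` bytes, map it shared,
 * write a pattern through the mapping and read it back through the fd.
 * `len` must be at least sizeof(pattern).
 * Returns 0 on success, -1 on any failure.
 */
int memfd_roundtrip(size_t len)
{
	static const char pattern[] = "guest-memory";
	char readback[sizeof(pattern)] = { 0 };
	char *map;
	int ret = -1;

	int fd = memfd_create("demo", MFD_CLOEXEC);
	if (fd < 0)
		return -1;

	/* A fresh memfd has size 0; give it a size before mapping. */
	if (ftruncate(fd, (off_t)len) < 0)
		goto out_close;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	/* Writes through the mapping are visible through the fd. */
	memcpy(map, pattern, sizeof(pattern));
	if (pread(fd, readback, sizeof(readback), 0) ==
		    (ssize_t)sizeof(readback) &&
	    memcmp(readback, pattern, sizeof(pattern)) == 0)
		ret = 0;

	munmap(map, len);
out_close:
	close(fd);
	return ret;
}
```

Calling memfd_roundtrip(4096) exercises the create, truncate, and mmap
steps mentioned above on a regular (SHMEM-backed) memfd.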

Per-memfd MFR Policy associates the userspace MFR policy with a memfd
instance. This approach is promising for the following reasons:
1. Keeping memory with UE mapped to a process has risks if the process
   does not do its duty to prevent itself from repeatedly consuming UER.
   The MFR policy can be associated with a memfd to limit such risk to a
   particular memory space owned by a particular process that opts in
   to the policy. This is much preferable to the Global MFR Policy
   proposed in the initial RFC, which provides no granularity
   whatsoever.
2. Like the Per-VMA MFR Policy in the initial RFC, poisoning the folio
   and keeping the mapping do not conflict in the Per-memfd MFR Policy:
   the kernel can keep setting the HWPoison flag on the folios affected
   by the UE while the folio stays mapped to userspace. This is an
   advantage over the Global MFR Policy, which breaks the kernel’s
   HWPoison behavior.
3. Although the MFR policy allows the userspace process to keep memory
   with UE mapped, eventually these HWPoison-ed folios need to be dealt
   with by the kernel (e.g. split into the smallest chunks and isolated
   from future allocation). Once all references to a memfd are dropped,
   it is automatically released, which is the perfect time for the
   kernel to do its duty on any HWPoison-ed folios. This is also a big
   advantage over the Global MFR Policy, which breaks the kernel’s
   protection against HWPoison-ed folios.
4. Given memfd’s anonymous semantics, we don’t need to worry that
   different threads can have different and conflicting MFR policies.
   This allows a simpler implementation than the Per-VMA MFR Policy in
   the initial RFC [1].

Userspace can choose whether the memory backing the created file is
managed by HugeTLB (MFD_HUGETLB) or SHMEM. To userspace the Per-memfd
MFR Policy is independent of the memory management system, although the
implementations required in the kernel are different because the
existing MFR behavior already varies.

UAPI
====

The UAPI to control MFR policy via memfd is through the memfd_create
syscall with a new flag value:

  #define MFD_MF_KEEP_UE_MAPPED	0x0020U
  int memfd_create(const char *name, unsigned int flags);

When MFD_MF_KEEP_UE_MAPPED is set in flags, memory failure (MF) recovery
in the kernel doesn’t hard offline memory due to uncorrected error (UE)
until the created memfd is released. IOW, the HWPoison-ed memory remains
accessible via the returned memfd or the memory mapping created with
that memfd.

However, the affected memory will be immediately protected and isolated
from future use by both kernel and userspace once the owning memfd is
gone or the memory is truncated. By default MFD_MF_KEEP_UE_MAPPED is not
set, and the kernel hard offlines memory having UEs. In both cases the
kernel immediately poisons the affected folios.
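
A hedged sketch of how a VMM might opt in from userspace follows. It
assumes the flag value proposed above and that a kernel without this
series rejects the unknown flag with EINVAL, as memfd_create does today
for unrecognized flags. The helper name memfd_create_keep_ue_mapped and
the fallback strategy are my own illustration, not part of the patches.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <sys/mman.h>
#include <unistd.h>

/* Value proposed in this RFC; not defined by released kernel headers. */
#ifndef MFD_MF_KEEP_UE_MAPPED
#define MFD_MF_KEEP_UE_MAPPED 0x0020U
#endif

/*
 * Hypothetical helper: try to create a memfd that opts in to the
 * "keep UE mapped" policy. If the kernel rejects the flag with EINVAL,
 * fall back to a memfd with the default MFR behavior.
 * `*opted_in` reports which of the two happened.
 * Returns the new fd, or -1 with errno set.
 */
int memfd_create_keep_ue_mapped(const char *name, unsigned int base_flags,
				bool *opted_in)
{
	int fd = memfd_create(name, base_flags | MFD_MF_KEEP_UE_MAPPED);

	if (fd >= 0) {
		*opted_in = true;
		return fd;
	}
	if (errno != EINVAL)
		return -1;

	/* Flag not accepted: default policy (hard offline on UE). */
	*opted_in = false;
	return memfd_create(name, base_flags);
}
```

A caller backing guest memory with 1G HugeTLB would pass MFD_HUGETLB
(plus the huge page size encoding) in base_flags and check opted_in to
decide whether it has to handle whole-hugepage loss on a UE.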

MFD_MF_KEEP_UE_MAPPED translates to a new flag value in address_space.
The new code changes in MFR, the mm fault handler, and the in-RAM file
system are built around this flag.

  /* Bits in mapping->flags. */
  enum mapping_flags {
    ...
    /*
     * Keeps folios belonging to the mapping mapped even if uncorrectable
     * memory errors (UE) caused memory failure (MF) within the folio.
     * Only at the end of the mapping's lifetime will its HWPoison-ed
     * folios be dealt with.
     */
    AS_MF_KEEP_UE_MAPPED = 9,
    ...
  };
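
To make the flag handling concrete, here is a small userspace model of
the set/test helpers around that bit. The struct is a stand-in for the
kernel's address_space, and mapping_mf_keep_ue_mapped is a hypothetical
name for the query side; only mapping_set_mf_keep_ue_mapped appears in
this cover letter.

```c
#include <stdbool.h>

/* Userspace stand-in for struct address_space; only `flags` is modeled. */
struct address_space_model {
	unsigned long flags;
};

/* Bit number taken from the RFC; other mapping_flags entries omitted. */
enum { AS_MF_KEEP_UE_MAPPED = 9 };

/*
 * Setter named in the RFC. Existing kernel helpers of this kind (e.g.
 * mapping_set_unevictable) use the atomic set_bit(); a plain OR is
 * enough for this single-threaded model.
 */
static inline void
mapping_set_mf_keep_ue_mapped(struct address_space_model *mapping)
{
	mapping->flags |= 1UL << AS_MF_KEEP_UE_MAPPED;
}

/* Hypothetical query helper for MFR, the fault handler, and the fs. */
static inline bool
mapping_mf_keep_ue_mapped(const struct address_space_model *mapping)
{
	return (mapping->flags & (1UL << AS_MF_KEEP_UE_MAPPED)) != 0;
}
```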

Implementation
==============

Implementation is relatively straightforward with two major parts.

Part 1: When an AS_MF_KEEP_UE_MAPPED memfd is alive and one of its memory
pages is affected by UE:
* MFR needs to defer the operations (e.g. unmapping, splitting,
  dissolving) needed to hard offline the memory page. MFR still holds a
  refcount for every raw HWPoison-ed page. MFR still sends SIGBUS to the
  consuming thread, but si_addr_lsb is reduced to PAGE_SHIFT.
* If the memory was not faulted in yet, the fault handler also needs to
  stop blocking faults on the HWPoison-ed folio.
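
On the userspace side, the SIGBUS described above can be decoded with
the standard siginfo fields. The sketch below derives the affected range
from si_addr and si_addr_lsb; with the policy set, the text above implies
the range is a single base page even inside a hugepage. The struct and
function names are illustrative, and storing a struct from a signal
handler is a simplification that a real VMM would replace with a
signal-safe queue.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Memory range covered by one SIGBUS memory-error report. */
struct poison_range {
	uintptr_t start;
	size_t len;
};

/* si_addr_lsb is the log2 of the size of the affected, aligned range. */
struct poison_range poison_range_from(uintptr_t addr, unsigned int addr_lsb)
{
	struct poison_range r;

	r.len = (size_t)1 << addr_lsb;
	r.start = addr & ~((uintptr_t)r.len - 1);
	return r;
}

static volatile sig_atomic_t mce_seen;
static struct poison_range last_range;

static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;

	/* Only machine-check reports carry a meaningful si_addr_lsb. */
	if (info->si_code == BUS_MCEERR_AR ||
	    info->si_code == BUS_MCEERR_AO) {
		last_range = poison_range_from((uintptr_t)info->si_addr,
					       (unsigned int)info->si_addr_lsb);
		mce_seen = 1;
	}
}

/* Returns the last recorded range; valid only after mce_seen is set. */
struct poison_range last_poison_range(void)
{
	return last_range;
}

/* Install the handler with SA_SIGINFO so si_addr/si_addr_lsb are passed. */
int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

With si_addr_lsb equal to 12 the VMM only needs to fence off one 4K page
of guest memory; with 30 it would have to treat the whole 1G hugepage as
lost.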

Part 2: When an AS_MF_KEEP_UE_MAPPED memfd is about to be released, or
when the userspace process truncates a range of memory pages belonging
to an AS_MF_KEEP_UE_MAPPED memfd:
* When the in-memory file system is evicting the inode corresponding to
  the memfd, it needs to prepare the HWPoison-ed folios that are easily
  identifiable with the PG_HWPOISON flag. This operation is implemented
  by populate_memfd_hwp_folios and is exported to file systems.
* After the file system removes all the folios, there is nothing else
  preventing MFR from dealing with HWPoison-ed folios, so the file
  system forwards them to MFR. This step is implemented by
  offline_memfd_hwp_folios and is exported to file systems.
* MFR has been holding refcount(s) of each HWPoison-ed folio. After
  dropping the refcounts, a HWPoison-ed folio should become free and can
  be disposed of. MFR naturally takes the responsibility for this,
  implemented as filemap_offline_hwpoison_folio. How the folio is
  disposed of depends on the type of the memory management system.
  Taking HugeTLB as an example, a HugeTLB folio is dissolved into a set
  of raw pages. The healthy raw pages can be reclaimed by the buddy
  allocator while the HWPoison-ed raw pages need to be taken off and
  prevented from future allocation, as implemented by
  filemap_offline_hwpoison_folio_hugetlb.
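
The outcome of the HugeTLB dissolve step can be illustrated with a toy
userspace model: every raw page of the folio is either returned to the
allocator or held back because it is HWPoison-ed. This is only a model
of the accounting, with made-up names; it does not reflect how
filemap_offline_hwpoison_folio_hugetlb is written.

```c
#include <stdbool.h>
#include <stddef.h>

struct dissolve_result {
	size_t freed;    /* healthy raw pages returned to the allocator */
	size_t isolated; /* HWPoison-ed raw pages kept out of allocation */
};

/*
 * Model of dissolving one hugepage folio into raw pages.
 * `poisoned[i]` is true when raw page i carries the HWPoison flag.
 */
struct dissolve_result dissolve_model(const bool *poisoned,
				      size_t nr_raw_pages)
{
	struct dissolve_result res = { 0, 0 };

	for (size_t i = 0; i < nr_raw_pages; i++) {
		if (poisoned[i])
			res.isolated++;
		else
			res.freed++;
	}
	return res;
}
```

For a 2M folio of 4K raw pages (512 entries) with two poisoned pages,
the model frees 510 pages and isolates 2, which is the capacity argument
behind deferring the hard offline until the memfd is released.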

This RFC includes the code patch to demonstrate the implementation for
HugeTLB.

In V2 I can probably offline each folio as it gets removed, instead of
doing this in a batch. The advantage is that we can get rid of
populate_memfd_hwp_folios and the linked list needed to store poisoned
folios. One way is to insert filemap_offline_hwpoison_folio somewhere in
folio_batch_release, or into each file system's free_folio handler.

Extensibility: Guest memfd
==========================

Guest memfd is going to be the future API used by a virtual machine
monitor (VMM) to allocate and configure memory for the guest, with the
better protections that are needed for confidential VMs. The current
MFR in guest memfd works as follows:
1. KVM unmaps all the GFNs that are backed by the HWPoison-ed folio from
   the stage-2 page table and invalidates the range in the TLB. This
   protects KVM / VM from causing poison consumption at the hardware
   level. On the other hand, if the folio backs a large number of GFNs,
   e.g. 1G HugeTLB, it is likely that the majority of the GFNs are still
   healthy but have been “offlined” together (a big hole in stage-2 and
   the guest memory region).
2. In reaction to a later fault on any part of the HWPoison-ed folio,
   guest memfd returns KVM_PFN_ERR_HWPOISON, and KVM sends SIGBUS to the
   VMM. This is good enough for GFNs backed by actually hardware
   corrupted PFNs, but not ideal for the healthy PFNs “offlined”
   together with the error PFNs.

The userspace MFR policy can be useful if the VMM wants KVM to behave
as follows:
1. Keep these GFNs mapped in the stage-2 page table.
2. In reaction to a later access to the actual hardware corrupted part
   of the HWPoison-ed folio, there is going to be a (repeated) poison
   consumption event, and KVM returns KVM_PFN_ERR_HWPOISON for the
   actual poisoned PFN.
3. In response to a later access to the still healthy part of the
   HWPoison-ed folio, the guest is able to access the memory quickly
   because the healthy PFNs are still in the stage-2 page table.
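
The difference between the current behavior and the behavior under the
userspace MFR policy can be summarized with a toy model of one folio
containing a single poisoned raw page. All names below are invented for
illustration; ACCESS_HWPOISON merely stands in for the
KVM_PFN_ERR_HWPOISON / SIGBUS path and is not a KVM constant.

```c
#include <stdbool.h>
#include <stddef.h>

enum access_result { ACCESS_OK, ACCESS_HWPOISON };

/* One folio with at most one poisoned raw page, for simplicity. */
struct folio_model {
	size_t nr_pages;
	bool has_poison;
	size_t poisoned_index;
};

/*
 * Outcome of a guest access to raw page `idx` of the folio.
 * Without the policy the whole folio is unmapped, so every access
 * fails; with the policy only the poisoned page itself fails.
 */
enum access_result access_model(const struct folio_model *folio,
				size_t idx, bool keep_ue_mapped)
{
	if (!folio->has_poison)
		return ACCESS_OK;
	if (!keep_ue_mapped)
		return ACCESS_HWPOISON;
	return idx == folio->poisoned_index ? ACCESS_HWPOISON : ACCESS_OK;
}
```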

This behavior is better from the PoV of capacity (if the folio contains
a large number of raw pages) and performance (if both the stage-1 and
stage-2 page table sizes are huge). However, it comes at the cost of the
risk of recurring poison consumption. The cost can be mitigated by
splitting the stage-2 page table with respect to the HWPoison-ed PFNs so
that stage-2 and the guest memory region only have smaller holes.

The UAPI for userspace MFR control via guest memfd can be through the
KVM_CREATE_GUEST_MEMFD IOCTL. It is easy to apply MFD_MF_KEEP_UE_MAPPED
to kvm_create_guest_memfd:

  static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
  {
    //...
    if (flags & MFD_MF_KEEP_UE_MAPPED)
      mapping_set_mf_keep_ue_mapped(inode->i_mapping);
    //...
  }

Of course, a full implementation requires more code changes in guest
memfd, e.g. __kvm_gmem_get_pfn, kvm_gmem_error_folio etc., especially
for VMs that are built on both guest memfd and HugeTLB. Feedback is
welcome before I put out an implementation.

Extensibility: VFIO Device Memory
=================================

In Cloud a significant amount of memory can be managed by certain VFIO
drivers, for example the GPU device memory, or EGM memory. As mentioned
before, the unmapping behavior in MFR becomes a concern if the VFIO
device driver supports the device memory mapping via huge VM_PFNMAP.

This RFC [4] proposes an MFR framework for VFIO device managed userspace
memory (i.e. memory regions mapped by remap_pfn_range). The userspace
MFR policy can instruct the device driver to keep all PFNs mapped in a
VMA (i.e. don’t unmap_mapping_range).

Of course, the above memfd uAPI (MFD_MF_KEEP_UE_MAPPED + memfd_create)
doesn’t work with VFIO device kernel drivers (as of today I don’t think
userspace can create a memfd with the path name to a VFIO device). I
don’t have a satisfying uAPI design, but here is what I considered, a
VFIO device-specific IOCTL:
* IOCTL to the VFIO Device File. The device driver usually exposes a
  file-like uAPI to its managed device memory (e.g. PCI MMIO BAR)
  directly with the file to the VFIO device. AS_MF_KEEP_UE_MAPPED can be
  placed in the address_space of the file to the VFIO device. The device
  driver can implement a specific IOCTL to the VFIO device file for
  userspace to set AS_MF_KEEP_UE_MAPPED.
* IOCTL to the Char File. The device driver can create a char device for
  its managed memory regions, then expose the file-like uAPI (open,
  close, mmap, unlocked_ioctl) with the created char device using
  cdev_init. AS_MF_KEEP_UE_MAPPED can be straightforwardly put into the
  address_space of the file to the char device. The device driver can
  implement a specific IOCTL to the char device file for userspace to
  set AS_MF_KEEP_UE_MAPPED.

What is common (and unsatisfactory) above is that every device driver
needs to add a device-specific IOCTL to support MFD_MF_KEEP_UE_MAPPED.
The timing of accepting the IOCTL also needs to be restricted to be
after file descriptor creation (e.g. VFIO_GROUP_GET_DEVICE_FD) and
before the first mmap request. I am still considering how to integrate
MFD_MF_KEEP_UE_MAPPED into the VFIO framework’s uAPI.
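
The timing restriction can be pictured as a tiny state machine on the
device file: the policy may be set any time after the file exists, but
not once the first mmap has happened. The model below is purely
illustrative; the struct, the function names, and the choice of -EBUSY
as the rejection error are assumptions of mine, not a proposed uAPI.

```c
#include <errno.h>
#include <stdbool.h>

/* Userspace model of the per-file state a driver would track. */
struct dev_file_model {
	bool mmapped;        /* set by the first mmap request */
	bool keep_ue_mapped; /* models AS_MF_KEEP_UE_MAPPED on the file */
};

/* Models the device-specific IOCTL that sets the policy. */
int model_set_keep_ue_mapped(struct dev_file_model *file)
{
	/* Too late: mappings may already rely on the default policy. */
	if (file->mmapped)
		return -EBUSY;
	file->keep_ue_mapped = true;
	return 0;
}

/* Models the driver's mmap handler recording the first mapping. */
void model_mmap(struct dev_file_model *file)
{
	file->mmapped = true;
}
```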

Extensibility: THP SHMEM/TMPFS
==============================

The current MFR behavior for THP SHMEM/TMPFS is to split the hugepage
into raw pages and only offline the raw HWPoison-ed page. In most cases
THP is 2M and the raw page size is 4K, so userspace loses the “huge”
property of a 2M huge memory, but the actual data loss is only 4K.

Using populate_memfd_hwp_folios and offline_memfd_hwp_folios, it is not
hard to implement AS_MF_KEEP_UE_MAPPED for THP so that a userspace
process retains the huge property of the hugepage when it is affected by
memory errors. However, this benefit is not as attractive as it is for
HugeTLB, and it is not implemented for now.

[1] https://lwn.net/Articles/991513
[2] https://lore.kernel.org/kvm/20240826204353.2228736-1-peterx@redhat.com/
[3] https://docs.google.com/presentation/d/1tWqcuAqeCLhfd47uXXLdu2SzolKu7WYvM03vEkbhobc/edit#slide=id.g3014a65d24b_0_0
[4] https://lore.kernel.org/linux-mm/20231123003513.24292-2-ankita@nvidia.com/

Jiaqi Yan (3):
  mm: memfd/hugetlb: introduce userspace memory failure recovery policy
  selftests/mm: test userspace MFR for HugeTLB 1G hugepage
  Documentation: add userspace MF recovery policy via memfd

 Documentation/userspace-api/index.rst         |   1 +
 .../userspace-api/mfd_mfr_policy.rst          |  55 ++++
 fs/hugetlbfs/inode.c                          |  16 ++
 include/linux/hugetlb.h                       |   7 +
 include/linux/pagemap.h                       |  43 +++
 include/uapi/linux/memfd.h                    |   1 +
 mm/filemap.c                                  |  78 +++++
 mm/hugetlb.c                                  |  20 +-
 mm/memfd.c                                    |  15 +-
 mm/memory-failure.c                           | 119 +++++++-
 tools/testing/selftests/mm/.gitignore         |   1 +
 tools/testing/selftests/mm/Makefile           |   1 +
 tools/testing/selftests/mm/hugetlb-mfr.c      | 267 ++++++++++++++++++
 13 files changed, 607 insertions(+), 17 deletions(-)
 create mode 100644 Documentation/userspace-api/mfd_mfr_policy.rst
 create mode 100644 tools/testing/selftests/mm/hugetlb-mfr.c