[v1,0/4] hugetlbfs memory HW error fixes

Message ID 20241022213503.1189954-1-william.roche@oracle.com (mailing list archive)

William Roche Oct. 22, 2024, 9:34 p.m. UTC
From: William Roche <william.roche@oracle.com>

This set of patches fixes several problems with hardware memory errors
impacting hugetlbfs memory backed VMs. When using hugetlbfs large
pages, a HW memory error anywhere within a large page poisons the
entire page, suddenly making a large chunk of the VM memory unusable.

The main problem that currently exists in QEMU is the lack of backend
file repair before resetting the VM memory, which leaves the impacted
memory silently unusable even after a VM reboot.

In order to fix this issue, we track the SIGBUS page size information
when informed of a HW error (with si_addr_lsb) and record the size with
the appropriate poisoned page location. On recording a large page
position, we take note of the beginning of the page and its size. The
size information is taken from the backend file page_size value.
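
As an illustration of the si_addr_lsb arithmetic only, here is a
minimal sketch, not the code from this series. It assumes the handler
received a SIGBUS with si_code BUS_MCEERR_AR or BUS_MCEERR_AO, where
si_addr_lsb gives the least significant bit of the reported address
and therefore the extent of the corruption. The struct and function
names are hypothetical. The series itself takes the recorded size from
the backend file page_size value.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one poisoned range (not QEMU's structure). */
struct poison_range {
    uintptr_t start;  /* address aligned down to the page boundary */
    size_t    size;   /* size of the affected page in bytes */
};

/*
 * Derive the poisoned range from the faulting address and si_addr_lsb.
 * The corrupted range spans (1 << lsb) bytes, so the start of the page
 * is the address with its low 'lsb' bits cleared.
 */
static struct poison_range poison_range_from_lsb(uintptr_t addr, short lsb)
{
    struct poison_range r;

    r.size = (size_t)1 << lsb;
    r.start = addr & ~((uintptr_t)r.size - 1);
    return r;
}
```

In a real handler, addr would be (uintptr_t)info->si_addr and lsb would
be info->si_addr_lsb. For a 2M hugetlbfs page, lsb is expected to be 21.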

The series also reports the impact of losing a large page of memory.
This is reported only once, when the page is poisoned, to make these
situations easier to debug.

This code is scripts/checkpatch.pl clean.
'make check' runs fine on both x86 and ARM.
Unit tests have been successfully run on x86,
but the ARM VM doesn't cope with several errors on different memory
locations triggered in quick succession (which is the case when a
hugetlbfs page is poisoned): it either aborts after a
"failed to record the error" message or becomes unresponsive.


William Roche (4):
  accel/kvm: SIGBUS handler should also deal with si_addr_lsb
  accel/kvm: Keep track of the HWPoisonPage page_size
  system/physmem: Largepage punch hole before reset of memory pages
  accel/kvm: Report the loss of a large memory page

 accel/kvm/kvm-all.c       | 27 +++++++++++++++++++++------
 accel/stubs/kvm-stub.c    |  4 ++--
 include/exec/cpu-common.h |  1 +
 include/qemu/osdep.h      |  5 +++--
 include/sysemu/kvm.h      |  7 ++++---
 include/sysemu/kvm_int.h  |  7 +++++--
 system/cpus.c             |  6 ++++--
 system/physmem.c          | 28 ++++++++++++++++++++++++++++
 target/arm/kvm.c          |  8 ++++++--
 target/i386/kvm/kvm.c     |  8 ++++++--
 util/oslib-posix.c        |  3 +++
 11 files changed, 83 insertions(+), 21 deletions(-)