diff mbox series

[RFC,v1] KVM: arm64: Introduce KVM_CAP_ARM_SIGBUS_ON_SEA

Message ID 20241031212104.1429609-1-jiaqiyan@google.com (mailing list archive)
State New
Headers show
Series [RFC,v1] KVM: arm64: Introduce KVM_CAP_ARM_SIGBUS_ON_SEA | expand

Commit Message

Jiaqi Yan Oct. 31, 2024, 9:21 p.m. UTC
Currently KVM handles SEA in guest by injecting async SError into
guest directly, bypassing VMM, usually results in guest kernel panic.

One major situation of guest SEA is when vCPU consumes uncorrectable
memory error on the physical memory. Although SError and guest kernel
panic effectively stops the propagation of corrupted memory, it is not
easy for VMM and guest to recover from memory error in a more graceful
manner.

Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like
how core kernel signals SIGBUS BUS_OBJERR to the poison consuming
thread.
In addition to the benifit that KVM's handling for SEA becomes aligned
with core kernel behavior
- The blast radius in VM can be limited to only the consuming thread
  in guest, instead of entire guest kernel, unless the consumption is
  from guest kernel.
- VMM now has the chance to do its duties to stop the VM from repeatedly
  consuming corrupted data. For example, VMM can unmap the guest page
  from stage-2 table to intercept forseen memory poison consumption,
  and for every consumption injects SEA to EL1 with synthetic memory
  error CPER.

Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM
can opt in this new capability if it prefers SIGBUS than SError
injection during VM init. Now SEA handling in KVM works as follows:
1. Delegate to APEI/GHES to see if SEA can be claimed by them.
2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is
   enabled for the VM, and the SEA is NOT about translation table,
   send SIGBUS BUS_OBJERR signal with host virtual address.
3. Otherwise directly inject async SError to guest.

Tested on a machine running Siryn AmpereOne processor. A in-house VMM
that opts in KVM_CAP_ARM_SIGBUS_ON_SEA started a VM. A dummy application
in VM allocated some memory buffer. The test used EINJ to inject an
uncorrectable recoverable memory error at a page in the allocated memory
buffer. The dummy application then consumed the memory error. Some hack
was done to make core kernel's memory_failure triggered by poison
generation to fail, so KVM had to deal with the SEA guest abort due to
poison consumption. vCPU thread in VMM received SIGBUS BUS_OBJERR with
valid host virtual address of the poisoned page. VMM then injected a SEA
into guest using KVM_SET_VCPU_EVENTS with ext_dabt_pending=1. At last
the dummy application in guest was killed by SIGBUS BUS_OBJERR, while the
guest survived and continued to run.

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 arch/arm64/include/asm/kvm_host.h |  2 +
 arch/arm64/include/asm/kvm_ras.h  | 20 ++++----
 arch/arm64/kvm/Makefile           |  2 +-
 arch/arm64/kvm/arm.c              |  5 ++
 arch/arm64/kvm/kvm_ras.c          | 77 +++++++++++++++++++++++++++++++
 arch/arm64/kvm/mmu.c              |  8 +---
 include/uapi/linux/kvm.h          |  1 +
 7 files changed, 98 insertions(+), 17 deletions(-)
 create mode 100644 arch/arm64/kvm/kvm_ras.c

Comments

Oliver Upton Oct. 31, 2024, 9:46 p.m. UTC | #1
Hi Jiaqi,

Thank you for sending this out.

On Thu, Oct 31, 2024 at 09:21:04PM +0000, Jiaqi Yan wrote:
> Currently KVM handles SEA in guest by injecting async SError into
> guest directly, bypassing VMM, usually results in guest kernel panic.
> 
> One major situation of guest SEA is when vCPU consumes uncorrectable
> memory error on the physical memory. Although SError and guest kernel
> panic effectively stops the propagation of corrupted memory, it is not
> easy for VMM and guest to recover from memory error in a more graceful
> manner.
> 
> Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like
> how core kernel signals SIGBUS BUS_OBJERR to the poison consuming
> thread.
> In addition to the benifit that KVM's handling for SEA becomes aligned
> with core kernel behavior
> - The blast radius in VM can be limited to only the consuming thread
>   in guest, instead of entire guest kernel, unless the consumption is
>   from guest kernel.
> - VMM now has the chance to do its duties to stop the VM from repeatedly
>   consuming corrupted data. For example, VMM can unmap the guest page
>   from stage-2 table to intercept forseen memory poison consumption,
>   and for every consumption injects SEA to EL1 with synthetic memory
>   error CPER.
> 
> Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM
> can opt in this new capability if it prefers SIGBUS than SError
> injection during VM init. Now SEA handling in KVM works as follows:

I'm somewhat tempted to force the new behavior on userspace
unconditionally. Working back from an unexpected SError in the VM to the
KVM SEA handler is a bit of a mess, and can be annoying if the operator
can't access console logs of the VM.

As it stands today, UAPI expectations around SEAs are platform
dependent. If APEI claims the SEA and decides to offline a page, the
user will get a SIGBUS.

So sending a SIGBUS for the case that firmware _doesn't_ claim the SEA
seems like a good move from a consistency PoV. But it is a decently-sized
change to do without explicit buy-in from userspace so let's see what
others think.

> 1. Delegate to APEI/GHES to see if SEA can be claimed by them.
> 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is
>    enabled for the VM, and the SEA is NOT about translation table,
>    send SIGBUS BUS_OBJERR signal with host virtual address.
> 3. Otherwise directly inject async SError to guest.

The other reason I'm a bit lukewarm on user buy in is the UAPI suffers
from the same issue we do today: it depends on the platform. If the SEA
is claimed by APEI/GHES then the cap does nothing.

> +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr)
> +{
> +	/* apei_claim_sea(NULL) expects to mask interrupts itself */
> +	lockdep_assert_irqs_enabled();
> +	return apei_claim_sea(NULL);
> +}

Consider dropping parameters from this since they're unused.

> +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> +{
> +	bool sigbus_on_sea;
> +	int idx;
> +	u64 vcpu_esr = kvm_vcpu_get_esr(vcpu);
> +	u8 fsc = kvm_vcpu_trap_get_fault(vcpu);
> +	phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
> +	gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> +	/* When FnV is set, send 0 as si_addr like what do_sea() does. */
> +	unsigned long hva = 0UL;
> +
> +	/*
> +	 * For RAS the host kernel may handle this abort.
> +	 * There is no need to SIGBUS VMM, or pass the error into the guest.
> +	 */
> +	if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0)
> +		return;
> +
> +	sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA,
> +				 &(vcpu->kvm->arch.flags));
> +
> +	/*
> +	 * In addition to userspace opt-in, SIGBUS only makes sense if the
> +	 * abort is NOT about translation table walk and NOT about hardware
> +	 * update of translation table.
> +	 */
> +	sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC);

Is this because we potentially can't determine a valid HVA for the
fault? Maybe these should go out to userspace still with si_addr = 0.
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index bf64fed9820ea..eb37a2489411a 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -334,6 +334,8 @@  struct kvm_arch {
 	/* Fine-Grained UNDEF initialised */
 #define KVM_ARCH_FLAG_FGU_INITIALIZED			8
 	unsigned long flags;
+	/* Instead of injecting SError into guest, SIGBUS VMM */
+#define KVM_ARCH_FLAG_SIGBUS_ON_SEA			9
 
 	/* VM-wide vCPU feature set */
 	DECLARE_BITMAP(vcpu_features, KVM_VCPU_MAX_FEATURES);
diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h
index 87e10d9a635b5..4bb7a424e3f6c 100644
--- a/arch/arm64/include/asm/kvm_ras.h
+++ b/arch/arm64/include/asm/kvm_ras.h
@@ -11,15 +11,17 @@ 
 #include <asm/acpi.h>
 
 /*
- * Was this synchronous external abort a RAS notification?
- * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
+ * Handle synchronous external abort (SEA) in the following order:
+ * 1. Delegate to APEI/GHES to see if SEA can be claimed by them. If so, we
+ *    are all done.
+ * 2. If userspace opts in KVM_CAP_ARM_SIGBUS_ON_SEA, and if the SEA is NOT
+ *    about translation table, send SIGBUS
+ *    - si_code is BUS_OBJERR.
+ *    - si_addr will be 0 when accurate HVA is unavailable.
+ * 3. Otherwise, directly inject an async SError to guest.
+ *
+ * Note this applies to both ESR_ELx_EC_IABT_* and ESR_ELx_EC_DABT_*.
  */
-static inline int kvm_handle_guest_sea(phys_addr_t addr, u64 esr)
-{
-	/* apei_claim_sea(NULL) expects to mask interrupts itself */
-	lockdep_assert_irqs_enabled();
-
-	return apei_claim_sea(NULL);
-}
+void kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
 
 #endif /* __ARM64_KVM_RAS_H__ */
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 3cf7adb2b5038..c4a3a6d4870e6 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -23,7 +23,7 @@  kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
 	 vgic/vgic-v3.o vgic/vgic-v4.o \
 	 vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
 	 vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
-	 vgic/vgic-its.o vgic/vgic-debug.o
+	 vgic/vgic-its.o vgic/vgic-debug.o kvm_ras.o
 
 kvm-$(CONFIG_HW_PERF_EVENTS)  += pmu-emul.o pmu.o
 kvm-$(CONFIG_ARM64_PTR_AUTH)  += pauth.o
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 48cafb65d6acf..bb97ad678dbec 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -151,6 +151,10 @@  int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		}
 		mutex_unlock(&kvm->slots_lock);
 		break;
+	case KVM_CAP_ARM_SIGBUS_ON_SEA:
+		r = 0;
+		set_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, &kvm->arch.flags);
+		break;
 	default:
 		break;
 	}
@@ -339,6 +343,7 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ARM_SYSTEM_SUSPEND:
 	case KVM_CAP_IRQFD_RESAMPLE:
 	case KVM_CAP_COUNTER_OFFSET:
+	case KVM_CAP_ARM_SIGBUS_ON_SEA:
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/arm64/kvm/kvm_ras.c b/arch/arm64/kvm/kvm_ras.c
new file mode 100644
index 0000000000000..3225462bcbcda
--- /dev/null
+++ b/arch/arm64/kvm/kvm_ras.c
@@ -0,0 +1,77 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/bitops.h>
+#include <linux/kvm_host.h>
+
+#include <asm/kvm_emulate.h>
+#include <asm/kvm_ras.h>
+#include <asm/system_misc.h>
+
+/*
+ * For synchrnous external instruction or data abort, not on translation
+ * table walk or hardware update of translation table, is FAR_EL2 valid?
+ */
+static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu)
+{
+	return !(vcpu->arch.fault.esr_el2 & ESR_ELx_FnV);
+}
+
+/*
+ * Was this synchronous external abort a RAS notification?
+ * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
+ */
+static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr)
+{
+	/* apei_claim_sea(NULL) expects to mask interrupts itself */
+	lockdep_assert_irqs_enabled();
+	return apei_claim_sea(NULL);
+}
+
+void kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
+{
+	bool sigbus_on_sea;
+	int idx;
+	u64 vcpu_esr = kvm_vcpu_get_esr(vcpu);
+	u8 fsc = kvm_vcpu_trap_get_fault(vcpu);
+	phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
+	gfn_t gfn = fault_ipa >> PAGE_SHIFT;
+	/* When FnV is set, send 0 as si_addr like what do_sea() does. */
+	unsigned long hva = 0UL;
+
+	/*
+	 * For RAS the host kernel may handle this abort.
+	 * There is no need to SIGBUS VMM, or pass the error into the guest.
+	 */
+	if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0)
+		return;
+
+	sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA,
+				 &(vcpu->kvm->arch.flags));
+
+	/*
+	 * In addition to userspace opt-in, SIGBUS only makes sense if the
+	 * abort is NOT about translation table walk and NOT about hardware
+	 * update of translation table.
+	 */
+	sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC);
+
+	/* Pass the error directly into the guest. */
+	if (!sigbus_on_sea) {
+		kvm_inject_vabt(vcpu);
+		return;
+	}
+
+	if (kvm_vcpu_sea_far_valid(vcpu)) {
+		idx = srcu_read_lock(&vcpu->kvm->srcu);
+		hva = gfn_to_hva(vcpu->kvm, gfn);
+		srcu_read_unlock(&vcpu->kvm->srcu, idx);
+	}
+
+	/*
+	 * Send a SIGBUS BUS_OBJERR to vCPU thread (the userspace thread that
+	 * runs KVM_RUN) or VMM, which aligns with what host kernel do_sea()
+	 * does if apei_claim_sea() fails.
+	 */
+	arm64_notify_die("synchronous external abort",
+			 current_pt_regs(), SIGBUS, BUS_OBJERR, hva, vcpu_esr);
+}
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index a71fe6f6bd90f..f5335953827ec 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1766,13 +1766,7 @@  int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
 
 	/* Synchronous External Abort? */
 	if (kvm_vcpu_abt_issea(vcpu)) {
-		/*
-		 * For RAS the host kernel may handle this abort.
-		 * There is no need to pass the error into the guest.
-		 */
-		if (kvm_handle_guest_sea(fault_ipa, kvm_vcpu_get_esr(vcpu)))
-			kvm_inject_vabt(vcpu);
-
+		kvm_handle_guest_sea(vcpu);
 		return 1;
 	}
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 637efc0551453..fe3ca787e72fa 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -933,6 +933,7 @@  struct kvm_enable_cap {
 #define KVM_CAP_PRE_FAULT_MEMORY 236
 #define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
 #define KVM_CAP_X86_GUEST_MODE 238
+#define KVM_CAP_ARM_SIGBUS_ON_SEA 239
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;