[RFC,RESEND] kvm: arm64: export memory error recovery capability to user space

Message ID 1544782537-13377-1-git-send-email-gengdongjiu@huawei.com
State New

Commit Message

gengdongjiu Dec. 14, 2018, 10:15 a.m. UTC
When user space performs memory error recovery, it checks whether both KVM
and the guest support error recovery; only when both do will user space
perform the recovery. This patch exports this KVM capability to user
space.

Cc: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com>
---
Having user space check this KVM capability was suggested by Peter[1].
This patch carries the RFC tag because the user space patches are still
under review, so this kernel patch is sent out first for review.

[1]: https://patchwork.codeaurora.org/patch/652261/
---
 Documentation/virtual/kvm/api.txt | 9 +++++++++
 arch/arm64/kvm/reset.c            | 1 +
 include/uapi/linux/kvm.h          | 1 +
 3 files changed, 11 insertions(+)

Comments

James Morse Dec. 14, 2018, 1:55 p.m. UTC | #1
Hi Dongjiu Geng,

On 14/12/2018 10:15, Dongjiu Geng wrote:
> When user space do memory recovery, it will check whether KVM and
> guest support the error recovery, only when both of them support,
> user space will do the error recovery. This patch exports this
> capability of KVM to user space.

I can understand user-space only wanting to do the work if host and guest
support the feature. But 'error recovery' isn't a KVM feature, it's a Linux
kernel feature.

KVM will send its user-space a SIGBUS with an MCEERR code whenever it tries to
map a page at stage2 and the kernel-mm code refuses because the page is poisoned.
(e.g. check_user_page_hwpoison(), get_user_pages() returns -EHWPOISON)

This is exactly the same as happens to a normal user-space process.

I think you really want to know if the host kernel was built with
CONFIG_MEMORY_FAILURE. The not-at-all-portable way to tell this from user-space
is the presence of /proc/sys/vm/memory_failure_* files.
(It looks like the prctl():PR_MCE_KILL/PR_MCE_KILL_GET options silently update
an ignored policy if the kernel isn't built with CONFIG_MEMORY_FAILURE, so they
aren't helpful)
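
A minimal sketch of the not-at-all-portable check described above, assuming the two sysctl files to probe are memory_failure_early_kill and memory_failure_recovery. The function names are hypothetical, and the result is only a heuristic about how the host kernel was built:

```c
#include <stdbool.h>
#include <unistd.h>

/* Returns true if the given path exists (readability is not checked). */
bool sysctl_file_exists(const char *path)
{
	return access(path, F_OK) == 0;
}

/*
 * Heuristic only: treat the presence of either memory_failure_* sysctl
 * as evidence that the running kernel has CONFIG_MEMORY_FAILURE.
 */
bool host_has_memory_failure(void)
{
	return sysctl_file_exists("/proc/sys/vm/memory_failure_early_kill") ||
	       sysctl_file_exists("/proc/sys/vm/memory_failure_recovery");
}
```

A caller would use host_has_memory_failure() once at start-up. A false result can also mean /proc is not mounted, so it should not be treated as authoritative.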


> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index cd209f7..241e2e2 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -4895,3 +4895,12 @@ Architectures: x86
>  This capability indicates that KVM supports paravirtualized Hyper-V IPI send
>  hypercalls:
>  HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> +
> +8.21 KVM_CAP_ARM_MEMORY_ERROR_RECOVERY
> +
> +Architectures: arm, arm64
> +
> +This capability indicates that guest memory error can be detected by the KVM which
> +supports the error recovery.

KVM doesn't detect these errors.
The hardware detects them and notifies the OS via one of a number of mechanisms.
This gets plumbed into memory_failure(), which sets a flag that the mm code uses
to prevent the page being used again.

KVM is only involved when it tries to map a page at stage2 and the mm code
rejects it with -EHWPOISON. This is the same as the architecture's
do_page_fault() checking for (fault & VM_FAULT_HWPOISON) out of
handle_mm_fault(). We don't have a KVM cap for this, nor do we need one.


> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index b72a3dd..90d1d9a 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -82,6 +82,7 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  		r = kvm_arm_support_pmu_v3();
>  		break;
>  	case KVM_CAP_ARM_INJECT_SERROR_ESR:
> +	case KVM_CAP_ARM_MEMORY_ERROR_RECOVERY:
>  		r = cpus_have_const_cap(ARM64_HAS_RAS_EXTN);
>  		break;

The CPU RAS Extensions are not at all relevant here. It is perfectly possible to
support memory-failure without them, AMD-Seattle and APM-X-Gene do this. These
systems would report not-supported here, but the kernel does support this stuff.
Just because the CPU supports this, doesn't mean the kernel was built with
CONFIG_MEMORY_FAILURE. The CPU reports may be ignored, or upgraded to SIGKILL.



Thanks,

James
Peter Maydell Dec. 14, 2018, 2:33 p.m. UTC | #2
On Fri, 14 Dec 2018 at 13:56, James Morse <james.morse@arm.com> wrote:
>
> Hi Dongjiu Geng,
>
> On 14/12/2018 10:15, Dongjiu Geng wrote:
> > When user space do memory recovery, it will check whether KVM and
> > guest support the error recovery, only when both of them support,
> > user space will do the error recovery. This patch exports this
> > capability of KVM to user space.
>
> I can understand user-space only wanting to do the work if host and guest
> support the feature. But 'error recovery' isn't a KVM feature, it's a Linux
> kernel feature.
>
> KVM will send its user-space a SIGBUS with an MCEERR code whenever it tries to
> map a page at stage2 and the kernel-mm code refuses because the page is poisoned.
> (e.g. check_user_page_hwpoison(), get_user_pages() returns -EHWPOISON)
>
> This is exactly the same as happens to a normal user-space process.
>
> I think you really want to know if the host kernel was built with
> CONFIG_MEMORY_FAILURE.

Does userspace need to care about that? Presumably if the host kernel
wasn't built with that support then it will simply never deliver
any memory failure events to QEMU, which is fine.

The point I was trying to make in the email Dongjiu references
(https://patchwork.codeaurora.org/patch/652261/) is simply that
"QEMU gets memory-failure notifications from the host kernel"
does not imply "the guest is prepared to receive memory
failure notifications", and so the code path which handles
the SIGBUS must do some kind of check for whether the guest
CPU is a type which expects them and that the board code
set up the ACPI tables that it wants to fill in.
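
The check described here could be sketched roughly as follows. This is an illustration only: the struct, its fields, and the function name are hypothetical and are not taken from QEMU.

```c
#include <stdbool.h>

/* Hypothetical per-VM state; neither field name comes from QEMU. */
struct guest_ras_state {
	bool cpu_expects_errors;  /* guest CPU type expects memory-failure notifications */
	bool ghes_tables_present; /* board code set up the ACPI tables to fill in */
};

/*
 * Decide in the SIGBUS path whether a host memory-failure notification
 * can be forwarded to the guest. Both conditions must hold.
 */
bool guest_can_take_memory_error(const struct guest_ras_state *s)
{
	return s->cpu_expects_errors && s->ghes_tables_present;
}
```

The point of the sketch is that the decision depends only on guest-side state, not on anything reported by KVM.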

thanks
-- PMM
gengdongjiu Dec. 14, 2018, 10:31 p.m. UTC | #3
Hi James,

      Thanks for the mail and comments, I will reply to you in the next mail.
James Morse Dec. 17, 2018, 3:55 p.m. UTC | #4
Hi Peter,

On 14/12/2018 14:33, Peter Maydell wrote:
> On Fri, 14 Dec 2018 at 13:56, James Morse <james.morse@arm.com> wrote:
>> On 14/12/2018 10:15, Dongjiu Geng wrote:
>>> When user space do memory recovery, it will check whether KVM and
>>> guest support the error recovery, only when both of them support,
>>> user space will do the error recovery. This patch exports this
>>> capability of KVM to user space.
>>
>> I can understand user-space only wanting to do the work if host and guest
>> support the feature. But 'error recovery' isn't a KVM feature, it's a Linux
>> kernel feature.
>>
>> KVM will send its user-space a SIGBUS with an MCEERR code whenever it tries to
>> map a page at stage2 and the kernel-mm code refuses because the page is poisoned.
>> (e.g. check_user_page_hwpoison(), get_user_pages() returns -EHWPOISON)
>>
>> This is exactly the same as happens to a normal user-space process.
>>
>> I think you really want to know if the host kernel was built with
>> CONFIG_MEMORY_FAILURE.
> 
> Does userspace need to care about that? Presumably if the host kernel
> wasn't built with that support then it will simply never deliver
> any memory failure events to QEMU, which is fine.

Aha, I thought this is what you wanted.
Always being prepared to handle the signals is the best choice.


> The point I was trying to make in the email Dongjiu references
> (https://patchwork.codeaurora.org/patch/652261/) is simply that
> "QEMU gets memory-failure notifications from the host kernel"
> does not imply "the guest is prepared to receive memory
> failure notifications", and so the code path which handles
> the SIGBUS must do some kind of check for whether the guest

> CPU is a type which expects them

I don't understand this bit.

The CPU support is just about barriers for containment and reporting a
standardised classification to software. Firmware-first replaces all this. It
doesn't depend on any CPU feature.
APM-X-Gene has firmware-first support, it uses some kind of external processor
that takes the error-interrupt from DRAM and generates CPER records, before
triggering the firmware-first notification.

> and that the board code
> set up the ACPI tables that it wants to fill in.

ACPI has some complex stuff around claiming 'platform-wide capabilities'. Qemu
could use this to know if the guest understands APEI.

Section 6.2.11.2 "Platform-Wide OSPM Capabilities" of ACPI v6.2 describes the
\_SB._OSC method, which has an APEI support bit. This is used in some kind of
handshake.
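
As a rough illustration of what testing that bit could look like, here is a sketch. It assumes the APEI support bit is bit 4 of the platform-wide support DWORD (the value Linux names OSC_SB_APEI_SUPPORT); the function name is hypothetical, and how an emulator would obtain the DWORD from the guest's _OSC call is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed: APEI support is bit 4 of the platform-wide _OSC support DWORD. */
#define OSC_SB_APEI_SUPPORT 0x00000010u

/* Returns true if the guest's _OSC support DWORD claims APEI support. */
bool osc_guest_claims_apei(uint32_t osc_support_dword)
{
	return (osc_support_dword & OSC_SB_APEI_SUPPORT) != 0;
}
```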

Linux does this during boot if it's built with APEI GHES support. Linux seems to
think the APEI bit enables firmware-first:
| [   63.804907] GHES: APEI firmware first mode is enabled by APEI bit.

... but it's not clear from the spec. (APEI is more than firmware-first)

(where do these things go? Platform AML in the DSDT)


I don't think this controls anything on a real system, (we've seen X-Gene
generate CPER records before Linux started booting), and I don't think it really
matters as 'what happens if the guest doesn't know' falls out of the way these
SIGBUS codes map back onto the firmware-first notifications:

For 'AO' signals you can dump CPER records in a NOTIFY_POLLed area. If the guest
doesn't care, it can avert its eyes. If you used one of the NOTIFY_$(interrupt)
types, the guest can not-register the interrupt.

The AR signals map to external-abort. On a firmware-first system EL3 takes
these, generates some extra metadata using CPER records in the agreed location,
and re-injects an emulated external-abort.
If Qemu takes an AR signal, this is effectively an external-abort, the page has
been accessed and the kernel will not map it because the page is poisoned. These
would have been an external-abort on a real system, so it's not a problem if the
guest doesn't know about the extra CPER metadata.

Centriq is an example of a system that does this external-abort+CPER-metadata
without the v8.2 CPU extensions.

All v8.0 CPUs have synchronous/asynchronous external abort, there is nothing new
going on here, it's just extra metadata. (critically: the physical address of the
fault)


Thanks,

James

Patch

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index cd209f7..241e2e2 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -4895,3 +4895,12 @@  Architectures: x86
 This capability indicates that KVM supports paravirtualized Hyper-V IPI send
 hypercalls:
 HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
+
+8.21 KVM_CAP_ARM_MEMORY_ERROR_RECOVERY
+
+Architectures: arm, arm64
+
+This capability indicates that guest memory error can be detected by the KVM which
+supports the error recovery. When user space do recovery, such as QEMU, it will
+check whether KVM and guest support memory error recovery, only when both of them
+support, user space will do the error recovery.
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index b72a3dd..90d1d9a 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -82,6 +82,7 @@  int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		r = kvm_arm_support_pmu_v3();
 		break;
 	case KVM_CAP_ARM_INJECT_SERROR_ESR:
+	case KVM_CAP_ARM_MEMORY_ERROR_RECOVERY:
 		r = cpus_have_const_cap(ARM64_HAS_RAS_EXTN);
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG:
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2b7a652..3b19580 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -975,6 +975,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_HYPERV_ENLIGHTENED_VMCS 163
 #define KVM_CAP_EXCEPTION_PAYLOAD 164
 #define KVM_CAP_ARM_VM_IPA_SIZE 165
+#define KVM_CAP_ARM_MEMORY_ERROR_RECOVERY 166
 
 #ifdef KVM_CAP_IRQ_ROUTING