Message ID | 20231027182217.3615211-10-seanjc@google.com (mailing list archive) |
---|---|
State | New |
Series | KVM: guest_memfd() and per-page attributes |
On 10/27/23 20:21, Sean Christopherson wrote:
> From: Chao Peng <chao.p.peng@linux.intel.com>
>
> Add a new KVM exit type to allow userspace to handle memory faults that
> KVM cannot resolve, but that userspace *may* be able to handle (without
> terminating the guest).
>
> KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
> conversions between private and shared memory. With guest private memory,
> there will be two kinds of memory conversions:
>
>   - explicit conversion: happens when the guest explicitly calls into KVM
>     to map a range (as private or shared)
>
>   - implicit conversion: happens when the guest attempts to access a gfn
>     that is configured in the "wrong" state (private vs. shared)
>
> On x86 (first architecture to support guest private memory), explicit
> conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
> but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
> as there is (obviously) no hypercall, and there is no guarantee that the
> guest actually intends to convert between private and shared, i.e. what
> KVM thinks is an implicit conversion "request" could actually be the
> result of a guest code bug.
>
> KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
> be implicit conversions.
>
> Note! To allow for future possibilities where KVM reports
> KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
> fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
> perspective), not '0'! Due to historical baggage within KVM, exiting to
> userspace with '0' from deep callstacks, e.g. in emulation paths, is
> infeasible as doing so would require a near-complete overhaul of KVM,
> whereas KVM already propagates -errno return codes to userspace even when
> the -errno originated in a low level helper.
>
> Report the gpa+size instead of a single gfn even though the initial usage
> is expected to always report single pages. It's entirely possible, likely
> even, that KVM will someday support sub-page granularity faults, e.g.
> Intel's sub-page protection feature allows for additional protections at
> 128-byte granularity.
>
> Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoorthy@google.com
> Link: https://lore.kernel.org/all/ZQ3AmLO2SYv3DszH@google.com
> Cc: Anish Moorthy <amoorthy@google.com>
> Cc: David Matlack <dmatlack@google.com>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

> ---
>  Documentation/virt/kvm/api.rst | 41 ++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/x86.c             |  1 +
>  include/linux/kvm_host.h       | 11 +++++++++
>  include/uapi/linux/kvm.h       |  8 +++++++
>  4 files changed, 61 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index ace984acc125..860216536810 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6723,6 +6723,26 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory;
> +
> +KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
> +could not be resolved by KVM. The 'gpa' and 'size' (in bytes) describe the
> +guest physical address range [gpa, gpa + size) of the fault. The 'flags' field
> +describes properties of the faulting access that are likely pertinent.
> +Currently, no flags are defined.
> +
> +Note! KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
> +accompanies a return code of '-1', not '0'! errno will always be set to EFAULT
> +or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
> +kvm_run.exit_reason is stale/undefined for all other error numbers.
> +
>  ::
>
>  		/* KVM_EXIT_NOTIFY */
> @@ -7757,6 +7777,27 @@ This capability is aimed to mitigate the threat that malicious VMs can
>  cause CPU stuck (due to event windows don't open up) and make the CPU
>  unavailable to host or other VMs.
>
> +7.34 KVM_CAP_MEMORY_FAULT_INFO
> +------------------------------
> +
> +:Architectures: x86
> +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> +
> +The presence of this capability indicates that KVM_RUN will fill
> +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
> +there is a valid memslot but no backing VMA for the corresponding host virtual
> +address.
> +
> +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
> +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
> +to KVM_EXIT_MEMORY_FAULT.
> +
> +Note: Userspaces which attempt to resolve memory faults so that they can retry
> +KVM_RUN are encouraged to guard against repeatedly receiving the same
> +error/annotated fault.
> +
> +See KVM_EXIT_MEMORY_FAULT for more information.
> +
>  8. Other capabilities.
>  ======================
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6409914428ca..ee3cd8c3c0ef 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4518,6 +4518,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_ENABLE_CAP:
>  	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
>  	case KVM_CAP_IRQFD_RESAMPLE:
> +	case KVM_CAP_MEMORY_FAULT_INFO:
>  		r = 1;
>  		break;
>  	case KVM_CAP_EXIT_HYPERCALL:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4e741ff27af3..96aa930536b1 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2327,4 +2327,15 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>  /* Max number of entries allowed for each kvm dirty ring */
>  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>
> +static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> +						 gpa_t gpa, gpa_t size)
> +{
> +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +	vcpu->run->memory_fault.gpa = gpa;
> +	vcpu->run->memory_fault.size = size;
> +
> +	/* Flags are not (yet) defined or communicated to userspace. */
> +	vcpu->run->memory_fault.flags = 0;
> +}
> +
>  #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index bd1abe067f28..7ae9987b48dd 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -274,6 +274,7 @@ struct kvm_xen_exit {
>  #define KVM_EXIT_RISCV_SBI        35
>  #define KVM_EXIT_RISCV_CSR        36
>  #define KVM_EXIT_NOTIFY           37
> +#define KVM_EXIT_MEMORY_FAULT     38
>
>  /* For KVM_EXIT_INTERNAL_ERROR */
>  /* Emulate instruction failed. */
> @@ -520,6 +521,12 @@ struct kvm_run {
>  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>  			__u32 flags;
>  		} notify;
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory_fault;
>  		/* Fix the size of the union. */
>  		char padding[256];
>  	};
> @@ -1203,6 +1210,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
>  #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
>  #define KVM_CAP_USER_MEMORY2 230
> +#define KVM_CAP_MEMORY_FAULT_INFO 231
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
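The api.rst text in the patch describes the reported fault as the half-open guest physical range [gpa, gpa + size), and the commit message notes that sub-page faults (e.g. 128 bytes) may be reported someday. As a minimal userspace-side sketch of what that implies, the helpers below test whether an address falls in the range and which guest frames it touches. The struct is a local mirror of the three fields in the patch; the helper names and the 4 KiB page size are assumptions made for illustration and are not part of the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB guest pages, used only to derive frame numbers. */
#define SKETCH_PAGE_SHIFT 12

/* Local mirror of the kvm_run.memory_fault fields added by the patch. */
struct sketch_memory_fault {
	uint64_t flags;
	uint64_t gpa;
	uint64_t size;
};

/* True if addr lies in [gpa, gpa + size); written to avoid gpa + size overflow. */
static bool fault_contains(const struct sketch_memory_fault *f, uint64_t addr)
{
	return addr >= f->gpa && addr - f->gpa < f->size;
}

/* First guest frame number touched by the reported range. */
static uint64_t fault_first_gfn(const struct sketch_memory_fault *f)
{
	return f->gpa >> SKETCH_PAGE_SHIFT;
}

/* Last guest frame number touched by the reported range (requires size > 0). */
static uint64_t fault_last_gfn(const struct sketch_memory_fault *f)
{
	assert(f->size > 0);
	return (f->gpa + f->size - 1) >> SKETCH_PAGE_SHIFT;
}
```

With a single-page report (gpa 0x1000, size 0x1000) the first and last frame are both 1; a hypothetical 128-byte report inside that page maps to the same frame, which is why reporting gpa+size loses nothing compared with reporting a gfn.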
On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> From: Chao Peng <chao.p.peng@linux.intel.com>

[...]

> Note! To allow for future possibilities where KVM reports
> KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
> fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
> perspective), not '0'!

Is "-EHWPOISON" case not considered unresolved, so it is not mentioned here?

> Due to historical baggage within KVM, exiting to
> userspace with '0' from deep callstacks, e.g. in emulation paths, is
> infeasible as doing so would require a near-complete overhaul of KVM,
> whereas KVM already propagates -errno return codes to userspace even when
> the -errno originated in a low level helper.

[...]
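The KVM_CAP_MEMORY_FAULT_INFO documentation in the patch encourages userspace that retries KVM_RUN after resolving a fault to guard against receiving the same annotated fault over and over. One possible shape for such a guard is sketched below; the struct, the function name, and the limit of three repeats are all hypothetical choices for illustration, not anything defined by KVM or by this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: give up after this many consecutive identical faults. */
#define SKETCH_MAX_REPEATS 3u

/* Remembers the most recent fault and how often it repeated back to back. */
struct retry_guard {
	uint64_t last_gpa;
	uint64_t last_size;
	unsigned int repeats;
	bool have_last;
};

/*
 * Record a reported fault. Returns true if retrying KVM_RUN still looks
 * reasonable, false once the same [gpa, gpa + size) range has been
 * reported SKETCH_MAX_REPEATS times after its first occurrence.
 */
static bool retry_guard_note_fault(struct retry_guard *g,
				   uint64_t gpa, uint64_t size)
{
	if (g->have_last && g->last_gpa == gpa && g->last_size == size) {
		g->repeats++;
	} else {
		g->last_gpa = gpa;
		g->last_size = size;
		g->repeats = 0;
		g->have_last = true;
	}
	return g->repeats < SKETCH_MAX_REPEATS;
}
```

A VMM would call this each time it sees a valid memory fault report and treat a false return as fatal, rather than spinning on a fault it evidently cannot resolve.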
> +7.34 KVM_CAP_MEMORY_FAULT_INFO
> +------------------------------
> +
> +:Architectures: x86
> +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> +
> +The presence of this capability indicates that KVM_RUN will fill
> +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
> +there is a valid memslot but no backing VMA for the corresponding host virtual
> +address.
> +
> +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
> +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
> +to KVM_EXIT_MEMORY_FAULT.

IIUC returning -EFAULT or whatever -errno is sort of KVM internal
implementation. Is it better to relax the validity of kvm_run.memory_fault when
KVM_RUN returns any -errno?

[...]

> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2327,4 +2327,15 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>  /* Max number of entries allowed for each kvm dirty ring */
>  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
>
> +static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> +						 gpa_t gpa, gpa_t size)
> +{
> +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> +	vcpu->run->memory_fault.gpa = gpa;
> +	vcpu->run->memory_fault.size = size;
> +
> +	/* Flags are not (yet) defined or communicated to userspace. */
> +	vcpu->run->memory_fault.flags = 0;
> +}
> +

KVM_CAP_MEMORY_FAULT_INFO is x86 only, is it better to put this function to
<asm/kvm_host.h>?
On Wed, Nov 01, 2023, Kai Huang wrote:
>
> > +7.34 KVM_CAP_MEMORY_FAULT_INFO
> > +------------------------------
> > +
> > +:Architectures: x86
> > +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> > +
> > +The presence of this capability indicates that KVM_RUN will fill
> > +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
> > +there is a valid memslot but no backing VMA for the corresponding host virtual
> > +address.
> > +
> > +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
> > +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
> > +to KVM_EXIT_MEMORY_FAULT.
>
> IIUC returning -EFAULT or whatever -errno is sort of KVM internal
> implementation.

The errno that is returned to userspace is ABI. In KVM, it's a _very_ poorly
defined ABI for the vast majority of ioctls(), but it's still technically ABI.
KVM gets away with being cavalier with errno because the vast majority of errors
are considered fatal by userspace, i.e. in most cases, userspace simply doesn't
care about the exact errno.

A good example is KVM_RUN with -EINTR; if KVM were to return something other than
-EINTR on a pending signal or vcpu->run->immediate_exit, userspace would fall over.

> Is it better to relax the validity of kvm_run.memory_fault when
> KVM_RUN returns any -errno?

Not unless there's a need to do so, and if there is then we can update the
documentation accordingly. If KVM's ABI is that kvm_run.memory_fault is valid
for any errno, then KVM would need to purge kvm_run.exit_reason super early in
KVM_RUN, e.g. to prevent an -EINTR return due to immediate_exit from being
misinterpreted as KVM_EXIT_MEMORY_FAULT. And purging exit_reason super early is
subtly tricky because KVM's (again, poorly documented) ABI is that *some* exit
reasons are preserved across KVM_RUN with vcpu->run->immediate_exit (or with a
pending signal).

https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com

> [...]
>
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2327,4 +2327,15 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> >  /* Max number of entries allowed for each kvm dirty ring */
> >  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> >
> > +static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> > +						 gpa_t gpa, gpa_t size)
> > +{
> > +	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > +	vcpu->run->memory_fault.gpa = gpa;
> > +	vcpu->run->memory_fault.size = size;
> > +
> > +	/* Flags are not (yet) defined or communicated to userspace. */
> > +	vcpu->run->memory_fault.flags = 0;
> > +}
> > +
>
> KVM_CAP_MEMORY_FAULT_INFO is x86 only, is it better to put this function to
> <asm/kvm_host.h>?

I'd prefer to keep it in generic code, as it's highly likely to end up there
sooner than later. There's a known use case for ARM (exit to userspace on missing
userspace mapping[*]), and I'm guessing pKVM (also ARM) will also utilize this API.

[*] https://lore.kernel.org/all/20230908222905.1321305-8-amoorthy@google.com
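To make the errno point concrete, here is a rough sketch of how a userspace run loop might classify a KVM_RUN result under the rules discussed above: -EINTR is handled on its own and exit_reason is not consulted, and memory_fault is trusted only for EFAULT or EHWPOISON combined with the matching exit reason. The enum, the function name, and the locally defined exit-reason constant (38, the value in the patch) are illustrative assumptions rather than an existing API; a real VMM would take the constant from <linux/kvm.h>.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#ifndef EHWPOISON
#define EHWPOISON 133 /* fallback for headers that lack it; assumption */
#endif

/* Local mirror of KVM_EXIT_MEMORY_FAULT as defined by the patch. */
#define SKETCH_EXIT_MEMORY_FAULT 38u

enum run_outcome {
	RUN_NORMAL_EXIT,   /* KVM_RUN returned 0: dispatch on exit_reason */
	RUN_INTERRUPTED,   /* -EINTR: exit_reason may be stale, do not use it */
	RUN_MEMORY_FAULT,  /* kvm_run.memory_fault is valid */
	RUN_FATAL,         /* any other error */
};

/* ret and err are the ioctl() return value and the errno captured after it. */
static enum run_outcome classify_kvm_run(int ret, int err, uint32_t exit_reason)
{
	if (ret == 0)
		return RUN_NORMAL_EXIT;
	if (err == EINTR)
		return RUN_INTERRUPTED;
	if ((err == EFAULT || err == EHWPOISON) &&
	    exit_reason == SKETCH_EXIT_MEMORY_FAULT)
		return RUN_MEMORY_FAULT;
	return RUN_FATAL;
}
```

Checking errno before exit_reason is the design choice that matters here: an -EINTR return with a leftover KVM_EXIT_MEMORY_FAULT in exit_reason is classified as an interruption, not as a memory fault.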
On 11/2/2023 1:36 AM, Sean Christopherson wrote:
>> KVM_CAP_MEMORY_FAULT_INFO is x86 only, is it better to put this function to
>> <asm/kvm_host.h>?
> I'd prefer to keep it in generic code, as it's highly likely to end up there
> sooner than later. There's a known use case for ARM (exit to userspace on missing
> userspace mapping[*]), and I'm guessing pKVM (also ARM) will also utilize this API.
>
> [*]https://lore.kernel.org/all/20230908222905.1321305-8-amoorthy@google.com

I wonder how this CAP is supposed to be checked in userspace, for guest
memfd case? something like this?

	if (!kvm_check_extension(s, KVM_CAP_MEMORY_FAULT_INFO) &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT)
		abort("unexpected KVM_EXIT_MEMORY_FAULT");

In my implementation of QEMU patches, I find it's unnecessary. When
userspace gets an exit with KVM_EXIT_MEMORY_FAULT, it implies
"KVM_CAP_MEMORY_FAULT_INFO".

So I don't see how it is necessary in this series. Whether it's necessary or
not for [*], I don't have the answer but we can leave the discussion to that
patch series.
On Wed, 2023-11-01 at 10:36 -0700, Sean Christopherson wrote:
> On Wed, Nov 01, 2023, Kai Huang wrote:
> >
> > > +7.34 KVM_CAP_MEMORY_FAULT_INFO
> > > +------------------------------
> > > +
> > > +:Architectures: x86
> > > +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> > > +
> > > +The presence of this capability indicates that KVM_RUN will fill
> > > +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
> > > +there is a valid memslot but no backing VMA for the corresponding host virtual
> > > +address.
> > > +
> > > +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
> > > +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
> > > +to KVM_EXIT_MEMORY_FAULT.
> >
> > IIUC returning -EFAULT or whatever -errno is sort of KVM internal
> > implementation.
>
> The errno that is returned to userspace is ABI. In KVM, it's a _very_ poorly
> defined ABI for the vast majority of ioctls(), but it's still technically ABI.
> KVM gets away with being cavalier with errno because the vast majority of errors
> are considered fatal by userspace, i.e. in most cases, userspace simply doesn't
> care about the exact errno.
>
> A good example is KVM_RUN with -EINTR; if KVM were to return something other than
> -EINTR on a pending signal or vcpu->run->immediate_exit, userspace would fall over.
>
> > Is it better to relax the validity of kvm_run.memory_fault when
> > KVM_RUN returns any -errno?
>
> Not unless there's a need to do so, and if there is then we can update the
> documentation accordingly. If KVM's ABI is that kvm_run.memory_fault is valid
> for any errno, then KVM would need to purge kvm_run.exit_reason super early in
> KVM_RUN, e.g. to prevent an -EINTR return due to immediate_exit from being
> misinterpreted as KVM_EXIT_MEMORY_FAULT. And purging exit_reason super early is
> subtly tricky because KVM's (again, poorly documented) ABI is that *some* exit
> reasons are preserved across KVM_RUN with vcpu->run->immediate_exit (or with a
> pending signal).
>
> https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com
>

Agreed on not relaxing to any errno. However, using -EFAULT as part of the ABI
definition seems a little bit dangerous, e.g., someone could accidentally or
mistakenly return -EFAULT in KVM_RUN at an early time and/or in a completely
different code path, etc. -EINTR has a well-defined meaning, but -EFAULT (which
is "Bad address") doesn't seem to, but I am not sure either. :-)

One example is, for backing VMA with VM_IO | VM_PFNMAP, hva_to_pfn() returns
KVM_PFN_ERR_FAULT when the kernel cannot get a valid PFN (e.g. when SGX vepc
fault handler failed to allocate EPC) and kvm_handle_error_pfn() will just
return -EFAULT. If kvm_run.exit_reason isn't purged early then is it possible
to have some issue here?
On Thu, 2023-11-02 at 03:17 +0000, Huang, Kai wrote:
> On Wed, 2023-11-01 at 10:36 -0700, Sean Christopherson wrote:
> > On Wed, Nov 01, 2023, Kai Huang wrote:
> > >
> > > > +7.34 KVM_CAP_MEMORY_FAULT_INFO
> > > > +------------------------------
> > > > +
> > > > +:Architectures: x86
> > > > +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> > > > +
> > > > +The presence of this capability indicates that KVM_RUN will fill
> > > > +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
> > > > +there is a valid memslot but no backing VMA for the corresponding host virtual
> > > > +address.
> > > > +
> > > > +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
> > > > +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
> > > > +to KVM_EXIT_MEMORY_FAULT.
> > >
> > > IIUC returning -EFAULT or whatever -errno is sort of KVM internal
> > > implementation.
> >
> > The errno that is returned to userspace is ABI. In KVM, it's a _very_ poorly
> > defined ABI for the vast majority of ioctls(), but it's still technically ABI.
> > KVM gets away with being cavalier with errno because the vast majority of errors
> > are considered fatal by userspace, i.e. in most cases, userspace simply doesn't
> > care about the exact errno.
> >
> > A good example is KVM_RUN with -EINTR; if KVM were to return something other than
> > -EINTR on a pending signal or vcpu->run->immediate_exit, userspace would fall over.
> >
> > > Is it better to relax the validity of kvm_run.memory_fault when
> > > KVM_RUN returns any -errno?
> >
> > Not unless there's a need to do so, and if there is then we can update the
> > documentation accordingly. If KVM's ABI is that kvm_run.memory_fault is valid
> > for any errno, then KVM would need to purge kvm_run.exit_reason super early in
> > KVM_RUN, e.g. to prevent an -EINTR return due to immediate_exit from being
> > misinterpreted as KVM_EXIT_MEMORY_FAULT. And purging exit_reason super early is
> > subtly tricky because KVM's (again, poorly documented) ABI is that *some* exit
> > reasons are preserved across KVM_RUN with vcpu->run->immediate_exit (or with a
> > pending signal).
> >
> > https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com
> >
>
> Agreed on not relaxing to any errno. However, using -EFAULT as part of the ABI
> definition seems a little bit dangerous, e.g., someone could accidentally or
> mistakenly return -EFAULT in KVM_RUN at an early time and/or in a completely
> different code path, etc. -EINTR has a well-defined meaning, but -EFAULT (which
> is "Bad address") doesn't seem to, but I am not sure either. :-)
>
> One example is, for backing VMA with VM_IO | VM_PFNMAP, hva_to_pfn() returns
> KVM_PFN_ERR_FAULT when the kernel cannot get a valid PFN (e.g. when SGX vepc
> fault handler failed to allocate EPC) and kvm_handle_error_pfn() will just
> return -EFAULT. If kvm_run.exit_reason isn't purged early then is it possible
> to have some issue here?
>

Also, regardless of whether -EFAULT is too ambiguous to be part of ABI, could you
elaborate on the EHWPOISON part? IIUC KVM can already handle the case of poisoned
page by sending signal to user app:

	static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu,
					struct kvm_page_fault *fault)
	{
		...

		if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
			kvm_send_hwpoison_signal(fault->slot, fault->gfn);
			return RET_PF_RETRY;
		}
	}

And (sorry to hijack) I am wondering whether "SGX vepc unable to allocate EPC" can
also use this memory_fault mechanism. Currently as mentioned above when
vepc fault handler cannot allocate EPC page KVM returns -EFAULT to Qemu, and
Qemu prints ...

	...: Bad address
	<dump guest cpu registers>

... which is nonsense.

If we can use memory_fault.flags (or is 'fault_reason' a better name?) to carry
a specific value for EPC to let Qemu know and Qemu can then do more reasonable
things.
On 11/1/23 18:36, Sean Christopherson wrote:
> A good example is KVM_RUN with -EINTR; if KVM were to return something other than
> -EINTR on a pending signal or vcpu->run->immediate_exit, userspace would fall over.

And dually if KVM were to return KVM_EXIT_INTR together with something
other than -EINTR.

> And purging exit_reason super early is subtly tricky because KVM's
> (again, poorly documented) ABI is that *some* exit reasons are preserved
> across KVM_RUN with vcpu->run->immediate_exit (or with a pending
> signal). https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com

vcpu->run->immediate_exit preserves all exit reasons, but it's not a good
idea that immediate_exit behaves differently from a pending signal on entry
to KVM_RUN (remember that immediate_exit was meant to be a better
performing alternative to KVM_SET_SIGNAL_MASK).

In principle, vcpu->run->immediate_exit could return KVM_EXIT_INTR
(perhaps even _should_, except that breaks selftests so at this point it
is ABI).

Paolo
On 11/2/23 10:35, Huang, Kai wrote:
> IIUC KVM can already handle the case of poisoned
> page by sending signal to user app:
>
> 	static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu,
> 					struct kvm_page_fault *fault)
> 	{
> 		...
>
> 		if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
> 			kvm_send_hwpoison_signal(fault->slot, fault->gfn);
> 			return RET_PF_RETRY;
> 		}
> 	}

EHWPOISON is not implemented by this series, so it should be left out of
the documentation.

> Currently as mentioned above when
> vepc fault handler cannot allocate EPC page KVM returns -EFAULT to Qemu, and
> Qemu prints ...
>
> 	...: Bad address
> 	<dump guest cpu registers>
>
> ... which is nonsense.
>
> If we can use memory_fault.flags (or is 'fault_reason' a better name?) to carry
> a specific value for EPC to let Qemu know and Qemu can then do more reasonable
> things.

Yes, that's a good idea that can be implemented on top.

Paolo
On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> On 11/2/23 10:35, Huang, Kai wrote:
> > IIUC KVM can already handle the case of poisoned
> > page by sending signal to user app:
> >
> > 	static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct
> > 					kvm_page_fault *fault) {
> > 		...
> >
> > 		if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
> > 			kvm_send_hwpoison_signal(fault->slot, fault->gfn);

No, this doesn't work, because that signals the host virtual address

	unsigned long hva = gfn_to_hva_memslot(slot, gfn);

	send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current);

which is the *shared* page.

> > 			return RET_PF_RETRY;
> > 		}
> > 	}
>
> EHWPOISON is not implemented by this series, so it should be left out of the
> documentation.

EHWPOISON *is* implemented. kvm_gmem_get_pfn() returns -EHWPOISON as appropriate,
and kvm_faultin_pfn() returns that directly without going through kvm_handle_error_pfn().

  kvm_faultin_pfn_private()
  |
  |-> kvm_gmem_get_pfn()
      |
      |-> if (folio_test_hwpoison(folio)) {
                  r = -EHWPOISON;
                  goto out_unlock;
          }

          |
          |-> r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
                                   &max_order);
              if (r) {
                      kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
                      return r;
              }

              |
              |-> ret = __kvm_faultin_pfn(vcpu, fault);
                  if (ret != RET_PF_CONTINUE)
                          return ret;

                  if (unlikely(is_error_pfn(fault->pfn)))
                          return kvm_handle_error_pfn(vcpu, fault);
On Thu, Nov 02, 2023, Xiaoyao Li wrote:
> On 11/2/2023 1:36 AM, Sean Christopherson wrote:
> > > KVM_CAP_MEMORY_FAULT_INFO is x86 only, is it better to put this function to
> > > <asm/kvm_host.h>?
> > I'd prefer to keep it in generic code, as it's highly likely to end up there
> > sooner than later. There's a known use case for ARM (exit to userspace on missing
> > userspace mapping[*]), and I'm guessing pKVM (also ARM) will also utilize this API.
> >
> > [*]https://lore.kernel.org/all/20230908222905.1321305-8-amoorthy@google.com
>
> I wonder how this CAP is supposed to be checked in userspace, for guest
> memfd case?

It's basically useless for guest_memfd.

> 	if (!kvm_check_extension(s, KVM_CAP_MEMORY_FAULT_INFO) &&
> 	    run->exit_reason == KVM_EXIT_MEMORY_FAULT)
> 		abort("unexpected KVM_EXIT_MEMORY_FAULT");
>
> In my implementation of QEMU patches, I find it's unnecessary. When
> userspace gets an exit with KVM_EXIT_MEMORY_FAULT, it implies
> "KVM_CAP_MEMORY_FAULT_INFO".
>
> So I don't see how it is necessary in this series. Whether it's necessary or
> not for [*], I don't have the answer but we can leave the discussion to that
> patch series.

It's not strictly necessary there either.

However, Oliver felt (and presumably still feels) quite strongly, and I agree,
that reporting extra information shouldn't be tightly coupled to either
KVM_CAP_EXIT_ON_MISSING or KVM_CAP_GUEST_MEMFD.

E.g. if userspace develops a "standalone" use case for KVM_CAP_MEMORY_FAULT_INFO,
userspace should be able to check for support without having to take a dependency
on KVM_CAP_GUEST_MEMFD, especially because KVM_CAP_GUEST_MEMFD may not be
supported, i.e. userspace should be able to do:

	if (!kvm_check_extension(s, KVM_CAP_MEMORY_FAULT_INFO))
		abort("KVM_CAP_MEMORY_FAULT_INFO required for fancy feature XYZ");
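For readers outside QEMU, the standalone check described above might look roughly like the sketch below, which queries the capability with the KVM_CHECK_EXTENSION ioctl on an already opened KVM file descriptor. The helper name kvm_has_cap() is hypothetical (QEMU's kvm_check_extension() is its own wrapper), and the fallback definition of KVM_CAP_MEMORY_FAULT_INFO uses the value 231 from the patch only so the sketch builds against older uapi headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Older uapi headers predate the capability; 231 is the value in the patch. */
#ifndef KVM_CAP_MEMORY_FAULT_INFO
#define KVM_CAP_MEMORY_FAULT_INFO 231
#endif

/*
 * Returns true if KVM reports the capability as supported.
 * KVM_CHECK_EXTENSION returns 0 for unsupported capabilities and a
 * positive value for supported ones; a failed ioctl (e.g. a bad file
 * descriptor) is treated as "not supported".
 */
static bool kvm_has_cap(int kvm_fd, long cap)
{
	int r = ioctl(kvm_fd, KVM_CHECK_EXTENSION, cap);

	return r > 0;
}
```

A VMM with a standalone use for the fault information would open /dev/kvm, call kvm_has_cap(fd, KVM_CAP_MEMORY_FAULT_INFO), and refuse to enable its feature if that returns false, without consulting KVM_CAP_GUEST_MEMFD at all.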
On Thu, Nov 02, 2023, Kai Huang wrote:
> On Wed, 2023-11-01 at 10:36 -0700, Sean Christopherson wrote:
> > On Wed, Nov 01, 2023, Kai Huang wrote:
> > >
> > > > +7.34 KVM_CAP_MEMORY_FAULT_INFO
> > > > +------------------------------
> > > > +
> > > > +:Architectures: x86
> > > > +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> > > > +
> > > > +The presence of this capability indicates that KVM_RUN will fill
> > > > +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
> > > > +there is a valid memslot but no backing VMA for the corresponding host virtual
> > > > +address.
> > > > +
> > > > +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
> > > > +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
> > > > +to KVM_EXIT_MEMORY_FAULT.
> > >
> > > IIUC returning -EFAULT or whatever -errno is sort of KVM internal
> > > implementation.
> >
> > The errno that is returned to userspace is ABI. In KVM, it's a _very_ poorly
> > defined ABI for the vast majority of ioctls(), but it's still technically ABI.
> > KVM gets away with being cavalier with errno because the vast majority of errors
> > are considered fatal by userspace, i.e. in most cases, userspace simply doesn't
> > care about the exact errno.
> >
> > A good example is KVM_RUN with -EINTR; if KVM were to return something other than
> > -EINTR on a pending signal or vcpu->run->immediate_exit, userspace would fall over.
> >
> > > Is it better to relax the validity of kvm_run.memory_fault when
> > > KVM_RUN returns any -errno?
> >
> > Not unless there's a need to do so, and if there is then we can update the
> > documentation accordingly. If KVM's ABI is that kvm_run.memory_fault is valid
> > for any errno, then KVM would need to purge kvm_run.exit_reason super early in
> > KVM_RUN, e.g. to prevent an -EINTR return due to immediate_exit from being
> > misinterpreted as KVM_EXIT_MEMORY_FAULT. And purging exit_reason super early is
> > subtly tricky because KVM's (again, poorly documented) ABI is that *some* exit
> > reasons are preserved across KVM_RUN with vcpu->run->immediate_exit (or with a
> > pending signal).
> >
> > https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com
> >
> >
>
> Agreed on not relaxing this to any errno. However using -EFAULT as part of ABI
> definition seems a little bit dangerous, e.g., someone could accidentally or
> mistakenly return -EFAULT in KVM_RUN at early time and/or in a completely
> different code path, etc. -EINTR has well defined meaning, but -EFAULT (which
> is "Bad address") doesn't seem to, but I am not sure either. :-)

KVM has returned -EFAULT since forever, i.e. it's effectively already part of the
ABI. I doubt there's a userspace that relies precisely on -EFAULT, but userspace
definitely will be confused if KVM returns '0' where KVM used to return -EFAULT.
And so if we want to return '0', it needs to be opt-in, which means forcing
userspace to enable a capability *and* requires code in KVM to conditionally return
'0' instead of -EFAULT/-EHWPOISON.

> One example is, for backing VMA with VM_IO | VM_PFNMAP, hva_to_pfn() returns
> KVM_PFN_ERR_FAULT when the kernel cannot get a valid PFN (e.g. when SGX vepc
> fault handler failed to allocate EPC) and kvm_handle_error_pfn() will just
> return -EFAULT. If kvm_run.exit_reason isn't purged early then is it possible
> to have some issue here?

Well, yeah, but that's exactly why this series has a patch to reset exit_reason.
The solution to "if KVM is buggy then bad things happen" is to not have KVM bugs :-)
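The validity rule debated above (errno must be EFAULT or EHWPOISON *and*
exit_reason must be KVM_EXIT_MEMORY_FAULT) can be written down as a small
userspace predicate. This is only a sketch of how a VMM might encode the
documented rule; memory_fault_info_valid() is a hypothetical name, and the
fallback #defines are assumptions for builds where the constants are missing
(38 comes from the patch, 133 is the generic Linux errno value and differs on
a few architectures).

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: generic Linux value, used only if <errno.h> lacks EHWPOISON. */
#ifndef EHWPOISON
#define EHWPOISON 133
#endif

/* Value from the patch, used only if the uapi header is not included. */
#ifndef KVM_EXIT_MEMORY_FAULT
#define KVM_EXIT_MEMORY_FAULT 38
#endif

/*
 * Hypothetical helper: decide whether kvm_run.memory_fault may be consumed.
 * 'run_ret' is the return value of ioctl(KVM_RUN), 'run_errno' is errno as
 * sampled immediately after the ioctl, and 'exit_reason' is
 * kvm_run.exit_reason.
 */
static bool memory_fault_info_valid(int run_ret, int run_errno,
				    uint32_t exit_reason)
{
	/* KVM_EXIT_MEMORY_FAULT accompanies -1, never 0. */
	if (run_ret != -1)
		return false;

	/* For any other errno, exit_reason must be treated as stale. */
	if (run_errno != EFAULT && run_errno != EHWPOISON)
		return false;

	return exit_reason == KVM_EXIT_MEMORY_FAULT;
}
```

Note how an -EINTR return is rejected even when exit_reason still reads
KVM_EXIT_MEMORY_FAULT from an earlier run, which is the misinterpretation the
discussion above is trying to rule out.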
On Thu, 2023-11-02 at 08:44 -0700, Sean Christopherson wrote:
> On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> > On 11/2/23 10:35, Huang, Kai wrote:
> > > IIUC KVM can already handle the case of poisoned
> > > page by sending signal to user app:
> > >
> > > 	static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct
> > > 	kvm_page_fault *fault) {
> > > 		...
> > >
> > > 		if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
> > > 			kvm_send_hwpoison_signal(fault->slot, fault->gfn);
> >
> > No, this doesn't work, because that signals the host virtual address

Ah, right :-)
On Fri, Oct 27, 2023 at 11:21:51AM -0700, Sean Christopherson wrote:
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6723,6 +6723,26 @@ array field represents return values. The userspace should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>
> +::
> +
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory;
		  ^

Should update to "memory_fault" to align with other places.

[...]

> @@ -520,6 +521,12 @@ struct kvm_run {
>  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
>  			__u32 flags;
>  		} notify;
> +		/* KVM_EXIT_MEMORY_FAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 size;
> +		} memory_fault;
>  		/* Fix the size of the union. */
>  		char padding[256];
>  	};

Thanks,
Yilun
>
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index ace984acc125..860216536810 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6723,6 +6723,26 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory;
+
+KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
+could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
+guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
+describes properties of the faulting access that are likely pertinent.
+Currently, no flags are defined.
+
+Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
+accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
+or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
+kvm_run.exit_reason is stale/undefined for all other error numbers.
+
 ::
 
 		/* KVM_EXIT_NOTIFY */
@@ -7757,6 +7777,27 @@ This capability is aimed to mitigate the threat that malicious VMs can cause
 CPU stuck (due to event windows don't open up) and make the CPU unavailable to
 host or other VMs.
 
+7.34 KVM_CAP_MEMORY_FAULT_INFO
+------------------------------
+
+:Architectures: x86
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN will fill
+kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
+there is a valid memslot but no backing VMA for the corresponding host virtual
+address.
+
+The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
+an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
+to KVM_EXIT_MEMORY_FAULT.
+
+Note: Userspaces which attempt to resolve memory faults so that they can retry
+KVM_RUN are encouraged to guard against repeatedly receiving the same
+error/annotated fault.
+
+See KVM_EXIT_MEMORY_FAULT for more information.
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6409914428ca..ee3cd8c3c0ef 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4518,6 +4518,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
 	case KVM_CAP_IRQFD_RESAMPLE:
+	case KVM_CAP_MEMORY_FAULT_INFO:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4e741ff27af3..96aa930536b1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2327,4 +2327,15 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
+						 gpa_t gpa, gpa_t size)
+{
+	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+	vcpu->run->memory_fault.gpa = gpa;
+	vcpu->run->memory_fault.size = size;
+
+	/* Flags are not (yet) defined or communicated to userspace. */
+	vcpu->run->memory_fault.flags = 0;
+}
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index bd1abe067f28..7ae9987b48dd 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -274,6 +274,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -520,6 +521,12 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory_fault;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -1203,6 +1210,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
 #define KVM_CAP_USER_MEMORY2 230
+#define KVM_CAP_MEMORY_FAULT_INFO 231
 
 #ifdef KVM_CAP_IRQ_ROUTING
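The documentation hunk above encourages userspace to guard against repeatedly
receiving the same annotated fault when it retries KVM_RUN. The patch does not
say how; one possible approach, sketched here purely as an illustration, is to
remember the last reported [gpa, gpa + size) range and give up after a bounded
number of identical reports. struct fault_guard, fault_guard_should_retry()
and MAX_SAME_FAULT_RETRIES are all hypothetical names, not part of KVM's API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical policy knob: identical faults tolerated before giving up. */
#define MAX_SAME_FAULT_RETRIES 3

/* Per-vCPU bookkeeping for the most recently reported fault range. */
struct fault_guard {
	uint64_t gpa;
	uint64_t size;
	unsigned int repeats;	/* consecutive repeats of the same range */
	bool seen;		/* false until the first fault is recorded */
};

/*
 * Record a fault taken from kvm_run.memory_fault and report whether the
 * caller should attempt to resolve it and retry KVM_RUN. Returns false once
 * the same range has been reported MAX_SAME_FAULT_RETRIES times in a row
 * after its first occurrence.
 */
static bool fault_guard_should_retry(struct fault_guard *g,
				     uint64_t gpa, uint64_t size)
{
	if (g->seen && g->gpa == gpa && g->size == size) {
		g->repeats++;
	} else {
		/* A different range resets the counter. */
		g->gpa = gpa;
		g->size = size;
		g->repeats = 0;
		g->seen = true;
	}

	return g->repeats < MAX_SAME_FAULT_RETRIES;
}
```

A VMM would keep one zero-initialized struct fault_guard per vCPU and call the
helper only after confirming that kvm_run.memory_fault is valid; when it
returns false, treating the fault as fatal avoids spinning on a fault that
userspace cannot actually fix.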