Message ID | 20230908222905.1321305-5-amoorthy@google.com (mailing list archive)
---|---
State | New, archived
Series | Improve KVM + userfaultfd live migration via annotated memory faults.
On Fri, Sep 08, 2023, Anish Moorthy wrote:
> @@ -2318,4 +2324,33 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
>  /* Max number of entries allowed for each kvm dirty ring */
>  #define KVM_DIRTY_RING_MAX_ENTRIES  65536
>
> +/*
> + * Attempts to set the run struct's exit reason to KVM_EXIT_MEMORY_FAULT and
> + * populate the memory_fault field with the given information.
> + *
> + * WARNs and does nothing if the speculative exit canary has already been set
> + * or if 'vcpu' is not the current running vcpu.
> + */
> +static inline void kvm_handle_guest_uaccess_fault(struct kvm_vcpu *vcpu,
> +						  uint64_t gpa, uint64_t len, uint64_t flags)

After a lot of fiddling and leading you on a wild goose chase, I think the least
awful name is kvm_prepare_memory_fault_exit().  Like
kvm_prepare_emulation_failure_exit(), this doesn't actually "handle" anything,
it just preps for the exit.  If it actually returned something then maybe
kvm_handle_guest_uaccess_fault() would be an ok name (IIRC, that was my original
intent, but we wandered in a different direction).

And peeking at future patches, pass in the RWX flags as bools, that way this
helper can deal with the bools=>flags conversion.  Oh, and fill the flags with
bitwise ORs, that way future conflicts with private memory will be trivial to
resolve.

E.g.

static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
						 gpa_t gpa, gpa_t size,
						 bool is_write, bool is_exec)
{
	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
	vcpu->run->memory_fault.gpa = gpa;
	vcpu->run->memory_fault.size = size;

	vcpu->run->memory_fault.flags = 0;
	if (is_write)
		vcpu->run->memory_fault.flags |= KVM_MEMORY_FAULT_FLAG_WRITE;
	else if (is_exec)
		vcpu->run->memory_fault.flags |= KVM_MEMORY_FAULT_FLAG_EXEC;
	else
		vcpu->run->memory_fault.flags |= KVM_MEMORY_FAULT_FLAG_READ;
}

> +{
> +	/*
> +	 * Ensure that an unloaded vCPU's run struct isn't being modified

"unloaded" isn't accurate, e.g. the vCPU could be loaded, just not on this pCPU.
I'd just drop the comment entirely, this one is fairly self-explanatory.

> +	 */
> +	if (WARN_ON_ONCE(vcpu != kvm_get_running_vcpu()))
> +		return;
> +
> +	/*
> +	 * Warn when overwriting an already-populated run struct.
> +	 */

For future reference, use this style

	/*
	 *
	 */

only if the comment spans multiple lines.  For single line comments, just:

	/* Warn when overwriting an already-populated run struct. */

> +	WARN_ON_ONCE(vcpu->speculative_exit_canary != KVM_SPEC_EXIT_UNUSED);

As mentioned in the guest_memfd thread[1], this WARN can be triggered by
userspace, e.g. by getting KVM to fill the union but not exit, which is sadly
not too hard because emulator_write_phys() incorrectly treats all failures as
MMIO.

I'm not even sure how to fix that in a race-free, sane way.  E.g. rechecking
the memslots doesn't work because a memslot could appear between
__kvm_write_guest_page() failing and rechecking in
emulator_read_write_onepage().

Hmm, maybe we could get away with returning a different errno, e.g. -ENXIO?
And then emulator_write_phys() and emulator_read_write_onepage() can be taught
to handle different errors accordingly.

Anyways, I highly recommend just dropping the canary for now, trying to clean
up the emulator and get this fully functional probably won't be a smooth
process.
> diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
> index f089ab290978..d19aa7965392 100644
> --- a/tools/include/uapi/linux/kvm.h
> +++ b/tools/include/uapi/linux/kvm.h
> @@ -278,6 +278,9 @@ struct kvm_xen_exit {
>  /* Flags that describe what fields in emulation_failure hold valid data. */
>  #define KVM_INTERNAL_ERROR_EMULATION_FLAG_INSTRUCTION_BYTES	(1ULL << 0)
>
> +/* KVM_CAP_MEMORY_FAULT_INFO flag for kvm_run.flags */
> +#define KVM_RUN_MEMORY_FAULT_FILLED (1 << 8)
> +
>  /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
>  struct kvm_run {
>  	/* in */
> @@ -531,6 +534,27 @@ struct kvm_run {
>  		struct kvm_sync_regs regs;
>  		char padding[SYNC_REGS_SIZE_BYTES];
>  	} s;
> +
> +	/*
> +	 * This second exit union holds structs for exits which may be triggered
> +	 * after KVM has already initiated a different exit, and/or may be
> +	 * filled speculatively by KVM.
> +	 *
> +	 * For instance, because of limitations in KVM's uAPI, a memory fault
> +	 * may be encountered after an MMIO exit is initiated and exit_reason and
> +	 * kvm_run.mmio are filled: isolating the speculative exits here ensures
> +	 * that KVM won't clobber information for the original exit.
> +	 */
> +	union {
> +		/* KVM_RUN_MEMORY_FAULT_FILLED + EFAULT */
> +		struct {
> +			__u64 flags;
> +			__u64 gpa;
> +			__u64 len;
> +		} memory_fault;
> +		/* Fix the size of the union. */
> +		char speculative_exit_padding[256];
> +	};
>  };

As proposed in the guest_memfd thread[2], I think we should scrap the second
union and just commit to achieving 100% accuracy only for page fault paths in
the initial merge.

I'll send you a clean-ish patch to use as a starting point sometime next week.

[1] https://lore.kernel.org/all/ZRtxoaJdVF1C2Mvy@google.com
[2] https://lore.kernel.org/all/ZQ3AmLO2SYv3DszH@google.com
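As a concrete illustration of the -ENXIO idea floated above, the emulator
change might look roughly like the sketch below. Note the assumptions: today
kvm_vcpu_write_guest() returns only 0 or -EFAULT, so the -ENXIO return for
"no memslot covers the gpa" is hypothetical, and emulator_read_write_onepage()
would still need to be taught to propagate the new error instead of falling
back to MMIO. A sketch of the direction, not the actual fix:

	static int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
				       const void *val, int bytes)
	{
		int ret = kvm_vcpu_write_guest(vcpu, gpa, val, bytes);

		/* Hypothetical: -ENXIO == no memslot, fall back to MMIO. */
		if (ret == -ENXIO)
			return 0;
		/* Any other failure, e.g. -EFAULT, is a real uaccess error. */
		if (ret < 0)
			return ret;
		kvm_page_track_write(vcpu, gpa, val, bytes);
		return 1;
	}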
On Wed, Oct 4, 2023 at 6:14 PM Sean Christopherson <seanjc@google.com> wrote:
>
> And peeking at future patches, pass in the RWX flags as bools, that way this
> helper can deal with the bools=>flags conversion.  Oh, and fill the flags with
> bitwise ORs, that way future conflicts with private memory will be trivial to
> resolve.
>
> E.g.
>
> static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
>						  gpa_t gpa, gpa_t size,
>						  bool is_write, bool is_exec)
> {
>	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
>	vcpu->run->memory_fault.gpa = gpa;
>	vcpu->run->memory_fault.size = size;
>
>	vcpu->run->memory_fault.flags = 0;
>	if (is_write)
>		vcpu->run->memory_fault.flags |= KVM_MEMORY_FAULT_FLAG_WRITE;
>	else if (is_exec)
>		vcpu->run->memory_fault.flags |= KVM_MEMORY_FAULT_FLAG_EXEC;
>	else
>		vcpu->run->memory_fault.flags |= KVM_MEMORY_FAULT_FLAG_READ;
> }

Is a BUG/VM_BUG_ON() warranted in the (is_write && is_exec) case, do you think?
I see that user_mem_abort() already VM_BUG_ON()s for this case, but if there's
one in the x86 page fault path I don't immediately see it. Also this helper
could be called from other paths, so maybe there's some value in it.

> I'll send you a clean-ish patch to use as a starting point sometime next week.

You mean something containing the next spin of the guest memfd stuff? Or parts
of David's series as well?
On Thu, Oct 05, 2023, Anish Moorthy wrote:
> On Wed, Oct 4, 2023 at 6:14 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > And peeking at future patches, pass in the RWX flags as bools, that way this
> > helper can deal with the bools=>flags conversion.  Oh, and fill the flags with
> > bitwise ORs, that way future conflicts with private memory will be trivial to
> > resolve.
> >
> > E.g.
> >
> > static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> >						   gpa_t gpa, gpa_t size,
> >						   bool is_write, bool is_exec)
> > {
> >	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> >	vcpu->run->memory_fault.gpa = gpa;
> >	vcpu->run->memory_fault.size = size;
> >
> >	vcpu->run->memory_fault.flags = 0;
> >	if (is_write)
> >		vcpu->run->memory_fault.flags |= KVM_MEMORY_FAULT_FLAG_WRITE;
> >	else if (is_exec)
> >		vcpu->run->memory_fault.flags |= KVM_MEMORY_FAULT_FLAG_EXEC;
> >	else
> >		vcpu->run->memory_fault.flags |= KVM_MEMORY_FAULT_FLAG_READ;
> > }
>
> Is a BUG/VM_BUG_ON() warranted in the (is_write && is_exec) case, do
> you think?

Nah, not here.  If we really wanted to add a sanity check, a KVM_MMU_WARN_ON()
in kvm_mmu_page_fault() would be the way to go.

> I see that user_mem_abort() already VM_BUG_ON()s for this case, but if there's
> one in the x86 page fault path I don't immediately see it. Also this helper
> could be called from other paths, so maybe there's some value in it.
>
> > I'll send you a clean-ish patch to use as a starting point sometime next week.
>
> You mean something containing the next spin of the guest memfd stuff?
> Or parts of David's series as well?

guest_memfd stuff.  Effectively this patch, just massaged to splice together
all of the fixups.
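For reference, such a sanity check might look like the minimal sketch below,
placed at the top of kvm_mmu_page_fault(); the exact location and the use of
PFERR_WRITE_MASK/PFERR_FETCH_MASK here are illustrative, not a committed
design:

	int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
			       void *insn, int insn_len)
	{
		/*
		 * A single access cannot be both a write and an instruction
		 * fetch; catch bogus RWX combinations once at the common
		 * entry point instead of in every fault path.
		 */
		KVM_MMU_WARN_ON((error_code & PFERR_WRITE_MASK) &&
				(error_code & PFERR_FETCH_MASK));

		/* ... existing fault handling ... */
	}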
On Fri, Sep 8, 2023 at 3:30 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> KVM_CAP_MEMORY_FAULT_INFO allows kvm_run to return useful information
> besides a return value of -1 and errno of EFAULT when a vCPU fails an
> access to guest memory which may be resolvable by userspace.
>
> Add documentation, updates to the KVM headers, and a helper function
> (kvm_handle_guest_uaccess_fault()) for implementing the capability.
>
> Mark KVM_CAP_MEMORY_FAULT_INFO as available on arm64 and x86, even
> though EFAULT annotations are currently totally absent. Picking a point
> to declare the implementation "done" is difficult because
>
> 1. Annotations will be performed incrementally in subsequent commits
>    across both core and arch-specific KVM.
> 2. The initial series will very likely miss some cases which need
>    annotation. Although these omissions are to be fixed in the future,
>    userspace thus still needs to expect and be able to handle
>    unannotated EFAULTs.
>
> Given these qualifications, just marking it available here seems the
> least arbitrary thing to do.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
[...]
> +::
> +		union {
> +			/* KVM_SPEC_EXIT_MEMORY_FAULT */
> +			struct {
> +				__u64 flags;
> +				__u64 gpa;
> +				__u64 len; /* in bytes */

I wonder if `gpa` and `len` should just be replaced with `gfn`.

- We don't seem to care about returning an exact `gpa` out to
userspace since this series just returns gpa = gfn * PAGE_SIZE out to
userspace.
- The len we return seems kind of arbitrary. PAGE_SIZE on x86 and
vma_pagesize on ARM64. But at the end of the day we're not asking the
kernel to fault in any specific length of mapping. We're just asking
for gfn-to-pfn for a specific gfn.
- I'm not sure userspace will want to do anything with this information.
On Tue, Oct 10, 2023, David Matlack wrote:
> On Fri, Sep 8, 2023 at 3:30 PM Anish Moorthy <amoorthy@google.com> wrote:
> >
> > KVM_CAP_MEMORY_FAULT_INFO allows kvm_run to return useful information
> > besides a return value of -1 and errno of EFAULT when a vCPU fails an
> > access to guest memory which may be resolvable by userspace.
> >
> > Add documentation, updates to the KVM headers, and a helper function
> > (kvm_handle_guest_uaccess_fault()) for implementing the capability.
> >
> > Mark KVM_CAP_MEMORY_FAULT_INFO as available on arm64 and x86, even
> > though EFAULT annotations are currently totally absent. Picking a point
> > to declare the implementation "done" is difficult because
> >
> > 1. Annotations will be performed incrementally in subsequent commits
> >    across both core and arch-specific KVM.
> > 2. The initial series will very likely miss some cases which need
> >    annotation. Although these omissions are to be fixed in the future,
> >    userspace thus still needs to expect and be able to handle
> >    unannotated EFAULTs.
> >
> > Given these qualifications, just marking it available here seems the
> > least arbitrary thing to do.
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > ---
> [...]
> > +::
> > +		union {
> > +			/* KVM_SPEC_EXIT_MEMORY_FAULT */
> > +			struct {
> > +				__u64 flags;
> > +				__u64 gpa;
> > +				__u64 len; /* in bytes */
>
> I wonder if `gpa` and `len` should just be replaced with `gfn`.
>
> - We don't seem to care about returning an exact `gpa` out to
> userspace since this series just returns gpa = gfn * PAGE_SIZE out to
> userspace.
> - The len we return seems kind of arbitrary. PAGE_SIZE on x86 and
> vma_pagesize on ARM64. But at the end of the day we're not asking the
> kernel to fault in any specific length of mapping. We're just asking
> for gfn-to-pfn for a specific gfn.
> - I'm not sure userspace will want to do anything with this information.

Extending ABI is tricky.  E.g. if a use case comes along that needs/wants to
return a range, then we'd need to add a flag and also update userspace to
actually do the right thing.

The page fault path doesn't need such information because hardware gives a very
precise faulting address.  But if we ever get to a point where KVM provides
info for uaccess failures, then we'll likely want to provide the range.  E.g.
if a uaccess splits a page, on x86, we'd either need to register our own
exception fixup and use custom uaccess macros (eww), or convince the world that
ex_handler_uaccess() and all of the uaccess macros need to be extended to
provide the exact address that failed.

And for SNP and TDX, I believe the range will be used when the guest uses a
hardware-vendor-defined hypercall to request conversions between private and
shared.  Or maybe the plan is to funnel those into KVM_HC_MAP_GPA_RANGE?
On Tue, Oct 10, 2023 at 4:40 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Oct 10, 2023, David Matlack wrote:
> > On Fri, Sep 8, 2023 at 3:30 PM Anish Moorthy <amoorthy@google.com> wrote:
> > >
> > > +::
> > > +		union {
> > > +			/* KVM_SPEC_EXIT_MEMORY_FAULT */
> > > +			struct {
> > > +				__u64 flags;
> > > +				__u64 gpa;
> > > +				__u64 len; /* in bytes */
> >
> > I wonder if `gpa` and `len` should just be replaced with `gfn`.
> >
> > - We don't seem to care about returning an exact `gpa` out to
> > userspace since this series just returns gpa = gfn * PAGE_SIZE out to
> > userspace.
> > - The len we return seems kind of arbitrary. PAGE_SIZE on x86 and
> > vma_pagesize on ARM64. But at the end of the day we're not asking the
> > kernel to fault in any specific length of mapping. We're just asking
> > for gfn-to-pfn for a specific gfn.
> > - I'm not sure userspace will want to do anything with this information.
>
> Extending ABI is tricky.  E.g. if a use case comes along that needs/wants to
> return a range, then we'd need to add a flag and also update userspace to
> actually do the right thing.
>
> The page fault path doesn't need such information because hardware gives a
> very precise faulting address.  But if we ever get to a point where KVM
> provides info for uaccess failures, then we'll likely want to provide the
> range.  E.g. if a uaccess splits a page, on x86, we'd either need to register
> our own exception fixup and use custom uaccess macros (eww), or convince the
> world that ex_handler_uaccess() and all of the uaccess macros need to be
> extended to provide the exact address that failed.

I wonder if userspace might need a precise fault address in some situations?
e.g. If KVM returns -EHWPOISON for an access that spans a page boundary,
userspace won't know which page is poisoned. Maybe SNP/TDX need precise fault
addresses as well? I don't know enough about how SNP and TDX plan to use this
UAPI.

> And for SNP and TDX, I believe the range will be used when the guest uses a
> hardware-vendor-defined hypercall to request conversions between private and
> shared.  Or maybe the plan is to funnel those into KVM_HC_MAP_GPA_RANGE?
On Mon, Oct 16, 2023, David Matlack wrote:
> On Tue, Oct 10, 2023 at 4:40 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Oct 10, 2023, David Matlack wrote:
> > > On Fri, Sep 8, 2023 at 3:30 PM Anish Moorthy <amoorthy@google.com> wrote:
> > > >
> > > > +::
> > > > +		union {
> > > > +			/* KVM_SPEC_EXIT_MEMORY_FAULT */
> > > > +			struct {
> > > > +				__u64 flags;
> > > > +				__u64 gpa;
> > > > +				__u64 len; /* in bytes */
> > >
> > > I wonder if `gpa` and `len` should just be replaced with `gfn`.
> > >
> > > - We don't seem to care about returning an exact `gpa` out to
> > > userspace since this series just returns gpa = gfn * PAGE_SIZE out to
> > > userspace.
> > > - The len we return seems kind of arbitrary. PAGE_SIZE on x86 and
> > > vma_pagesize on ARM64. But at the end of the day we're not asking the
> > > kernel to fault in any specific length of mapping. We're just asking
> > > for gfn-to-pfn for a specific gfn.
> > > - I'm not sure userspace will want to do anything with this information.
> >
> > Extending ABI is tricky.  E.g. if a use case comes along that needs/wants to
> > return a range, then we'd need to add a flag and also update userspace to
> > actually do the right thing.
> >
> > The page fault path doesn't need such information because hardware gives a
> > very precise faulting address.  But if we ever get to a point where KVM
> > provides info for uaccess failures, then we'll likely want to provide the
> > range.  E.g. if a uaccess splits a page, on x86, we'd either need to
> > register our own exception fixup and use custom uaccess macros (eww), or
> > convince the world that ex_handler_uaccess() and all of the uaccess macros
> > need to be extended to provide the exact address that failed.
>
> I wonder if userspace might need a precise fault address in some
> situations? e.g. If KVM returns -EHWPOISON for an access that spans a
> page boundary, userspace won't know which page is poisoned.

As things currently stand, the -EHWPOISON case is guaranteed to be precise
because uaccess failures only ever return -EFAULT.  The resulting BUS_MCEERR_AR
from the kernel's #MC handler will provide the necessary precision to
userspace.

Though even if -EHWPOISON were imprecise, userspace should be able to figure
out which page is poisoned, e.g. by probing each possible page (gross, but
doable).

Ah, and a much more concrete reason to report gpa+len is that it's possible
that KVM may someday support faults at sub-page granularity, e.g. if something
like HEKI[*] wants to use Intel's Sub-Page Write Permissions to make a minimal
amount of guest code writable when the guest kernel is doing code patching.

> Maybe SNP/TDX need precise fault addresses as well? I don't know enough about
> how SNP and TDX plan to use this UAPI.

FWIW, SNP and TDX usage is limited to the KVM page fault path, i.e. they always
do precise, single-page reporting.

[*] https://lore.kernel.org/all/20230505152046.6575-1-mic@digikod.net
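The "probe each possible page" fallback mentioned above could be implemented in
userspace roughly as follows. This is only a sketch of the gross-but-doable
approach, not part of the proposed uAPI, and it assumes that touching a
poisoned page raises SIGBUS on access:

	#include <setjmp.h>
	#include <signal.h>
	#include <stdbool.h>

	static sigjmp_buf probe_env;

	static void probe_sigbus(int sig)
	{
		siglongjmp(probe_env, 1);
	}

	/* Returns true if reading one byte at 'hva' raises SIGBUS. */
	static bool page_is_poisoned(volatile char *hva)
	{
		struct sigaction sa = { .sa_handler = probe_sigbus };
		struct sigaction old;
		bool poisoned = false;

		sigaction(SIGBUS, &sa, &old);
		if (sigsetjmp(probe_env, 1))
			poisoned = true;	/* jumped here from the handler */
		else
			(void)*hva;		/* faults iff the page is poisoned */
		sigaction(SIGBUS, &old, NULL);
		return poisoned;
	}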
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 660d9ca7a251..92fd3faa6bab 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6130,8 +6130,10 @@ local APIC is not used.
 
 	__u16 flags;
 
-More architecture-specific flags detailing state of the VCPU that may
-affect the device's behavior. Current defined flags::
+Flags detailing state of the VCPU. The lower/upper bytes encode architecture
+specific/agnostic bytes respectively. Current defined flags
+
+::
 
   /* x86, set if the VCPU is in system management mode */
   #define KVM_RUN_X86_SMM     (1 << 0)
@@ -6140,6 +6142,9 @@ affect the device's behavior. Current defined flags::
   /* arm64, set for KVM_EXIT_DEBUG */
   #define KVM_DEBUG_ARCH_HSR_HIGH_VALID  (1 << 0)
 
+  /* Arch-agnostic, see KVM_CAP_MEMORY_FAULT_INFO */
+  #define KVM_RUN_MEMORY_FAULT_FILLED (1 << 8)
+
 ::
 
 	/* in (pre_kvm_run), out (post_kvm_run) */
@@ -6750,6 +6755,18 @@ kvm_valid_regs for specific bits. These bits are architecture specific and
 usually define the validity of a groups of registers. (e.g. one bit for general
 purpose registers)
 
+::
+		union {
+			/* KVM_SPEC_EXIT_MEMORY_FAULT */
+			struct {
+				__u64 flags;
+				__u64 gpa;
+				__u64 len; /* in bytes */
+			} memory_fault;
+
+Indicates a memory fault on the guest physical address range
+[gpa, gpa + len). See KVM_CAP_MEMORY_FAULT_INFO for more details.
+
 Please note that the kernel is allowed to use the kvm_run structure as the
 primary storage for certain register types. Therefore, the kernel may use the
 values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
@@ -7736,6 +7753,34 @@ This capability is aimed to mitigate the threat that malicious VMs can cause
 CPU stuck (due to event windows don't open up) and make the CPU unavailable to
 host or other VMs.
 
+7.34 KVM_CAP_MEMORY_FAULT_INFO
+------------------------------
+
+:Architectures: x86, arm64
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN may fill
+kvm_run.memory_fault in response to failed guest memory accesses in a vCPU
+context.
+
+When KVM_RUN returns an error with errno=EFAULT, (kvm_run.flags &
+KVM_RUN_MEMORY_FAULT_FILLED) indicates that the kvm_run.memory_fault field is
+valid. This capability is only partially implemented in that not all EFAULTs
+from KVM_RUN may be annotated in this way: these "bare" EFAULTs should be
+considered bugs and reported to the maintainers.
+
+The 'gpa' and 'len' (in bytes) fields of kvm_run.memory_fault describe the range
+of guest physical memory to which access failed, i.e. [gpa, gpa + len). 'flags'
+is a bitfield indicating the nature of the access: valid masks are
+
+ - KVM_MEMORY_FAULT_FLAG_READ:  The failed access was a read.
+ - KVM_MEMORY_FAULT_FLAG_WRITE: The failed access was a write.
+ - KVM_MEMORY_FAULT_FLAG_EXEC:  The failed access was an exec.
+
+Note: Userspaces which attempt to resolve memory faults so that they can retry
+KVM_RUN are encouraged to guard against repeatedly receiving the same
+error/annotated fault.
+
 8. Other capabilities.
 ======================
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fb6c6109fdca..9206ac944d31 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -392,6 +392,12 @@ struct kvm_vcpu {
 	 */
 	struct kvm_memory_slot *last_used_slot;
 	u64 last_used_slot_gen;
+
+	/*
+	 * KVM_RUN initializes this value to KVM_SPEC_EXIT_UNUSED on entry and
+	 * sets it to something else when it fills the speculative_exit struct.
+	 */
+	u8 speculative_exit_canary;
 };
 
 /*
@@ -2318,4 +2324,33 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
 /* Max number of entries allowed for each kvm dirty ring */
 #define KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+/*
+ * Attempts to set the run struct's exit reason to KVM_EXIT_MEMORY_FAULT and
+ * populate the memory_fault field with the given information.
+ *
+ * WARNs and does nothing if the speculative exit canary has already been set
+ * or if 'vcpu' is not the current running vcpu.
+ */
+static inline void kvm_handle_guest_uaccess_fault(struct kvm_vcpu *vcpu,
+						  uint64_t gpa, uint64_t len, uint64_t flags)
+{
+	/*
+	 * Ensure that an unloaded vCPU's run struct isn't being modified
+	 */
+	if (WARN_ON_ONCE(vcpu != kvm_get_running_vcpu()))
+		return;
+
+	/*
+	 * Warn when overwriting an already-populated run struct.
+	 */
+	WARN_ON_ONCE(vcpu->speculative_exit_canary != KVM_SPEC_EXIT_UNUSED);
+
+	vcpu->speculative_exit_canary = KVM_SPEC_EXIT_MEMORY_FAULT;
+
+	vcpu->run->flags |= KVM_RUN_MEMORY_FAULT_FILLED;
+	vcpu->run->memory_fault.gpa = gpa;
+	vcpu->run->memory_fault.len = len;
+	vcpu->run->memory_fault.flags = flags;
+}
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f089ab290978..b2e4ac83b5a8 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -265,6 +265,9 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
 
+#define KVM_SPEC_EXIT_UNUSED        0
+#define KVM_SPEC_EXIT_MEMORY_FAULT  1
+
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
 #define KVM_INTERNAL_ERROR_EMULATION	1
@@ -278,6 +281,9 @@ struct kvm_xen_exit {
 /* Flags that describe what fields in emulation_failure hold valid data. */
 #define KVM_INTERNAL_ERROR_EMULATION_FLAG_INSTRUCTION_BYTES	(1ULL << 0)
 
+/* KVM_CAP_MEMORY_FAULT_INFO flag for kvm_run.flags */
+#define KVM_RUN_MEMORY_FAULT_FILLED (1 << 8)
+
 /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
 struct kvm_run {
 	/* in */
@@ -531,6 +537,27 @@ struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
+
+	/*
+	 * This second exit union holds structs for exits which may be triggered
+	 * after KVM has already initiated a different exit, and/or may be
+	 * filled speculatively by KVM.
+	 *
+	 * For instance, because of limitations in KVM's uAPI, a memory fault
+	 * may be encountered after an MMIO exit is initiated and exit_reason and
+	 * kvm_run.mmio are filled: isolating the speculative exits here ensures
+	 * that KVM won't clobber information for the original exit.
+	 */
+	union {
+		/* KVM_SPEC_EXIT_MEMORY_FAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 len;
+		} memory_fault;
+		/* Fix the size of the union. */
+		char speculative_exit_padding[256];
+	};
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
@@ -1192,6 +1219,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_COUNTER_OFFSET 227
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
+#define KVM_CAP_MEMORY_FAULT_INFO 230
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -2249,4 +2277,10 @@ struct kvm_s390_zpci_op {
 /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
 #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
 
+/* flags for KVM_CAP_MEMORY_FAULT_INFO */
+
+#define KVM_MEMORY_FAULT_FLAG_READ    (1 << 0)
+#define KVM_MEMORY_FAULT_FLAG_WRITE   (1 << 1)
+#define KVM_MEMORY_FAULT_FLAG_EXEC    (1 << 2)
+
 #endif /* __LINUX_KVM_H */
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index f089ab290978..d19aa7965392 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -278,6 +278,9 @@ struct kvm_xen_exit {
 /* Flags that describe what fields in emulation_failure hold valid data. */
 #define KVM_INTERNAL_ERROR_EMULATION_FLAG_INSTRUCTION_BYTES	(1ULL << 0)
 
+/* KVM_CAP_MEMORY_FAULT_INFO flag for kvm_run.flags */
+#define KVM_RUN_MEMORY_FAULT_FILLED (1 << 8)
+
 /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
 struct kvm_run {
 	/* in */
@@ -531,6 +534,27 @@ struct kvm_run {
 		struct kvm_sync_regs regs;
 		char padding[SYNC_REGS_SIZE_BYTES];
 	} s;
+
+	/*
+	 * This second exit union holds structs for exits which may be triggered
+	 * after KVM has already initiated a different exit, and/or may be
+	 * filled speculatively by KVM.
+	 *
+	 * For instance, because of limitations in KVM's uAPI, a memory fault
+	 * may be encountered after an MMIO exit is initiated and exit_reason and
+	 * kvm_run.mmio are filled: isolating the speculative exits here ensures
+	 * that KVM won't clobber information for the original exit.
+	 */
+	union {
+		/* KVM_RUN_MEMORY_FAULT_FILLED + EFAULT */
+		struct {
+			__u64 flags;
+			__u64 gpa;
+			__u64 len;
+		} memory_fault;
+		/* Fix the size of the union. */
+		char speculative_exit_padding[256];
+	};
 };
 
 /* for KVM_REGISTER_COALESCED_MMIO / KVM_UNREGISTER_COALESCED_MMIO */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8b2d5aab32bf..e31435179764 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4151,6 +4151,8 @@ static long kvm_vcpu_ioctl(struct file *filp,
 			synchronize_rcu();
 			put_pid(oldpid);
 		}
+		vcpu->speculative_exit_canary = KVM_SPEC_EXIT_UNUSED;
+		vcpu->run->flags &= ~KVM_RUN_MEMORY_FAULT_FILLED;
 		r = kvm_arch_vcpu_ioctl_run(vcpu);
 		trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
 		break;
@@ -4539,6 +4541,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_CHECK_EXTENSION_VM:
 	case KVM_CAP_ENABLE_CAP_VM:
 	case KVM_CAP_HALT_POLL:
+	case KVM_CAP_MEMORY_FAULT_INFO:
 		return 1;
 #ifdef CONFIG_KVM_MMIO
 	case KVM_CAP_COALESCED_MMIO:
KVM_CAP_MEMORY_FAULT_INFO allows kvm_run to return useful information
besides a return value of -1 and errno of EFAULT when a vCPU fails an
access to guest memory which may be resolvable by userspace.

Add documentation, updates to the KVM headers, and a helper function
(kvm_handle_guest_uaccess_fault()) for implementing the capability.

Mark KVM_CAP_MEMORY_FAULT_INFO as available on arm64 and x86, even
though EFAULT annotations are currently totally absent. Picking a point
to declare the implementation "done" is difficult because

1. Annotations will be performed incrementally in subsequent commits
   across both core and arch-specific KVM.
2. The initial series will very likely miss some cases which need
   annotation. Although these omissions are to be fixed in the future,
   userspace thus still needs to expect and be able to handle
   unannotated EFAULTs.

Given these qualifications, just marking it available here seems the
least arbitrary thing to do.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst | 49 ++++++++++++++++++++++++++++++++--
 include/linux/kvm_host.h       | 35 ++++++++++++++++++++++++
 include/uapi/linux/kvm.h       | 34 +++++++++++++++++++++++
 tools/include/uapi/linux/kvm.h | 24 +++++++++++++++++
 virt/kvm/kvm_main.c            |  3 +++
 5 files changed, 143 insertions(+), 2 deletions(-)
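To show how the pieces documented above fit together, a userspace VMM's run
loop might consume the capability roughly as follows. Everything here uses only
names defined by this patch, except resolve_guest_fault(), a hypothetical
stand-in for the VMM's actual recovery path (e.g. UFFDIO_COPY for the
userfaultfd live migration use case):

	#include <errno.h>
	#include <linux/kvm.h>
	#include <sys/ioctl.h>

	/* Hypothetical: make [gpa, gpa + len) accessible, then return 0. */
	extern int resolve_guest_fault(__u64 gpa, __u64 len, __u64 flags);

	/* 'run' is the vCPU's mmap()ed kvm_run struct. */
	static int run_vcpu(int vcpu_fd, struct kvm_run *run)
	{
		for (;;) {
			if (ioctl(vcpu_fd, KVM_RUN, 0) == 0)
				return 0;	/* caller handles run->exit_reason */

			if (errno == EFAULT &&
			    (run->flags & KVM_RUN_MEMORY_FAULT_FILLED)) {
				/* Annotated fault: try to resolve, then re-enter. */
				if (resolve_guest_fault(run->memory_fault.gpa,
							run->memory_fault.len,
							run->memory_fault.flags))
					return -1;
				continue;
			}
			return -1;	/* bare EFAULT or other error */
		}
	}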