Message ID | 20230602161921.208564-10-amoorthy@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Improve scalability of KVM + userfaultfd live migration via annotated memory faults. | expand |
On Fri, Jun 02, 2023, Anish Moorthy wrote: > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 05d6e7e3994d..2c276d4d0821 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -1527,6 +1527,9 @@ static int check_memory_region_flags(const struct kvm_userspace_memory_region *m > valid_flags |= KVM_MEM_READONLY; > #endif > > + if (kvm_vm_ioctl_check_extension(NULL, KVM_CAP_NOWAIT_ON_FAULT)) Rather than force architectures to add the extension, probably better to use a config HAVE_KVM_NOWAIT_ON_FAULT bool and select that from arch Kconfigs. That way the enumeration can be done in common code, and then this can be computed at compile time doesn't need to do a rather weird invocation of kvm_dev_ioctl() with KVM_CHECK_EXTENSION. FWIW, you should be able to do if (IS_ENABLED(CONFIG_HAVE_KVM_NOWAIT_ON_FAULT)) and avoid more #ifdefs.
On Fri, Jun 02, 2023, Anish Moorthy wrote: > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 69a221f71914..abbc5dd72292 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -2297,4 +2297,10 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr) > */ > inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu, > uint64_t gpa, uint64_t len, uint64_t flags); > + > +static inline bool kvm_slot_nowait_on_fault( > + const struct kvm_memory_slot *slot) Just when I was starting to think that we had beat all of the Google3 out of you :-) And predicate helpers in KVM typically have "is" or "has" in the name, so that it's clear the helper queries, versus e.g. sets the flag or something. > +{ > + return slot->flags & KVM_MEM_NOWAIT_ON_FAULT; KVM is anything but consistent, but if the helper is likely to be called without a known good memslot, it probably makes sense to have the helper check for NULL, e.g. static inline bool kvm_is_slot_nowait_on_fault(const struct kvm_memory_slot *slot) { return slot && slot->flags & KVM_MEM_NOWAIT_ON_FAULT; } However, do we actually need to force vendor code to query nowait? At a glance, the only external (relative to kvm_main.c) users of __gfn_to_pfn_memslot() are in flows that play nice with nowait or that don't support it at all. So I *think* we can do this? diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index be06b09e9104..78024318286d 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2403,6 +2403,11 @@ static bool memslot_is_readonly(const struct kvm_memory_slot *slot) return slot->flags & KVM_MEM_READONLY; } +static bool memslot_is_nowait_on_fault(const struct kvm_memory_slot *slot) +{ + return slot->flags & KVM_MEM_NOWAIT_ON_FAULT; +} + static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn, gfn_t *nr_pages, bool write) { @@ -2730,6 +2735,11 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn, writable = NULL; } + if (async && memslot_is_nowait_on_fault(slot)) { + *async = false; + async = NULL; + } + return hva_to_pfn(addr, atomic, interruptible, async, write_fault, writable); } Ah, crud. The above highlights something I missed in v3. The memslot NOWAIT flag isn't tied to FOLL_NOWAIT, it's really truly a "fast-only" flag. And even more confusingly, KVM does set FOLL_NOWAIT, but for the async #PF case, which will get even more confusing if/when KVM uses FOLL_NOWAIT internally. Drat. I really like the NOWAIT name, but unfortunately it doesn't do what as the name says. I still don't love "fast-only" as that bleeds kernel internals to userspace. Anyone have ideas? Maybe something about not installing new mappings?
On Wed, Jun 14, 2023, Sean Christopherson wrote: > On Fri, Jun 02, 2023, Anish Moorthy wrote: > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > index 69a221f71914..abbc5dd72292 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -2297,4 +2297,10 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr) > > */ > > inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu, > > uint64_t gpa, uint64_t len, uint64_t flags); > > + > > +static inline bool kvm_slot_nowait_on_fault( > > + const struct kvm_memory_slot *slot) > > Just when I was starting to think that we had beat all of the Google3 out of you :-) > > And predicate helpers in KVM typically have "is" or "has" in the name, so that it's > clear the helper queries, versus e.g. sets the flag or something. > > > +{ > > + return slot->flags & KVM_MEM_NOWAIT_ON_FAULT; > > KVM is anything but consistent, but if the helper is likely to be called without > a known good memslot, it probably makes sense to have the helper check for NULL, > e.g. > > static inline bool kvm_is_slot_nowait_on_fault(const struct kvm_memory_slot *slot) > { > return slot && slot->flags & KVM_MEM_NOWAIT_ON_FAULT; > } > > However, do we actually need to force vendor code to query nowait? At a glance, > the only external (relative to kvm_main.c) users of __gfn_to_pfn_memslot() are > in flows that play nice with nowait or that don't support it at all. So I *think* > we can do this? > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index be06b09e9104..78024318286d 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -2403,6 +2403,11 @@ static bool memslot_is_readonly(const struct kvm_memory_slot *slot) > return slot->flags & KVM_MEM_READONLY; > } > > +static bool memslot_is_nowait_on_fault(const struct kvm_memory_slot *slot) > +{ > + return slot->flags & KVM_MEM_NOWAIT_ON_FAULT; > +} > + > static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn, > gfn_t *nr_pages, bool write) > { > @@ -2730,6 +2735,11 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn, > writable = NULL; > } > > + if (async && memslot_is_nowait_on_fault(slot)) { > + *async = false; > + async = NULL; > + } Gah, got turned around and forgot to account for @atomic. So this? if (!atomic && memslot_is_nowait_on_fault(slot)) { atomic = true; if (async) { *async = false; async = NULL; } } > + > return hva_to_pfn(addr, atomic, interruptible, async, write_fault, > writable); > } > > Ah, crud. The above highlights something I missed in v3. The memslot NOWAIT > flag isn't tied to FOLL_NOWAIT, it's really truly a "fast-only" flag. And even > more confusingly, KVM does set FOLL_NOWAIT, but for the async #PF case, which will > get even more confusing if/when KVM uses FOLL_NOWAIT internally. > > Drat. I really like the NOWAIT name, but unfortunately it doesn't do what as the > name says. > > I still don't love "fast-only" as that bleeds kernel internals to userspace. > Anyone have ideas? Maybe something about not installing new mappings?
On Thursday, June 15, 2023 5:21 AM, Sean Christopherson wrote: > On Fri, Jun 02, 2023, Anish Moorthy wrote: > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index > > 69a221f71914..abbc5dd72292 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -2297,4 +2297,10 @@ static inline void > kvm_account_pgtable_pages(void *virt, int nr) > > */ > > inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu, > > uint64_t gpa, uint64_t len, uint64_t flags); > > + > > +static inline bool kvm_slot_nowait_on_fault( > > + const struct kvm_memory_slot *slot) > > Just when I was starting to think that we had beat all of the Google3 out of > you :-) > > And predicate helpers in KVM typically have "is" or "has" in the name, so that it's > clear the helper queries, versus e.g. sets the flag or something. > > > +{ > > + return slot->flags & KVM_MEM_NOWAIT_ON_FAULT; > > KVM is anything but consistent, but if the helper is likely to be called without a > known good memslot, it probably makes sense to have the helper check for > NULL, e.g. > > static inline bool kvm_is_slot_nowait_on_fault(const struct kvm_memory_slot > *slot) { > return slot && slot->flags & KVM_MEM_NOWAIT_ON_FAULT; } > > However, do we actually need to force vendor code to query nowait? At a > glance, the only external (relative to kvm_main.c) users of > __gfn_to_pfn_memslot() are in flows that play nice with nowait or that don't > support it at all. So I *think* we can do this? > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index > be06b09e9104..78024318286d 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -2403,6 +2403,11 @@ static bool memslot_is_readonly(const struct > kvm_memory_slot *slot) > return slot->flags & KVM_MEM_READONLY; } > > +static bool memslot_is_nowait_on_fault(const struct kvm_memory_slot > +*slot) { > + return slot->flags & KVM_MEM_NOWAIT_ON_FAULT; } > + > static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, > gfn_t gfn, > gfn_t *nr_pages, bool write) { @@ -2730,6 +2735,11 @@ > kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t > gfn, > writable = NULL; > } > > + if (async && memslot_is_nowait_on_fault(slot)) { > + *async = false; > + async = NULL; > + } > + > return hva_to_pfn(addr, atomic, interruptible, async, write_fault, > writable); > } > > Ah, crud. The above highlights something I missed in v3. The memslot NOWAIT > flag isn't tied to FOLL_NOWAIT, it's really truly a "fast-only" flag. And even > more confusingly, KVM does set FOLL_NOWAIT, but for the async #PF case, > which will get even more confusing if/when KVM uses FOLL_NOWAIT internally. > > Drat. I really like the NOWAIT name, but unfortunately it doesn't do what as the > name says. > > I still don't love "fast-only" as that bleeds kernel internals to userspace. > Anyone have ideas? Maybe something about not installing new mappings? Yes, "NOWAIT" sounds a bit confusing here. If this is a patch applied to userfaultfd to solve the "wait" issue on queuing/handling faults, then it would make sense. But this is a KVM specific solution, which is not directly related to userfaultfd, and it's not related to FOLL_NOWAIT. There seems nothing to wait in the KVM context here. Why not just name the cap as what it does (i.e. something to indicate the cap of having the fault exited to userspace to handle), e.g. KVM_CAP_EXIT_ON_FAULT or KVM_CAP_USERSPACE_FAULT.
On Thu, Jun 15, 2023, Wei W Wang wrote: > On Thursday, June 15, 2023 5:21 AM, Sean Christopherson wrote: > > Ah, crud. The above highlights something I missed in v3. The memslot NOWAIT > > flag isn't tied to FOLL_NOWAIT, it's really truly a "fast-only" flag. And even > > more confusingly, KVM does set FOLL_NOWAIT, but for the async #PF case, > > which will get even more confusing if/when KVM uses FOLL_NOWAIT internally. > > > > Drat. I really like the NOWAIT name, but unfortunately it doesn't do what as the > > name says. > > > > I still don't love "fast-only" as that bleeds kernel internals to userspace. > > Anyone have ideas? Maybe something about not installing new mappings? > > Yes, "NOWAIT" sounds a bit confusing here. If this is a patch applied to userfaultfd > to solve the "wait" issue on queuing/handling faults, then it would make sense. > But this is a KVM specific solution, which is not directly related to userfaultfd, and > it's not related to FOLL_NOWAIT. There seems nothing to wait in the KVM context > here. > > Why not just name the cap as what it does (i.e. something to indicate the cap of > having the fault exited to userspace to handle), e.g. KVM_CAP_EXIT_ON_FAULT > or KVM_CAP_USERSPACE_FAULT. Because that's even further away from the truth when accounting for the fact that the flag controls behavior when handling are *guest* faults. The memslot flag doesn't cause KVM to exit on every guest fault. And USERSPACE_FAULT is far too vague; KVM constantly faults in userspace mappings, the flag needs to communicate that KVM *won't* do that for guest accesses. Something like KVM_MEM_NO_USERFAULT_ON_GUEST_ACCESS? Ridiculously verbose, but I think it captures the KVM behavior, and "guest access" instead of "guest fault" gives KVM some wiggle room, e.g. the name won't become stale if we figure out a way to apply the behavior to KVM emulation of guest accesses in the future.
On Thursday, June 15, 2023 10:56 PM, Sean Christopherson wrote: > Because that's even further away from the truth when accounting for the fact > that the flag controls behavior when handling are *guest* faults. The What do you mean by guest faults here? I think more precisely, it's host page fault triggered by guest access (through host GUP though), isn't it? When the flag is set, we want to have this fault to be handled by userspace? > memslot flag doesn't cause KVM to exit on every guest fault. And > USERSPACE_FAULT is far too vague; KVM constantly faults in userspace > mappings, the flag needs to communicate that KVM *won't* do that for guest > accesses. I was trying to meant USERSPACE_FAULT_HANDLING. > > Something like KVM_MEM_NO_USERFAULT_ON_GUEST_ACCESS? Ridiculously Yeah, it's kind of verbose. Was your intension for "NO_USERFAULT" to mean bypassing the userfaultfd mechanism?
> Rather than force architectures to add the extension, probably better to use a > > config HAVE_KVM_NOWAIT_ON_FAULT > bool > > and select that from arch Kconfigs. That way the enumeration can be done in > common code, and then this can be computed at compile time doesn't need to do a > rather weird invocation of kvm_dev_ioctl() with KVM_CHECK_EXTENSION. Done. If I'm reading this correctly you also want me to move the logic for checking the cap out of the arch-specific kvm_vm_ioctl_check_extension() and into kvm_vm_ioctl_check_extension_generic(), which I've done too.
On Wed, Jun 14, 2023 at 2:20 PM Sean Christopherson <seanjc@google.com> wrote: > > > +static inline bool kvm_slot_nowait_on_fault( > > + const struct kvm_memory_slot *slot) > > Just when I was starting to think that we had beat all of the Google3 out of you :-) I was trying to stay under the line limit here :( But you've commented on that elsewhere. Fixed (hopefully :) > And predicate helpers in KVM typically have "is" or "has" in the name, so that it's > clear the helper queries, versus e.g. sets the flag or something. Done > KVM is anything but consistent, but if the helper is likely to be called without > a known good memslot, it probably makes sense to have the helper check for NULL, > e.g. Done: I was doing the null checks in other ways in the arch-specific code, but yeah it's easier to centralize that here. > However, do we actually need to force vendor code to query nowait? At a glance, > the only external (relative to kvm_main.c) users of __gfn_to_pfn_memslot() are > in flows that play nice with nowait or that don't support it at all. So I *think* > we can do this? > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index be06b09e9104..78024318286d 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -2403,6 +2403,11 @@ static bool memslot_is_readonly(const struct kvm_memory_slot *slot) > return slot->flags & KVM_MEM_READONLY; > } > > +static bool memslot_is_nowait_on_fault(const struct kvm_memory_slot *slot) > +{ > + return slot->flags & KVM_MEM_NOWAIT_ON_FAULT; > +} > + > static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn, > gfn_t *nr_pages, bool write) > { > @@ -2730,6 +2735,11 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn, > writable = NULL; > } > > + if (async && memslot_is_nowait_on_fault(slot)) { > + *async = false; > + async = NULL; > + } > + > return hva_to_pfn(addr, atomic, interruptible, async, write_fault, > writable); > } Hmm, well not having to modify the vendor code would be nice... but I'll have to look more at __gfn_to_pfn_memslot()'s callers (and probably send more questions your way :). Hopefully it works out more like what you suggest.
> > However, do we actually need to force vendor code to query nowait? At a glance, > > the only external (relative to kvm_main.c) users of __gfn_to_pfn_memslot() are > > in flows that play nice with nowait or that don't support it at all. So I *think* > > we can do this? > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > index be06b09e9104..78024318286d 100644 > > --- a/virt/kvm/kvm_main.c > > +++ b/virt/kvm/kvm_main.c > > @@ -2403,6 +2403,11 @@ static bool memslot_is_readonly(const struct kvm_memory_slot *slot) > > return slot->flags & KVM_MEM_READONLY; > > } > > > > +static bool memslot_is_nowait_on_fault(const struct kvm_memory_slot *slot) > > +{ > > + return slot->flags & KVM_MEM_NOWAIT_ON_FAULT; > > +} > > + > > static unsigned long __gfn_to_hva_many(const struct kvm_memory_slot *slot, gfn_t gfn, > > gfn_t *nr_pages, bool write) > > { > > @@ -2730,6 +2735,11 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn, > > writable = NULL; > > } > > > > + if (async && memslot_is_nowait_on_fault(slot)) { > > + *async = false; > > + async = NULL; > > + } > > + > > return hva_to_pfn(addr, atomic, interruptible, async, write_fault, > > writable); > > } > > Hmm, well not having to modify the vendor code would be nice... but > I'll have to look more at __gfn_to_pfn_memslot()'s callers (and > probably send more questions your way :). Hopefully it works out more > like what you suggest. I took a look of my own, and I don't think moving the nowait query into __gfn_to_pfn_memslot() would work. At issue is the actual behavior of KVM_CAP_NOWAIT_ON_FAULT, which I documented as follows: > The presence of this capability indicates that userspace may pass the > KVM_MEM_NOWAIT_ON_FAULT flag to KVM_SET_USER_MEMORY_REGION to cause KVM_RUN > to fail (-EFAULT) in response to page faults for which resolution would require > the faulting thread to sleep. Moving the nowait check out of __kvm_faultin_pfn()/user_mem_abort() and into __gfn_to_pfn_memslot() means that, obviously, other callers will start to see behavior changes. Some of that is probably actually necessary for that documentation to be accurate (since any usages of __gfn_to_pfn_memslot() under KVM_RUN should respect the memslot flag), but I think there are consumers of __gfn_to_pfn_memslot() from outside KVM_RUN. Anyways, after some searching on my end: I think the only caller of __gfn_to_pfn_memslot() in core kvm/x86/arm64 where moving the "nowait" check into the function actually changes anything is gfn_to_pfn(). But that function gets called from vmx_vcpu_create() (through kvm_alloc_apic_access_page()), and *that* certainly doesn't look like something KVM_RUN does or would ever call.
On Fri, Jul 07, 2023, Anish Moorthy wrote: > > Hmm, well not having to modify the vendor code would be nice... but > > I'll have to look more at __gfn_to_pfn_memslot()'s callers (and > > probably send more questions your way :). Hopefully it works out more > > like what you suggest. > > I took a look of my own, and I don't think moving the nowait query > into __gfn_to_pfn_memslot() would work. At issue is the actual > behavior of KVM_CAP_NOWAIT_ON_FAULT, which I documented as follows: > > > The presence of this capability indicates that userspace may pass the > > KVM_MEM_NOWAIT_ON_FAULT flag to KVM_SET_USER_MEMORY_REGION to cause KVM_RUN > > to fail (-EFAULT) in response to page faults for which resolution would require > > the faulting thread to sleep. Well, that description is wrong for other reasons. As mentioned in my reply (got snipped), the behavior is not tied to sleeping or waiting on I/O. > Moving the nowait check out of __kvm_faultin_pfn()/user_mem_abort() > and into __gfn_to_pfn_memslot() means that, obviously, other callers > will start to see behavior changes. Some of that is probably actually > necessary for that documentation to be accurate (since any usages of > __gfn_to_pfn_memslot() under KVM_RUN should respect the memslot flag), > but I think there are consumers of __gfn_to_pfn_memslot() from outside > KVM_RUN. Yeah, replace "in response to page faults" with something along the lines of "if an access in guest context ..." > Anyways, after some searching on my end: I think the only caller of > __gfn_to_pfn_memslot() in core kvm/x86/arm64 where moving the "nowait" > check into the function actually changes anything is gfn_to_pfn(). But > that function gets called from vmx_vcpu_create() (through > kvm_alloc_apic_access_page()), and *that* certainly doesn't look like > something KVM_RUN does or would ever call. Correct, but that particular gfn_to_pfn() works on a KVM-internal memslot, i.e. will never have the "fast-only" flag set. hva = __x86_set_memory_region(kvm, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, <=== APIC_DEFAULT_PHYS_BASE, PAGE_SIZE); if (IS_ERR(hva)) { ret = PTR_ERR(hva); goto out; } page = gfn_to_page(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT); if (is_error_page(page)) { ret = -EFAULT; goto out; } On x86, there should not be any other usages of user memslots outside of KVM_RUN. arm64 is unfortunately a different story (see this thread[*]), but we may be able to solve that with a documentation update. I *think* the accesses are limited to the sub-ioctl KVM_DEV_ARM_VGIC_GRP_CTRL, and more precisely the sub-sub-ioctls KVM_DEV_ARM_ITS_{SAVE,RESTORE}_TABLES and KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES. [*] https://lore.kernel.org/all/Y1ghIKrAsRFwSFsO@google.com
On Wed, Jun 14, 2023 at 2:23 PM Sean Christopherson <seanjc@google.com> wrote: > > Gah, got turned around and forgot to account for @atomic. So this? > > if (!atomic && memslot_is_nowait_on_fault(slot)) { > atomic = true; > if (async) { > *async = false; > async = NULL; > } > } > > > + > > return hva_to_pfn(addr, atomic, interruptible, async, write_fault, > > writable); > > } Makes sense to me, although I think the documentation for hva_to_pfn(), where those async/atomic parameters eventually feed into, is slightly off > /* > * Pin guest page in memory and return its pfn. > * @addr: host virtual address which maps memory to the guest > * @atomic: whether this function can sleep > ... > */ > kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible, > bool *async, bool write_fault, bool *writable) I initially read this as "atomic == true if function can sleep," but I think it actually means to say "atomic == true if function can *not* sleep". So I'll add a patch to change the line to > @atomic: whether this function is disallowed from sleeping I'm pretty sure I have things straight: if I don't though, then we can't upgrade the __gfn_to_pfn_memslot() calls to "atomic=true" like you suggested above.
On Tue, Jul 11, 2023 at 8:29 AM Sean Christopherson <seanjc@google.com> wrote: > > Well, that description is wrong for other reasons. As mentioned in my reply > (got snipped), the behavior is not tied to sleeping or waiting on I/O. > > > Moving the nowait check out of __kvm_faultin_pfn()/user_mem_abort() > > and into __gfn_to_pfn_memslot() means that, obviously, other callers > > will start to see behavior changes. Some of that is probably actually > > necessary for that documentation to be accurate (since any usages of > > __gfn_to_pfn_memslot() under KVM_RUN should respect the memslot flag), > > but I think there are consumers of __gfn_to_pfn_memslot() from outside > > KVM_RUN. > > Yeah, replace "in response to page faults" with something along the lines of "if > an access in guest context ..." Alright, how about + KVM_MEM_NO_USERFAULT_ON_GUEST_ACCESS + The presence of this capability indicates that userspace may pass the + KVM_MEM_NO_USERFAULT_ON_GUEST_ACCESS flag to + KVM_SET_USER_MEMORY_REGION. Said flag will cause KVM_RUN to fail (-EFAULT) + in response to guest-context memory accesses which would require KVM + to page fault on the userspace mapping. Although, as Wang mentioned, USERFAULT seems to suggest something related to userfaultfd which is a liiiiitle too specific. Perhaps we should use USERSPACE_FAULT (*cries*) instead? On Wed, Jun 14, 2023 at 2:20 PM Sean Christopherson <seanjc@google.com> wrote: > > However, do we actually need to force vendor code to query nowait? At a glance, > the only external (relative to kvm_main.c) users of __gfn_to_pfn_memslot() are > in flows that play nice with nowait or that don't support it at all. So I *think* > we can do this? On Wed, Jun 14, 2023 at 2:23 PM Sean Christopherson <seanjc@google.com> wrote: > > Gah, got turned around and forgot to account for @atomic. So this? > > if (!atomic && memslot_is_nowait_on_fault(slot)) { > atomic = true; > if (async) { > *async = false; > async = NULL; > } > } > > > + > > return hva_to_pfn(addr, atomic, interruptible, async, write_fault, > > writable); > > } Took another look at this and I *think* it works too (I added my notes at the end here so if anyone wants to double-check they can). But there are a couple of quirks 1. The memslot flag can cause new __gfn_to_pfn_memslot() failures in kvm_vcpu_map() (good thing!) but those failures result in an EINVAL (bad!). It kinda looks like the current is_error_noslot_pfn() check in that function should be returning EFAULT anyways though, any opinions? 2. kvm_vm_ioctl_mte_copy_tags() will see new failures. This function has come up before (a) and it doesn't seem like an access in a guest context. Is this something to just be documented away? 3. I don't think I've caught parts of the who-calls tree that are in drivers/. The one part I know about is the gfn_to_pfn() call in is_2MB_gtt_possible() (drivers/gpu/drm/i915/gvt/gtt.c), and I'm not sure what to do about it. Again, doesn't look like a guest-context access. (a) https://lore.kernel.org/kvm/ZIoiLGotFsDDvRN3@google.com/T/#u --------------------------------------------------- Notes: Tracing the usages of __gfn_to_pfn_memslot() "shove" = "moving the nowait check inside of __gfn_to_pfn_memslot * [x86] __gfn_to_pfn_memslot() has 5 callers ** DONE kvm_faultin_pfn() accounts for two calls, shove will cause bail (as intended) after first ** DONE __gfn_to_pfn_prot(): No usages on x86 ** DONE __gfn_to_pfn_memslot_atomic: already requires atomic access :) ** gfn_to_pfn_memslot() has two callers *** DONE kvm_vcpu_gfn_to_pfn(): No callers *** gfn_to_pfn() has two callers **** TODO kvm_vcpu_map() When memslot flag trips will get KVM_PFN_ERR_FAULT, error is handled HOWEVER it will return -EINVAL which is kinda... not right **** gfn_to_page() has two callers, but both operate on APIC_DEFAULT_PHYS_BASE addr ** Ok so that's "done," as long as my LSP is reliable * [arm64] __gfn_to_pfn_memslot() has 4 callers ** DONE user_mem_abort() has one, accounted for by the subsequent is_error_noslot_pfn() ** DONE __gfn_to_pfn_memslot_atomic() fine as above ** TODO gfn_to_pfn_prot() One caller in kvm_vm_ioctl_mte_copy_tags() There's a is_error_noslot_pfn() to catch the failure, but there's no vCPU floating around to annotate a fault in! ** gfn_to_pfn_memslot() two callers *** DONE kvm_vcpu_gfn_to_pfn() no callers *** gfn_to_pfn() two callers **** kvm_vcpu_map() as above **** DONE gfn_to_page() no callers * TODO Weird driver code reference I discovered only via ripgrep?
On Thu, Aug 24, 2023, Anish Moorthy wrote: > On Tue, Jul 11, 2023 at 8:29 AM Sean Christopherson <seanjc@google.com> wrote: > > > > Well, that description is wrong for other reasons. As mentioned in my reply > > (got snipped), the behavior is not tied to sleeping or waiting on I/O. > > > > > Moving the nowait check out of __kvm_faultin_pfn()/user_mem_abort() > > > and into __gfn_to_pfn_memslot() means that, obviously, other callers > > > will start to see behavior changes. Some of that is probably actually > > > necessary for that documentation to be accurate (since any usages of > > > __gfn_to_pfn_memslot() under KVM_RUN should respect the memslot flag), > > > but I think there are consumers of __gfn_to_pfn_memslot() from outside > > > KVM_RUN. > > > > Yeah, replace "in response to page faults" with something along the lines of "if > > an access in guest context ..." > > Alright, how about > > + KVM_MEM_NO_USERFAULT_ON_GUEST_ACCESS > + The presence of this capability indicates that userspace may pass the > + KVM_MEM_NO_USERFAULT_ON_GUEST_ACCESS flag to > + KVM_SET_USER_MEMORY_REGION. Said flag will cause KVM_RUN to fail (-EFAULT) > + in response to guest-context memory accesses which would require KVM > + to page fault on the userspace mapping. > > Although, as Wang mentioned, USERFAULT seems to suggest something > related to userfaultfd which is a liiiiitle too specific. Perhaps we > should use USERSPACE_FAULT (*cries*) instead? Heh, it's not strictly on guest accesses though. At this point, I'm tempted to come up with some completely arbitrary name for the feature and give up on trying to describe its effects in the name itself. > On Wed, Jun 14, 2023 at 2:20 PM Sean Christopherson <seanjc@google.com> wrote: > > > > However, do we actually need to force vendor code to query nowait? At a glance, > > the only external (relative to kvm_main.c) users of __gfn_to_pfn_memslot() are > > in flows that play nice with nowait or that don't support it at all. So I *think* > > we can do this? > > On Wed, Jun 14, 2023 at 2:23 PM Sean Christopherson <seanjc@google.com> wrote: > > > > Gah, got turned around and forgot to account for @atomic. So this? > > > > if (!atomic && memslot_is_nowait_on_fault(slot)) { > > atomic = true; > > if (async) { > > *async = false; > > async = NULL; > > } > > } > > > > > + > > > return hva_to_pfn(addr, atomic, interruptible, async, write_fault, > > > writable); > > > } > > Took another look at this and I *think* it works too (I added my notes > at the end here so if anyone wants to double-check they can). But > there are a couple of quirks > > 1. The memslot flag can cause new __gfn_to_pfn_memslot() failures in > kvm_vcpu_map() (good thing!) but those failures result in an EINVAL > (bad!). It kinda looks like the current is_error_noslot_pfn() check in > that function should be returning EFAULT anyways though, any opinions? Argh, it "needs" to return -EINVAL because KVM "needs" to inject a #GP if the guest accesses a non-existent PFN in various nested SVM flows. It's somewhat of a moot point though, because kvm_vcpu_map() can't fail, KVM just isn't equipped to report such failures out to userspace. > 2. kvm_vm_ioctl_mte_copy_tags() will see new failures. This function > has come up before (a) and it doesn't seem like an access in a guest > context. Is this something to just be documented away? We'll need a way to way for KVM to opt-out for kvm_vcpu_map(), at which point it makes sense to opt-out for kvm_vm_ioctl_mte_copy_tags() as well. > 3. I don't think I've caught parts of the who-calls tree that are in > drivers/. The one part I know about is the gfn_to_pfn() call in > is_2MB_gtt_possible() (drivers/gpu/drm/i915/gvt/gtt.c), and I'm not > sure what to do about it. Again, doesn't look like a guest-context > access. On x86, that _was_ the only one. You're welcome ;-) https://lore.kernel.org/all/20230729013535.1070024-9-seanjc@google.com
On Tue, Aug 29, 2023 at 3:42 PM Sean Christopherson <seanjc@google.com> wrote: > > On Thu, Aug 24, 2023, Anish Moorthy wrote: > > On Tue, Jul 11, 2023 at 8:29 AM Sean Christopherson <seanjc@google.com> wrote: > > > > > > Well, that description is wrong for other reasons. As mentioned in my reply > > > (got snipped), the behavior is not tied to sleeping or waiting on I/O. > > > > > > > Moving the nowait check out of __kvm_faultin_pfn()/user_mem_abort() > > > > and into __gfn_to_pfn_memslot() means that, obviously, other callers > > > > will start to see behavior changes. Some of that is probably actually > > > > necessary for that documentation to be accurate (since any usages of > > > > __gfn_to_pfn_memslot() under KVM_RUN should respect the memslot flag), > > > > but I think there are consumers of __gfn_to_pfn_memslot() from outside > > > > KVM_RUN. > > > > > > Yeah, replace "in response to page faults" with something along the lines of "if > > > an access in guest context ..." > > > > Alright, how about > > > > + KVM_MEM_NO_USERFAULT_ON_GUEST_ACCESS > > + The presence of this capability indicates that userspace may pass the > > + KVM_MEM_NO_USERFAULT_ON_GUEST_ACCESS flag to > > + KVM_SET_USER_MEMORY_REGION. Said flag will cause KVM_RUN to fail (-EFAULT) > > + in response to guest-context memory accesses which would require KVM > > + to page fault on the userspace mapping. > > > > Although, as Wang mentioned, USERFAULT seems to suggest something > > related to userfaultfd which is a liiiiitle too specific. Perhaps we > > should use USERSPACE_FAULT (*cries*) instead? > > Heh, it's not strictly on guest accesses though. Is the inaccuracy just because of the KVM_DEV_ARM_VGIC_GRP_CTRL disclaimer, or something else? I thought that "guest-context accesses" would capture the flag affecting memory accesses that KVM emulates for the guest as well, in addition to the "normal" EPT-violation -> page fault path. But if that's still not totally accurate then you should probably just spell this out for me. > At this point, I'm tempted to come up with some completely arbitrary name for the > feature and give up on trying to describe its effects in the name itself. You know, Oliver may have made an inspired suggestion a while back... On Mon, Mar 20, 2023 at 3:22 PM Oliver Upton <oliver.upton@linux.dev> wrote: > > I couldn't care less about what the user-facing portion of this thing is > called, TBH. We could just refer to it as KVM_MEM_BIT_2 /s > > On Wed, Jun 14, 2023 at 2:20 PM Sean Christopherson <seanjc@google.com> wrote: > > We'll need a way to way for KVM to opt-out for kvm_vcpu_map(), at which point it > makes sense to opt-out for kvm_vm_ioctl_mte_copy_tags() as well. Uh oh, I sense another parameter to __gfn_to_pfn_memslot(). Although I did see that David Stevens has been proposing cleanups to that code [1]. Is proper practice here to take a dependency on his series, do we just resolve the conflicts when the series are merged, or something else? [1] https://lore.kernel.org/kvm/20230824080408.2933205-1-stevensd@google.com/ > > 3. I don't think I've caught parts of the who-calls tree that are in > > drivers/. The one part I know about is the gfn_to_pfn() call in > > is_2MB_gtt_possible() (drivers/gpu/drm/i915/gvt/gtt.c), and I'm not > > sure what to do about it. Again, doesn't look like a guest-context > > access. > > On x86, that _was_ the only one. You're welcome ;-) > > https://lore.kernel.org/all/20230729013535.1070024-9-seanjc@google.com Much obliged :D
On Wed, Aug 30, 2023, Anish Moorthy wrote: > On Tue, Aug 29, 2023 at 3:42 PM Sean Christopherson <seanjc@google.com> wrote: > > > > On Thu, Aug 24, 2023, Anish Moorthy wrote: > > > On Tue, Jul 11, 2023 at 8:29 AM Sean Christopherson <seanjc@google.com> wrote: > > > > > > > > Well, that description is wrong for other reasons. As mentioned in my reply > > > > (got snipped), the behavior is not tied to sleeping or waiting on I/O. > > > > > > > > > Moving the nowait check out of __kvm_faultin_pfn()/user_mem_abort() > > > > > and into __gfn_to_pfn_memslot() means that, obviously, other callers > > > > > will start to see behavior changes. Some of that is probably actually > > > > > necessary for that documentation to be accurate (since any usages of > > > > > __gfn_to_pfn_memslot() under KVM_RUN should respect the memslot flag), > > > > > but I think there are consumers of __gfn_to_pfn_memslot() from outside > > > > > KVM_RUN. > > > > > > > > Yeah, replace "in response to page faults" with something along the lines of "if > > > > an access in guest context ..." > > > > > > Alright, how about > > > > > > + KVM_MEM_NO_USERFAULT_ON_GUEST_ACCESS > > > + The presence of this capability indicates that userspace may pass the > > > + KVM_MEM_NO_USERFAULT_ON_GUEST_ACCESS flag to > > > + KVM_SET_USER_MEMORY_REGION. Said flag will cause KVM_RUN to fail (-EFAULT) > > > + in response to guest-context memory accesses which would require KVM > > > + to page fault on the userspace mapping. > > > > > > Although, as Wang mentioned, USERFAULT seems to suggest something > > > related to userfaultfd which is a liiiiitle too specific. Perhaps we > > > should use USERSPACE_FAULT (*cries*) instead? > > > > Heh, it's not strictly on guest accesses though. > > Is the inaccuracy just because of the KVM_DEV_ARM_VGIC_GRP_CTRL > disclaimer, or something else? I thought that "guest-context accesses" > would capture the flag affecting memory accesses that KVM emulates for > the guest as well, in addition to the "normal" EPT-violation -> page > fault path. But if that's still not totally accurate then you should > probably just spell this out for me. A pedantic interpretation of "on guest access" could be that the flag would only apply to accesses from the guest itself, i.e. not any emulated accesses. In general, I think we should avoid having the name describe when KVM honors the flag, because it'll limit our ability to extend KVM functionality, and I doubt we'll ever be 100% accurate, e.g. guest emulation that "needs" kvm_vcpu_map() will ignore the flag. Regarding USERFAULT, why not lean into that instead of trying to avoid it? The behavior *is* related to userfaultfd; not in code, but certainly in its purpose. I don't think it's a stretch to say that userfault doesn't _just_ mean the fault is induced by userspace, it also means that the fault is relayed to userspace. And we can even borrow some amount of UFFD nomenclature to make it easier for userspace to understand the purpose. For initial support, I'm thinking KVM_MEM_USERFAULT_ON_MISSING i.e. generate a "user fault" when the mapping is missing. That would give us leeway for future expansion, e.g. if someday there's a use case for generating a userfault exit on major faults but not on missing mappings or minor fault, we could add KVM_MEM_USERFAULT_ON_MAJOR. > > > On Wed, Jun 14, 2023 at 2:20 PM Sean Christopherson <seanjc@google.com> wrote: > > > > We'll need a way to way for KVM to opt-out for kvm_vcpu_map(), at which point it > > makes sense to opt-out for kvm_vm_ioctl_mte_copy_tags() as well. > > Uh oh, I sense another parameter to __gfn_to_pfn_memslot(). Although I > did see that David Stevens has been proposing cleanups to that code > [1]. Is proper practice here to take a dependency on his series, do we > just resolve the conflicts when the series are merged, or something > else? No, don't take a dependency. At this point, it's a coin toss as to which series will be ready first, taking a dependency could unnecessarily slow this series down and/or generate pointless work. Whoever "loses" is likely going to have a somewhat painful rebase to deal with, but I can help on that front if/when the time comes. As for what is "proper practice", it's always a case-by-case basis, but a good rule of thumb is to default to letting the maintainer handle conflicts (though definitely call out any known conflicts to make life easier for everyone), and if you suspect that your series will have non-trivial conflicts, ask for guidance (like you just did).
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 5b24059143b3..9daadbe2c7ed 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -1312,6 +1312,7 @@ yet and must be cleared on entry. /* for kvm_userspace_memory_region::flags */ #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) #define KVM_MEM_READONLY (1UL << 1) + #define KVM_MEM_NOWAIT_ON_FAULT (1UL << 2) This ioctl allows the user to create, modify or delete a guest physical memory slot. Bits 0-15 of "slot" specify the slot id and this value @@ -1342,12 +1343,15 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr be identical. This allows large pages in the guest to be backed by large pages in the host. -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of +The flags field supports three flags + +1. KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it, +use it. +2. KVM_MEM_READONLY: can be set, if KVM_CAP_READONLY_MEM capability allows it, to make a new slot read-only. In this case, writes to this memory will be posted to userspace as KVM_EXIT_MMIO exits. +3. KVM_MEM_NOWAIT_ON_FAULT: see KVM_CAP_NOWAIT_ON_FAULT for details. When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of the memory region are automatically reflected into the guest. For example, an @@ -7776,6 +7780,28 @@ userspace may receive "bare" EFAULTs (i.e. exit reason != KVM_EXIT_MEMORY_FAULT) from KVM_RUN for failures which may be resolvable. These should be considered bugs and reported to the maintainers so that annotations can be added. +7.35 KVM_CAP_NOWAIT_ON_FAULT +---------------------------- + +:Architectures: None +:Returns: -EINVAL. + +The presence of this capability indicates that userspace may pass the +KVM_MEM_NOWAIT_ON_FAULT flag to KVM_SET_USER_MEMORY_REGION to cause KVM_RUN +to fail (-EFAULT) in response to page faults for which resolution would require +the faulting thread to sleep. + +The range of guest physical memory causing the fault is advertised to userspace +through KVM_CAP_MEMORY_FAULT_INFO. + +Userspace should determine how best to make the mapping present, then take +appropriate action. For instance establishing the mapping could involve a +MADV_POPULATE_READ|WRITE, in the context of userfaultfd a UFFDIO_COPY|CONTINUE +could be appropriate, etc. After establishing the mapping, userspace can return +to KVM to retry the memory access. + +Attempts to enable this capability directly will fail. + 8. Other capabilities. ====================== diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 69a221f71914..abbc5dd72292 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -2297,4 +2297,10 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr) */ inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu, uint64_t gpa, uint64_t len, uint64_t flags); + +static inline bool kvm_slot_nowait_on_fault( + const struct kvm_memory_slot *slot) +{ + return slot->flags & KVM_MEM_NOWAIT_ON_FAULT; +} #endif diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 143abb334f56..595c3d7d36aa 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -102,6 +102,7 @@ struct kvm_userspace_memory_region { */ #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) #define KVM_MEM_READONLY (1UL << 1) +#define KVM_MEM_NOWAIT_ON_FAULT (1UL << 2) /* for KVM_IRQ_LINE */ struct kvm_irq_level { @@ -1198,6 +1199,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226 #define KVM_CAP_COUNTER_OFFSET 227 #define KVM_CAP_MEMORY_FAULT_INFO 228 +#define KVM_CAP_NOWAIT_ON_FAULT 229 #ifdef KVM_CAP_IRQ_ROUTING diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h index 5476fe169921..f64845cd599f 100644 --- a/tools/include/uapi/linux/kvm.h +++ b/tools/include/uapi/linux/kvm.h @@ -102,6 +102,7 @@ struct kvm_userspace_memory_region { */ #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) #define KVM_MEM_READONLY (1UL << 1) +#define KVM_MEM_NOWAIT_ON_FAULT (1UL << 2) /* for KVM_IRQ_LINE */ struct kvm_irq_level { diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 05d6e7e3994d..2c276d4d0821 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1527,6 +1527,9 @@ static int check_memory_region_flags(const struct kvm_userspace_memory_region *m valid_flags |= KVM_MEM_READONLY; #endif + if (kvm_vm_ioctl_check_extension(NULL, KVM_CAP_NOWAIT_ON_FAULT)) + valid_flags |= KVM_MEM_NOWAIT_ON_FAULT; + if (mem->flags & ~valid_flags) return -EINVAL;
Add documentation, memslot flags, useful helper functions, and the actual new capability itself. Memory fault exits on absent mappings are particularly useful for userfaultfd-based postcopy live migration. When many vCPUs fault on a single userfaultfd the faults can take a while to surface to userspace due to having to contend for uffd wait queue locks. Bypassing the uffd entirely by returning information directly to the vCPU exit avoids this contention and improves the fault rate. Suggested-by: James Houghton <jthoughton@google.com> Signed-off-by: Anish Moorthy <amoorthy@google.com> --- Documentation/virt/kvm/api.rst | 32 +++++++++++++++++++++++++++++--- include/linux/kvm_host.h | 6 ++++++ include/uapi/linux/kvm.h | 2 ++ tools/include/uapi/linux/kvm.h | 1 + virt/kvm/kvm_main.c | 3 +++ 5 files changed, 41 insertions(+), 3 deletions(-)