Message ID | 20241004140810.34231-3-nikwip@amazon.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM: x86: Introduce new ioctl KVM_HYPERV_SET_TLB_FLUSH_INHIBIT | expand |
Nikolas Wipper <nikwip@amazon.de> writes: > Introduce a suspension state for Hyper-V enlightened vCPUs. Microsoft's > "Hypervisor Top Level Functional Specification" (TLFS) introduces this > state as a "vCPU that is stopped on a instruction guest boundary, either > explicitly or implicitly due to an intercept". The only documented > explicit suspension is caused in response to a TLB Flush hypercall, which > will use the state switching API in subsequent patches. > > Each Hyper-V vCPU now has a 'suspended' boolean field, which is checked > before entering the guest. When set, it forces the vCPU to block. Once in > that state, the vCPU ignores any events. The vCPU is unsuspended by > clearing 'suspend' and issuing a request to force the vCPU out of sleep. > > Suspensions are issued as a mechanism to halt execution until state change > is observed on a remote vCPU. Hyper-V vCPUs track this with 'waiting_on', > which holds the 'vcpu_id' of the remote vCPU that forced the vCPU to enter > the suspended state. It's the remote vCPU's responsibility to wake up the > suspended vCPUs when ready. 'waiting_on' ensures the remote vCPU can > selectively unsuspend vCPUs that blocked on its behalf while leaving other > suspended vCPUs undisturbed. One vCPU can only be suspended due to a > single remote vCPU, but different vCPUs can be suspended on behalf of > different or the same remote vCPU(s). The guest is responsible for > avoiding circular dependencies between suspended vCPUs. > > Callers of the suspend API are responsible for ensuring that suspend and > unsuspend aren't called in parallel while targeting the same pair of > vCPUs. Otherwise kvm_hv_vcpu_{un}suspend_tlb_flush() ensure 'waiting_on' > and 'suspended' are updated and accessed in the correct order. This, for > example, avoids races where the unsuspended vCPU re-suspends before > kvm_hv_vcpu_unsuspend_tlb_flush() is done updating 'waiting_on'. > > Signed-off-by: Nikolas Wipper <nikwip@amazon.de> > --- > arch/x86/include/asm/kvm_host.h | 3 +++ > arch/x86/kvm/hyperv.c | 30 ++++++++++++++++++++++++++++++ > arch/x86/kvm/hyperv.h | 17 +++++++++++++++++ > arch/x86/kvm/x86.c | 4 +++- > 4 files changed, 53 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 46e0a466d7fb..7571ac578884 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -695,6 +695,9 @@ struct kvm_vcpu_hv { > u64 vm_id; > u32 vp_id; > } nested; > + > + bool suspended; > + int waiting_on; I don't quite understand why we need 'suspended' at all. Isn't it always suspended when 'waiting_on != -1'? I can see we always update these two in pair. Also, I would suggest we use a more descriptive name. 'waiting_on_vcpu_id', for example. > }; > > struct kvm_hypervisor_cpuid { > diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c > index 4f0a94346d00..6e7941ed25ae 100644 > --- a/arch/x86/kvm/hyperv.c > +++ b/arch/x86/kvm/hyperv.c > @@ -971,6 +971,7 @@ int kvm_hv_vcpu_init(struct kvm_vcpu *vcpu) > > vcpu->arch.hyperv = hv_vcpu; > hv_vcpu->vcpu = vcpu; > + hv_vcpu->waiting_on = -1; > > synic_init(&hv_vcpu->synic); > > @@ -2915,3 +2916,32 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid, > > return 0; > } > + > +void kvm_hv_vcpu_suspend_tlb_flush(struct kvm_vcpu *vcpu, int vcpu_id) Can we make parameter's name 'waiting_on_vcpu_id' as well? Because as-is I'm getting confused which CPU of these two is actually getting suspended) Also, why do we need '_tlb_flush' in the name? The mechanism seems to be fairly generic, it's just that we use it for TLB flushes. > +{ > + /* waiting_on's store should happen before suspended's */ > + WRITE_ONCE(vcpu->arch.hyperv->waiting_on, vcpu_id); > + WRITE_ONCE(vcpu->arch.hyperv->suspended, true); > +} > + > +void kvm_hv_vcpu_unsuspend_tlb_flush(struct kvm_vcpu *vcpu) And here someone may expect this means 'unsuspend vcpu' but in reality this means 'unsuspend all vCPUs which are waiting on 'vcpu'). I guess we need a rename. How about void kvm_hv_unsuspend_vcpus(struct kvm_vcpu *waiting_on_vcpu) ? > +{ > + DECLARE_BITMAP(vcpu_mask, KVM_MAX_VCPUS); > + struct kvm_vcpu_hv *vcpu_hv; > + struct kvm_vcpu *v; > + unsigned long i; > + > + kvm_for_each_vcpu(i, v, vcpu->kvm) { > + vcpu_hv = to_hv_vcpu(v); > + > + if (kvm_hv_vcpu_suspended(v) && > + READ_ONCE(vcpu_hv->waiting_on) == vcpu->vcpu_id) { > + /* waiting_on's store should happen before suspended's */ > + WRITE_ONCE(v->arch.hyperv->waiting_on, -1); > + WRITE_ONCE(v->arch.hyperv->suspended, false); > + __set_bit(i, vcpu_mask); > + } > + } > + > + kvm_make_vcpus_request_mask(vcpu->kvm, KVM_REQ_EVENT, vcpu_mask); > +} > diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h > index 913bfc96959c..a55832cea221 100644 > --- a/arch/x86/kvm/hyperv.h > +++ b/arch/x86/kvm/hyperv.h > @@ -265,6 +265,15 @@ static inline void kvm_hv_nested_transtion_tlb_flush(struct kvm_vcpu *vcpu, > } > > int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu); > + > +static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu) > +{ > + return vcpu->arch.hyperv_enabled && > + READ_ONCE(vcpu->arch.hyperv->suspended); I don't think READ_ONCE() means anything here, does it? > +} > + > +void kvm_hv_vcpu_suspend_tlb_flush(struct kvm_vcpu *vcpu, int vcpu_id); > +void kvm_hv_vcpu_unsuspend_tlb_flush(struct kvm_vcpu *vcpu); > #else /* CONFIG_KVM_HYPERV */ > static inline void kvm_hv_setup_tsc_page(struct kvm *kvm, > struct pvclock_vcpu_time_info *hv_clock) {} > @@ -321,6 +330,14 @@ static inline u32 kvm_hv_get_vpindex(struct kvm_vcpu *vcpu) > return vcpu->vcpu_idx; > } > static inline void kvm_hv_nested_transtion_tlb_flush(struct kvm_vcpu *vcpu, bool tdp_enabled) {} > + > +static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu) > +{ > + return false; > +} > + > +static inline void kvm_hv_vcpu_suspend_tlb_flush(struct kvm_vcpu *vcpu, int vcpu_id) {} > +static inline void kvm_hv_vcpu_unsuspend_tlb_flush(struct kvm_vcpu *vcpu) {} > #endif /* CONFIG_KVM_HYPERV */ > > #endif /* __ARCH_X86_KVM_HYPERV_H__ */ > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 15080385b8fe..18d0a300e79a 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -135,6 +135,8 @@ static void store_regs(struct kvm_vcpu *vcpu); > static int sync_regs(struct kvm_vcpu *vcpu); > static int kvm_vcpu_do_singlestep(struct kvm_vcpu *vcpu); > > +static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu); > + > static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2); > static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2); > > @@ -11107,7 +11109,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > static bool kvm_vcpu_running(struct kvm_vcpu *vcpu) > { > return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE && > - !vcpu->arch.apf.halted); > + !vcpu->arch.apf.halted && !kvm_hv_vcpu_suspended(vcpu)); > } > > static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
On 10.10.24 10:57, Vitaly Kuznetsov wrote: > Nikolas Wipper <nikwip@amazon.de> writes: >> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h >> index 46e0a466d7fb..7571ac578884 100644 >> --- a/arch/x86/include/asm/kvm_host.h >> +++ b/arch/x86/include/asm/kvm_host.h >> @@ -695,6 +695,9 @@ struct kvm_vcpu_hv { >> u64 vm_id; >> u32 vp_id; >> } nested; >> + >> + bool suspended; >> + int waiting_on; > > I don't quite understand why we need 'suspended' at all. Isn't it always > suspended when 'waiting_on != -1'? I can see we always update these two > in pair. > This is mainly for future proofing the implementation. You are right, this is currently not required, but it's nice to have a single flags, so that when the suspended state is used in a different context, the whole logic surrounding it still works. > Also, I would suggest we use a more descriptive > name. 'waiting_on_vcpu_id', for example. > Sounds good. >> }; >> >> struct kvm_hypervisor_cpuid { >> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c >> index 4f0a94346d00..6e7941ed25ae 100644 >> --- a/arch/x86/kvm/hyperv.c >> +++ b/arch/x86/kvm/hyperv.c >> @@ -971,6 +971,7 @@ int kvm_hv_vcpu_init(struct kvm_vcpu *vcpu) >> >> vcpu->arch.hyperv = hv_vcpu; >> hv_vcpu->vcpu = vcpu; >> + hv_vcpu->waiting_on = -1; >> >> synic_init(&hv_vcpu->synic); >> >> @@ -2915,3 +2916,32 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid, >> >> return 0; >> } >> + >> +void kvm_hv_vcpu_suspend_tlb_flush(struct kvm_vcpu *vcpu, int vcpu_id) > > Can we make parameter's name 'waiting_on_vcpu_id' as well? Because as-is > I'm getting confused which CPU of these two is actually getting > suspended) > Yup, that would certainly help readability. > Also, why do we need '_tlb_flush' in the name? The mechanism seems to be > fairly generic, it's just that we use it for TLB flushes. > The 'waiting_on' part is TLB flushing specific. >> +{ >> + /* waiting_on's store should happen before suspended's */ >> + WRITE_ONCE(vcpu->arch.hyperv->waiting_on, vcpu_id); >> + WRITE_ONCE(vcpu->arch.hyperv->suspended, true); >> +} >> + >> +void kvm_hv_vcpu_unsuspend_tlb_flush(struct kvm_vcpu *vcpu) > > And here someone may expect this means 'unsuspend vcpu' but in reality > this means 'unsuspend all vCPUs which are waiting on 'vcpu'). I guess we > need a rename. How about > > void kvm_hv_unsuspend_vcpus(struct kvm_vcpu *waiting_on_vcpu) > > ? > Also sounds good. >> +{ >> + DECLARE_BITMAP(vcpu_mask, KVM_MAX_VCPUS); >> + struct kvm_vcpu_hv *vcpu_hv; >> + struct kvm_vcpu *v; >> + unsigned long i; >> + >> + kvm_for_each_vcpu(i, v, vcpu->kvm) { >> + vcpu_hv = to_hv_vcpu(v); >> + >> + if (kvm_hv_vcpu_suspended(v) && >> + READ_ONCE(vcpu_hv->waiting_on) == vcpu->vcpu_id) { >> + /* waiting_on's store should happen before suspended's */ >> + WRITE_ONCE(v->arch.hyperv->waiting_on, -1); >> + WRITE_ONCE(v->arch.hyperv->suspended, false); >> + __set_bit(i, vcpu_mask); >> + } >> + } >> + >> + kvm_make_vcpus_request_mask(vcpu->kvm, KVM_REQ_EVENT, vcpu_mask); >> +} >> diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h >> index 913bfc96959c..a55832cea221 100644 >> --- a/arch/x86/kvm/hyperv.h >> +++ b/arch/x86/kvm/hyperv.h >> @@ -265,6 +265,15 @@ static inline void kvm_hv_nested_transtion_tlb_flush(struct kvm_vcpu *vcpu, >> } >> >> int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu); >> + >> +static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu) >> +{ >> + return vcpu->arch.hyperv_enabled && >> + READ_ONCE(vcpu->arch.hyperv->suspended); > > I don't think READ_ONCE() means anything here, does it? > It does prevent compiler optimisations and is actually required[1]. Also it makes clear that this variable is shared, and may be accessed from remote CPUs. [1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0124r6.html#Variable%20Access Nikolas
Nikolas Wipper <nik.wipper@gmx.de> writes: > On 10.10.24 10:57, Vitaly Kuznetsov wrote: ... >>> int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu); >>> + >>> +static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu) >>> +{ >>> + return vcpu->arch.hyperv_enabled && >>> + READ_ONCE(vcpu->arch.hyperv->suspended); >> >> I don't think READ_ONCE() means anything here, does it? >> > > It does prevent compiler optimisations and is actually required[1]. Also > it makes clear that this variable is shared, and may be accessed from > remote CPUs. > > [1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0124r6.html#Variable%20Access It certainly does no harm but I think if we follow 'Loads from and stores to shared (but non-atomic) variables should be protected with the READ_ONCE(), WRITE_ONCE()' rule literally we will need to sprinkle them all over KVM/kernel ;-) And personally, this makes reading the code harder. To my (very limited) knowledge, we really need READ_ONCE()s when we need to have some sort of a serialization, e.g. the moment when this read happens actually makes a difference. If we can e.g. use a local variable in the beginning of a function and replace all READ_ONCE()s with reading this local variable -- then we don't need READ_ONCE()s and are OK with possible compiler optimizations. Similar (reversed) thoughts go to WRITE_ONCE(). I think it's OK to keep them but it would be nice (not mandatory IMO, but nice) to have a comment describing which particular synchronization we are achieving (== the compiler optimization scenario we are protecting against). In this particular case, kvm_hv_vcpu_suspended() is inline so I briefly looked at all kvm_hv_vcpu_suspended() call sites (there are three) in your series but couldn't think of a place where the READ_ONCE() makes a real difference. kvm_hv_hypercall_complete() looks pretty safe anyway. kvm_hv_vcpu_unsuspend_tlb_flush() will be simplified significantly if we merge 'suspended' with 'waiting_on': instead of kvm_for_each_vcpu(i, v, vcpu->kvm) { vcpu_hv = to_hv_vcpu(v); if (kvm_hv_vcpu_suspended(v) && READ_ONCE(vcpu_hv->waiting_on) == vcpu->vcpu_id) { ... you will have just kvm_for_each_vcpu(i, v, vcpu->kvm) { vcpu_hv = to_hv_vcpu(v); if (vcpu_hv && vcpu_hv->waiting_on == vcpu->vcpu_id) { ... (and yes, I also think that READ_ONCE() is superfluous here, as real (non-speculative) write below can't happen _before_ the check ) The last one, kvm_vcpu_running(), should also be indifferent to READ_ONCE() in kvm_hv_vcpu_suspended(). I may had missed something, of course, but I hope you got my line of thought.
On Tue, Oct 15, 2024, Vitaly Kuznetsov wrote: > Nikolas Wipper <nik.wipper@gmx.de> writes: > > > On 10.10.24 10:57, Vitaly Kuznetsov wrote: > > ... > > >>> int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu); > >>> + > >>> +static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu) > >>> +{ > >>> + return vcpu->arch.hyperv_enabled && > >>> + READ_ONCE(vcpu->arch.hyperv->suspended); > >> > >> I don't think READ_ONCE() means anything here, does it? > >> > > > > It does prevent compiler optimisations and is actually required[1]. Also > > it makes clear that this variable is shared, and may be accessed from > > remote CPUs. > > > > [1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0124r6.html#Variable%20Access > > It certainly does no harm but I think if we follow 'Loads from and > stores to shared (but non-atomic) variables should be protected with the > READ_ONCE(), WRITE_ONCE()' rule literally we will need to sprinkle them > all over KVM/kernel ;-) And personally, this makes reading the code > harder. > > To my (very limited) knowledge, we really need READ_ONCE()s when we need > to have some sort of a serialization, e.g. the moment when this read > happens actually makes a difference. If we can e.g. use a local variable > in the beginning of a function and replace all READ_ONCE()s with > reading this local variable -- then we don't need READ_ONCE()s and are > OK with possible compiler optimizations. Similar (reversed) thoughts go > to WRITE_ONCE(). > > I think it's OK to keep them but it would be nice (not mandatory IMO, > but nice) to have a comment describing which particular synchronization > we are achieving (== the compiler optimization scenario we are protecting > against). > > In this particular case, kvm_hv_vcpu_suspended() is inline so I briefly > looked at all kvm_hv_vcpu_suspended() call sites (there are three) in > your series but couldn't think of a place where the READ_ONCE() makes a > real difference. kvm_hv_hypercall_complete() looks pretty safe > anyway. kvm_hv_vcpu_unsuspend_tlb_flush() will be simplified > significantly if we merge 'suspended' with 'waiting_on': instead of > > kvm_for_each_vcpu(i, v, vcpu->kvm) { > vcpu_hv = to_hv_vcpu(v); > > if (kvm_hv_vcpu_suspended(v) && > READ_ONCE(vcpu_hv->waiting_on) == vcpu->vcpu_id) { > ... > > you will have just > > kvm_for_each_vcpu(i, v, vcpu->kvm) { > vcpu_hv = to_hv_vcpu(v); > > if (vcpu_hv && vcpu_hv->waiting_on == vcpu->vcpu_id) { > ... > (and yes, I also think that READ_ONCE() is superfluous here, as real > (non-speculative) write below can't happen _before_ the check ) > > The last one, kvm_vcpu_running(), should also be indifferent to > READ_ONCE() in kvm_hv_vcpu_suspended(). I may had missed something, of > course, but I hope you got my line of thought. I don't think you're missing anything. In general, all of this code is more than a bit heavy-handed and lacks any kind of precision, which makes it *really* hard to see what actually guarantees a vCPU won't get stuck blocking. Writers synchronize SRCU and readers are required to acquire SRCU, but there's no actual data tagged as being protected by SRCU, i.e. tlb_flush_inhibit should be __rcu. All of the {READ,WRITE}_ONCE() stuff provides some implicit compiler barriers, but the actual protection to ensure a vCPU either observes inhibit=false or a wake event is provided by the smp_wmb() in __kvm_make_request(). And from a performance perspective, synchronizing on kvm->srcu is going to be susceptible to random slowdowns, because writers will have to wait until all vCPUs drop SRCU, even if they have nothing to do with PV TLB flushes. E.g. if vCPUs are faulting in memory from swap, uninhibiting a TLB flushes could be stalled unnecessarily for an extended duration. Lastly, KVM_REQ_EVENT is a big hammer (triggers a lot of processing) and semantically misleading (there is no event to process). At a glance, KVM_REQ_UNBLOCK is likely more appropriate. Before we spend too much time cleaning things up, I want to first settle on the overall design, because it's not clear to me that punting HvTranslateVirtualAddress to userspace is a net positive. We agreed that VTLs should be modeled primarily in userspace, but that doesn't automatically make punting everything to userspace the best option, especially given the discussion at KVM Forum with respect to mplementing VTLs, VMPLs, TD partitions, etc. The cover letters for this series and KVM_TRANSLATE2 simply say they're needed for HvTranslateVirtualAddress, but neither series nor Nicolas' patch to punt HVCALL_TRANSLATE_VIRTUAL_ADDRESS[*] justifies the split between userspace and KVM. And it very much is a split, because there are obviously a lot of details around TlbFlushInhibit that bleed into KVM. Side topic, what actually clears HvRegisterInterceptSuspend.TlbFlushInhibit? The TLFS just says After the memory intercept routine performs instruction completion, it should clear the TlbFlushInhibit bit of the HvRegisterInterceptSuspend register. but I can't find anything that says _how_ it clears TlbFlushInhibit. [*] https://lore.kernel.org/all/20240609154945.55332-8-nsaenz@amazon.com
Hi Sean, On Tue Oct 15, 2024 at 3:58 PM UTC, Sean Christopherson wrote: > Before we spend too much time cleaning things up, I want to first settle on the > overall design, because it's not clear to me that punting HvTranslateVirtualAddress > to userspace is a net positive. We agreed that VTLs should be modeled primarily > in userspace, but that doesn't automatically make punting everything to userspace > the best option, especially given the discussion at KVM Forum with respect to > mplementing VTLs, VMPLs, TD partitions, etc. Since you mention it, Paolo said he was going to prep a doc with an overview of the design we discussed there. Was it published? Did I miss it? Thanks, Nicolas
On 15.10.24 17:58, Sean Christopherson wrote: > ... > > And from a performance perspective, synchronizing on kvm->srcu is going to be > susceptible to random slowdowns, because writers will have to wait until all vCPUs > drop SRCU, even if they have nothing to do with PV TLB flushes. E.g. if vCPUs > are faulting in memory from swap, uninhibiting a TLB flushes could be stalled > unnecessarily for an extended duration. > This should be an easy fix, right? Just create an SRCU only for the TLB flushes only. > Lastly, KVM_REQ_EVENT is a big hammer (triggers a lot of processing) and semantically > misleading (there is no event to process). At a glance, KVM_REQ_UNBLOCK is likely > more appropriate. > > Before we spend too much time cleaning things up, I want to first settle on the > overall design, because it's not clear to me that punting HvTranslateVirtualAddress > to userspace is a net positive. We agreed that VTLs should be modeled primarily > in userspace, but that doesn't automatically make punting everything to userspace > the best option, especially given the discussion at KVM Forum with respect to > mplementing VTLs, VMPLs, TD partitions, etc. > I wasn't at the discussion, so maybe I'm missing something, but the hypercall still needs VTL awareness. For one, it is primarily executed from VTL0 and primarily targets VTL1 (primarily here means "thats what I see when I boot Windows Server 2019"), so it would need to know which vCPU is the corresponding VTL (this assumes one vCPU per VTL, as in the QEMU implementation). To make matters worse, the hypercall can also arbitrarily choose to target a different VP. This would require a way to map (VP index, VTL) -> (vcpu_id) within KVM. > The cover letters for this series and KVM_TRANSLATE2 simply say they're needed > for HvTranslateVirtualAddress, but neither series nor Nicolas' patch to punt > HVCALL_TRANSLATE_VIRTUAL_ADDRESS[*] justifies the split between userspace and > KVM. And it very much is a split, because there are obviously a lot of details > around TlbFlushInhibit that bleed into KVM. > > Side topic, what actually clears HvRegisterInterceptSuspend.TlbFlushInhibit? The > TLFS just says > > After the memory intercept routine performs instruction completion, it should > clear the TlbFlushInhibit bit of the HvRegisterInterceptSuspend register. > > but I can't find anything that says _how_ it clears TlbFlushInhibit. > The register cannot be accessed using the HvSetVpRegisters hypercall, but the TLFS talks about it elsewhere. I'm assuming this is a formatting issue (there are a few elsewhere). In 15.5.1.3 it says To unlock the TLB, the higher VTL can clear this bit. Also, once a VP returns to a lower VTL, it releases all TLB locks which it holds at the time. The QEMU implementation also just uninhibits on intercept exit, and that, at least, does not crash. Nikolas
On Tue, Oct 15, 2024, Nicolas Saenz Julienne wrote: > Hi Sean, > > On Tue Oct 15, 2024 at 3:58 PM UTC, Sean Christopherson wrote: > > Before we spend too much time cleaning things up, I want to first settle on the > > overall design, because it's not clear to me that punting HvTranslateVirtualAddress > > to userspace is a net positive. We agreed that VTLs should be modeled primarily > > in userspace, but that doesn't automatically make punting everything to userspace > > the best option, especially given the discussion at KVM Forum with respect to > > mplementing VTLs, VMPLs, TD partitions, etc. > > Since you mention it, Paolo said he was going to prep a doc with an > overview of the design we discussed there. Was it published? Did I miss > it? Nope, we're all hitting F5 mercilessly :-)
On Tue, Oct 15, 2024, Nikolas Wipper wrote: > On 15.10.24 17:58, Sean Christopherson wrote: > > ... > > > > And from a performance perspective, synchronizing on kvm->srcu is going to be > > susceptible to random slowdowns, because writers will have to wait until all vCPUs > > drop SRCU, even if they have nothing to do with PV TLB flushes. E.g. if vCPUs > > are faulting in memory from swap, uninhibiting a TLB flushes could be stalled > > unnecessarily for an extended duration. > > > > This should be an easy fix, right? Just create an SRCU only for the TLB flushes only. Yes, this is a very solvable problem. But while SRCU objects aren't expensive, they aren't entirely free either. > > Lastly, KVM_REQ_EVENT is a big hammer (triggers a lot of processing) and semantically > > misleading (there is no event to process). At a glance, KVM_REQ_UNBLOCK is likely > > more appropriate. > > > > Before we spend too much time cleaning things up, I want to first settle on the > > overall design, because it's not clear to me that punting HvTranslateVirtualAddress > > to userspace is a net positive. We agreed that VTLs should be modeled primarily > > in userspace, but that doesn't automatically make punting everything to userspace > > the best option, especially given the discussion at KVM Forum with respect to > > mplementing VTLs, VMPLs, TD partitions, etc. > > > > I wasn't at the discussion, so maybe I'm missing something, but the hypercall > still needs VTL awareness. Yeah, the KVM Forum discussion is relevant, because one of the key takeaways from that discussion was that KVM will need some amount of VTL awareness. > For one, it is primarily executed from VTL0 and primarily targets VTL1 > (primarily here means "thats what I see when I boot Windows Server 2019"), so > it would need to know which vCPU is the corresponding VTL (this assumes one > vCPU per VTL, as in the QEMU implementation). Right, but even without the takeways from KVM Forum, we need to look at big picture and come up with a design that makes the most sense. E.g. if making KVM aware of "struct kvm" objects that represent different VTLs in the same VM greatly simplifies supporting HvTranslateVirtualAddress, then it's likely worth doing. > To make matters worse, the hypercall can also arbitrarily choose to target a > different VP. This would require a way to map (VP index, VTL) -> (vcpu_id) > within KVM. I don't think so. The TLFS definition for TlbFlushInhibit give KVM a _lot_ of wiggle room, e.g. KVM can retry the hypercall as many times as necessary. To honor TlbFlushInhibit, KVM just needs to ensure that flushes are blocked if any vCPU at the target VTL is blocking flushes. And to avoid hanging a vCPU, KVM only needs to ensure a vCPU is awakened if it _might_ be able to make forward progress. I.e. I don't think KVM needs to be super precise when waking blocking vCPUs, and thus there's no need to precisely track who is blocking whom. I think :-) > > The cover letters for this series and KVM_TRANSLATE2 simply say they're needed > > for HvTranslateVirtualAddress, but neither series nor Nicolas' patch to punt > > HVCALL_TRANSLATE_VIRTUAL_ADDRESS[*] justifies the split between userspace and > > KVM. And it very much is a split, because there are obviously a lot of details > > around TlbFlushInhibit that bleed into KVM. > > > > Side topic, what actually clears HvRegisterInterceptSuspend.TlbFlushInhibit? The > > TLFS just says > > > > After the memory intercept routine performs instruction completion, it should > > clear the TlbFlushInhibit bit of the HvRegisterInterceptSuspend register. > > > > but I can't find anything that says _how_ it clears TlbFlushInhibit. > > > > The register cannot be accessed using the HvSetVpRegisters hypercall, but the TLFS > talks about it elsewhere. I'm assuming this is a formatting issue (there are a few > elsewhere). In 15.5.1.3 it says > > To unlock the TLB, the higher VTL can clear this bit. Also, once a VP returns > to a lower VTL, it releases all TLB locks which it holds at the time. > > The QEMU implementation also just uninhibits on intercept exit, and that, at least, > does not crash. Hmm, it would be nice to bottom out on whether the higher VLT or the VMM/hypervisor is responsible for clearing TlbFlushInhibit, because that may influence KVM's design.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 46e0a466d7fb..7571ac578884 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -695,6 +695,9 @@ struct kvm_vcpu_hv { u64 vm_id; u32 vp_id; } nested; + + bool suspended; + int waiting_on; }; struct kvm_hypervisor_cpuid { diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index 4f0a94346d00..6e7941ed25ae 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -971,6 +971,7 @@ int kvm_hv_vcpu_init(struct kvm_vcpu *vcpu) vcpu->arch.hyperv = hv_vcpu; hv_vcpu->vcpu = vcpu; + hv_vcpu->waiting_on = -1; synic_init(&hv_vcpu->synic); @@ -2915,3 +2916,32 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid, return 0; } + +void kvm_hv_vcpu_suspend_tlb_flush(struct kvm_vcpu *vcpu, int vcpu_id) +{ + /* waiting_on's store should happen before suspended's */ + WRITE_ONCE(vcpu->arch.hyperv->waiting_on, vcpu_id); + WRITE_ONCE(vcpu->arch.hyperv->suspended, true); +} + +void kvm_hv_vcpu_unsuspend_tlb_flush(struct kvm_vcpu *vcpu) +{ + DECLARE_BITMAP(vcpu_mask, KVM_MAX_VCPUS); + struct kvm_vcpu_hv *vcpu_hv; + struct kvm_vcpu *v; + unsigned long i; + + kvm_for_each_vcpu(i, v, vcpu->kvm) { + vcpu_hv = to_hv_vcpu(v); + + if (kvm_hv_vcpu_suspended(v) && + READ_ONCE(vcpu_hv->waiting_on) == vcpu->vcpu_id) { + /* waiting_on's store should happen before suspended's */ + WRITE_ONCE(v->arch.hyperv->waiting_on, -1); + WRITE_ONCE(v->arch.hyperv->suspended, false); + __set_bit(i, vcpu_mask); + } + } + + kvm_make_vcpus_request_mask(vcpu->kvm, KVM_REQ_EVENT, vcpu_mask); +} diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h index 913bfc96959c..a55832cea221 100644 --- a/arch/x86/kvm/hyperv.h +++ b/arch/x86/kvm/hyperv.h @@ -265,6 +265,15 @@ static inline void kvm_hv_nested_transtion_tlb_flush(struct kvm_vcpu *vcpu, } int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu); + +static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu) +{ + return vcpu->arch.hyperv_enabled && + READ_ONCE(vcpu->arch.hyperv->suspended); +} + +void kvm_hv_vcpu_suspend_tlb_flush(struct kvm_vcpu *vcpu, int vcpu_id); +void kvm_hv_vcpu_unsuspend_tlb_flush(struct kvm_vcpu *vcpu); #else /* CONFIG_KVM_HYPERV */ static inline void kvm_hv_setup_tsc_page(struct kvm *kvm, struct pvclock_vcpu_time_info *hv_clock) {} @@ -321,6 +330,14 @@ static inline u32 kvm_hv_get_vpindex(struct kvm_vcpu *vcpu) return vcpu->vcpu_idx; } static inline void kvm_hv_nested_transtion_tlb_flush(struct kvm_vcpu *vcpu, bool tdp_enabled) {} + +static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu) +{ + return false; +} + +static inline void kvm_hv_vcpu_suspend_tlb_flush(struct kvm_vcpu *vcpu, int vcpu_id) {} +static inline void kvm_hv_vcpu_unsuspend_tlb_flush(struct kvm_vcpu *vcpu) {} #endif /* CONFIG_KVM_HYPERV */ #endif /* __ARCH_X86_KVM_HYPERV_H__ */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 15080385b8fe..18d0a300e79a 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -135,6 +135,8 @@ static void store_regs(struct kvm_vcpu *vcpu); static int sync_regs(struct kvm_vcpu *vcpu); static int kvm_vcpu_do_singlestep(struct kvm_vcpu *vcpu); +static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu); + static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2); static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2); @@ -11107,7 +11109,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) static bool kvm_vcpu_running(struct kvm_vcpu *vcpu) { return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE && - !vcpu->arch.apf.halted); + !vcpu->arch.apf.halted && !kvm_hv_vcpu_suspended(vcpu)); } static bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
Introduce a suspension state for Hyper-V enlightened vCPUs. Microsoft's "Hypervisor Top Level Functional Specification" (TLFS) introduces this state as a "vCPU that is stopped on a instruction guest boundary, either explicitly or implicitly due to an intercept". The only documented explicit suspension is caused in response to a TLB Flush hypercall, which will use the state switching API in subsequent patches. Each Hyper-V vCPU now has a 'suspended' boolean field, which is checked before entering the guest. When set, it forces the vCPU to block. Once in that state, the vCPU ignores any events. The vCPU is unsuspended by clearing 'suspend' and issuing a request to force the vCPU out of sleep. Suspensions are issued as a mechanism to halt execution until state change is observed on a remote vCPU. Hyper-V vCPUs track this with 'waiting_on', which holds the 'vcpu_id' of the remote vCPU that forced the vCPU to enter the suspended state. It's the remote vCPU's responsibility to wake up the suspended vCPUs when ready. 'waiting_on' ensures the remote vCPU can selectively unsuspend vCPUs that blocked on its behalf while leaving other suspended vCPUs undisturbed. One vCPU can only be suspended due to a single remote vCPU, but different vCPUs can be suspended on behalf of different or the same remote vCPU(s). The guest is responsible for avoiding circular dependencies between suspended vCPUs. Callers of the suspend API are responsible for ensuring that suspend and unsuspend aren't called in parallel while targeting the same pair of vCPUs. Otherwise kvm_hv_vcpu_{un}suspend_tlb_flush() ensure 'waiting_on' and 'suspended' are updated and accessed in the correct order. This, for example, avoids races where the unsuspended vCPU re-suspends before kvm_hv_vcpu_unsuspend_tlb_flush() is done updating 'waiting_on'. Signed-off-by: Nikolas Wipper <nikwip@amazon.de> --- arch/x86/include/asm/kvm_host.h | 3 +++ arch/x86/kvm/hyperv.c | 30 ++++++++++++++++++++++++++++++ arch/x86/kvm/hyperv.h | 17 +++++++++++++++++ arch/x86/kvm/x86.c | 4 +++- 4 files changed, 53 insertions(+), 1 deletion(-)