Message ID | 20250106155652.484001-1-kentaishiguro@sslab.ics.keio.ac.jp
---|---
State | New, archived
Series | [RFC] Para-virtualized TLB flush for PV-waiting vCPUs
On Tue, Jan 07, 2025, Kenta Ishiguro wrote:
> Signed-off-by: Kenta Ishiguro <kentaishiguro@sslab.ics.keio.ac.jp>
> ---
>  arch/x86/include/uapi/asm/kvm_para.h |  1 +
>  arch/x86/kernel/kvm.c                | 13 +++++++++++--
>  2 files changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index a1efa7907a0b..db26e167a707 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -70,6 +70,7 @@ struct kvm_steal_time {
> 
>  #define KVM_VCPU_PREEMPTED          (1 << 0)
>  #define KVM_VCPU_FLUSH_TLB          (1 << 1)
> +#define KVM_VCPU_IN_PVWAIT          (1 << 2)
> 
>  #define KVM_CLOCK_PAIRING_WALLCLOCK 0
>  struct kvm_clock_pairing {
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 21e9e4845354..f17057b7d263 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -668,7 +668,8 @@ static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
>                   */
>                  src = &per_cpu(steal_time, cpu);
>                  state = READ_ONCE(src->preempted);
> -                if ((state & KVM_VCPU_PREEMPTED)) {
> +                if ((state & KVM_VCPU_PREEMPTED) ||
> +                    (state & KVM_VCPU_IN_PVWAIT)) {
>                          if (try_cmpxchg(&src->preempted, &state,
>                                          state | KVM_VCPU_FLUSH_TLB))

This is unsafe.  KVM_VCPU_PREEMPTED relies on it being set and cleared by the
host to ensure either the host observes KVM_VCPU_FLUSH_TLB before the vCPU
enters the guest, i.e. before the vCPU can possibly consume stale TLB entries.
If the host is already in the process of re-entering the guest on the waiting
vCPU, e.g. because it received a kick, then there will be no
KVM_REQ_STEAL_UPDATE in the host and so the host won't process
KVM_VCPU_FLUSH_TLB.

I also see no reason to limit this to kvm_wait(); the logic you are relying on
is really just "let KVM do the flushes if the vCPU is in the host".  There's a
balance to be had, e.g. toggling a flag on every entry+exit would get expensive,
but toggling at kvm_arch_vcpu_put() and then letting kvm_arch_vcpu_load()
request a steal_time update seems pretty straightforward.  E.g. this will also
elide IPIs if the vCPU happens to be in host userspace doing emulation of some
kind, or if the vCPU is blocking.
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index a1efa7907a0b..e3a6e6ecf70b 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -70,6 +70,7 @@ struct kvm_steal_time {
 
 #define KVM_VCPU_PREEMPTED          (1 << 0)
 #define KVM_VCPU_FLUSH_TLB          (1 << 1)
+#define KVM_VCPU_IN_HOST            (1 << 2)
 
 #define KVM_CLOCK_PAIRING_WALLCLOCK 0
 struct kvm_clock_pairing {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index acdd72e89bb0..5e3dc209e86c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5018,8 +5018,11 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
         struct gfn_to_hva_cache *ghc = &vcpu->arch.st.cache;
         struct kvm_steal_time __user *st;
         struct kvm_memslots *slots;
-        static const u8 preempted = KVM_VCPU_PREEMPTED;
         gpa_t gpa = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
+        u8 preempted = KVM_VCPU_IN_HOST;
+
+        if (vcpu->preempted)
+                preempted |= KVM_VCPU_PREEMPTED;
 
         /*
          * The vCPU can be marked preempted if and only if the VM-Exit was on
@@ -5037,7 +5040,7 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
         if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
                 return;
 
-        if (vcpu->arch.st.preempted)
+        if (vcpu->arch.st.preempted == preempted)
                 return;
 
         /* This happens on process exit */
@@ -5055,7 +5058,7 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
         BUILD_BUG_ON(sizeof(st->preempted) != sizeof(preempted));
 
         if (!copy_to_user_nofault(&st->preempted, &preempted, sizeof(preempted)))
-                vcpu->arch.st.preempted = KVM_VCPU_PREEMPTED;
+                vcpu->arch.st.preempted = preempted;
 
         mark_page_dirty_in_slot(vcpu->kvm, ghc->memslot, gpa_to_gfn(ghc->gpa));
 }
@@ -5064,7 +5067,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
         int idx;
 
-        if (vcpu->preempted) {
+        if (vcpu->preempted || !kvm_xen_msr_enabled(vcpu->kvm)) {
                 /*
                  * Assume protected guests are in-kernel.  Inefficient yielding
                  * due to false positives is preferable to never yielding due

>                                  __cpumask_clear_cpu(cpu, flushmask);
> @@ -1045,6 +1046,9 @@ static void kvm_kick_cpu(int cpu)
> 
>  static void kvm_wait(u8 *ptr, u8 val)
>  {
> +        u8 state;
> +        struct kvm_steal_time *src;
> +
>          if (in_nmi())
>                  return;
> 
> @@ -1054,8 +1058,13 @@ static void kvm_wait(u8 *ptr, u8 val)
>           * in irq spinlock slowpath and no spurious interrupt occur to save us.
>           */
>          if (irqs_disabled()) {
> -                if (READ_ONCE(*ptr) == val)
> +                if (READ_ONCE(*ptr) == val) {
> +                        src = this_cpu_ptr(&steal_time);
> +                        state = READ_ONCE(src->preempted);
> +                        try_cmpxchg(&src->preempted, &state,
> +                                    state | KVM_VCPU_IN_PVWAIT);
>                          halt();
> +                }
>          } else {
>                  local_irq_disable();
> 
> --
> 2.25.1
> 
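The objection above hinges on when the host actually consumes
KVM_VCPU_FLUSH_TLB: the flag is only examined when a steal_time update
(KVM_REQ_STEAL_UPDATE, requested from kvm_arch_vcpu_load()) is processed
before VM-entry, so a vCPU that halted voluntarily in kvm_wait() and is simply
re-entered after a kick may never look at it. The following is a simplified,
illustrative sketch of that host-side consumption, not the upstream
record_steal_time() implementation; map_steal_time() is a hypothetical
stand-in for KVM's real guest-memory accessors.

/*
 * Illustrative sketch only: how the host honors KVM_VCPU_FLUSH_TLB when it
 * processes a steal_time update prior to VM-entry.  map_steal_time() is a
 * hypothetical helper standing in for the gfn_to_hva_cache based access.
 */
static void consume_flush_tlb_hint(struct kvm_vcpu *vcpu)
{
        struct kvm_steal_time *st = map_steal_time(vcpu);
        u8 prev;

        /* Atomically snapshot and clear the guest-visible flag byte. */
        prev = xchg(&st->preempted, 0);

        /* Flush before entering the guest if a remote vCPU asked for it. */
        if (prev & KVM_VCPU_FLUSH_TLB)
                kvm_vcpu_flush_tlb_guest(vcpu);
}

If this path is not taken before re-entry, a flag set by a remote vCPU after
the host has already decided to resume the guest is consumed too late, which
is the race described above.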
Thank you for your comments.
I understood the scenario of why my previous change was unsafe.

Also, I like the concept of KVM_VCPU_IN_HOST for tracking whether a vCPU is
scheduled out because it can be helpful to understand the vCPU's situation
from guests.

I have tested the attached changes, but I found that the performance
improvement was somewhat limited. The boundary-checking code prevents
KVM from setting KVM_VCPU_IN_HOST and KVM_VCPU_PREEMPTED, which may be
contributing to this limitation. I think this is a conservative
approach to avoid using stale TLB entries.
I referred to this patch:
https://lore.kernel.org/lkml/20220614021116.1101331-1-sashal@kernel.org/
Since PV waiting causes the most significant overhead, is it possible to
allow guests to perform PV flush if vCPUs are PV waiting and scheduled out?

> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index a1efa7907a0b..e3a6e6ecf70b 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -70,6 +70,7 @@ struct kvm_steal_time {
> 
>  #define KVM_VCPU_PREEMPTED          (1 << 0)
>  #define KVM_VCPU_FLUSH_TLB          (1 << 1)
> +#define KVM_VCPU_IN_HOST            (1 << 2)
> 
>  #define KVM_CLOCK_PAIRING_WALLCLOCK 0
>  struct kvm_clock_pairing {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index acdd72e89bb0..5e3dc209e86c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5018,8 +5018,11 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
>          struct gfn_to_hva_cache *ghc = &vcpu->arch.st.cache;
>          struct kvm_steal_time __user *st;
>          struct kvm_memslots *slots;
> -        static const u8 preempted = KVM_VCPU_PREEMPTED;
>          gpa_t gpa = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
> +        u8 preempted = KVM_VCPU_IN_HOST;
> +
> +        if (vcpu->preempted)
> +                preempted |= KVM_VCPU_PREEMPTED;
> 
>          /*
>           * The vCPU can be marked preempted if and only if the VM-Exit was on
> @@ -5037,7 +5040,7 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
>          if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
>                  return;
> 
> -        if (vcpu->arch.st.preempted)
> +        if (vcpu->arch.st.preempted == preempted)

This code may clear KVM_VCPU_FLUSH_TLB.
https://lore.kernel.org/lkml/20200803121849.869521189@linuxfoundation.org/

>                  return;
> 
>          /* This happens on process exit */
> @@ -5055,7 +5058,7 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
>          BUILD_BUG_ON(sizeof(st->preempted) != sizeof(preempted));
> 
>          if (!copy_to_user_nofault(&st->preempted, &preempted, sizeof(preempted)))
> -                vcpu->arch.st.preempted = KVM_VCPU_PREEMPTED;
> +                vcpu->arch.st.preempted = preempted;
> 
>          mark_page_dirty_in_slot(vcpu->kvm, ghc->memslot, gpa_to_gfn(ghc->gpa));
>  }
> @@ -5064,7 +5067,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  {
>          int idx;
> 
> -        if (vcpu->preempted) {
> +        if (vcpu->preempted || !kvm_xen_msr_enabled(vcpu->kvm)) {
>                  /*
>                   * Assume protected guests are in-kernel.  Inefficient yielding
>                   * due to false positives is preferable to never yielding due

---
 arch/x86/include/uapi/asm/kvm_para.h |  1 +
 arch/x86/kernel/kvm.c                |  2 +-
 arch/x86/kvm/x86.c                   | 11 +++++++----
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index a1efa7907a0b..6ef06c3f234b 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -70,6 +70,7 @@ struct kvm_steal_time {
 
 #define KVM_VCPU_PREEMPTED          (1 << 0)
 #define KVM_VCPU_FLUSH_TLB          (1 << 1)
+#define KVM_VCPU_IN_HOST            (1 << 2)
 
 #define KVM_CLOCK_PAIRING_WALLCLOCK 0
 struct kvm_clock_pairing {
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 21e9e4845354..17ea1111a158 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -668,7 +668,7 @@ static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
                  */
                 src = &per_cpu(steal_time, cpu);
                 state = READ_ONCE(src->preempted);
-                if ((state & KVM_VCPU_PREEMPTED)) {
+                if ((state & KVM_VCPU_PREEMPTED) || (state & KVM_VCPU_IN_HOST)) {
                         if (try_cmpxchg(&src->preempted, &state,
                                         state | KVM_VCPU_FLUSH_TLB))
                                 __cpumask_clear_cpu(cpu, flushmask);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c79a8cc57ba4..27b558ae1ad2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5019,8 +5019,11 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
         struct gfn_to_hva_cache *ghc = &vcpu->arch.st.cache;
         struct kvm_steal_time __user *st;
         struct kvm_memslots *slots;
-        static const u8 preempted = KVM_VCPU_PREEMPTED;
         gpa_t gpa = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
+        u8 preempted = KVM_VCPU_IN_HOST;
+
+        if (vcpu->preempted)
+                preempted |= KVM_VCPU_PREEMPTED;
 
         /*
          * The vCPU can be marked preempted if and only if the VM-Exit was on
@@ -5038,7 +5041,7 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
         if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
                 return;
 
-        if (vcpu->arch.st.preempted)
+        if (vcpu->arch.st.preempted == preempted)
                 return;
 
         /* This happens on process exit */
@@ -5056,7 +5059,7 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
         BUILD_BUG_ON(sizeof(st->preempted) != sizeof(preempted));
 
         if (!copy_to_user_nofault(&st->preempted, &preempted, sizeof(preempted)))
-                vcpu->arch.st.preempted = KVM_VCPU_PREEMPTED;
+                vcpu->arch.st.preempted = preempted;
 
         mark_page_dirty_in_slot(vcpu->kvm, ghc->memslot, gpa_to_gfn(ghc->gpa));
 }
@@ -5065,7 +5068,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
         int idx;
 
-        if (vcpu->preempted) {
+        if (vcpu->preempted || !kvm_xen_msr_enabled(vcpu->kvm)) {
                 /*
                  * Assume protected guests are in-kernel.  Inefficient yielding
                  * due to false positives is preferable to never yielding due
On Tue, Jan 07, 2025 at 12:56:52AM +0900, Kenta Ishiguro wrote:
>In oversubscribed environments, the latency of flushing the remote TLB can
>become significant when the destination virtual CPU (vCPU) is the waiter
>of a para-virtualized queued spinlock that halts with interrupts disabled.
>This occurs because the waiter does not respond to remote function call
>requests until it releases the spinlock. As a result, the source vCPU
>wastes CPU time performing busy-waiting for a response from the
>destination vCPU.
>
>To mitigate this issue, this patch extends the target of the PV TLB flush
>to include vCPUs that are halting to wait on the PV qspinlock. Since the
>PV qspinlock waiters voluntarily yield before being preempted by KVM,
>their state does not get preempted, and the current PV TLB flush overlooks
>them. This change allows vCPUs to bypass waiting for PV qspinlock waiters
>during TLB shootdowns.

This doesn't seem to be a KVM-specific problem; other hypervisors should
have the same problem. So, I think we can implement a more generic solution
w/o involving the hypervisor. e.g., the guest can track which vCPUs are
waiting on PV qspinlock, delay TLB flush on them and have those vCPUs
perform TLB flush after they complete their wait (e.g., right after the
halt() in kvm_wait()).
On Mon, Jan 20, 2025, Chao Gao wrote:
> On Tue, Jan 07, 2025 at 12:56:52AM +0900, Kenta Ishiguro wrote:
> >In oversubscribed environments, the latency of flushing the remote TLB can
> >become significant when the destination virtual CPU (vCPU) is the waiter
> >of a para-virtualized queued spinlock that halts with interrupts disabled.
> >This occurs because the waiter does not respond to remote function call
> >requests until it releases the spinlock. As a result, the source vCPU
> >wastes CPU time performing busy-waiting for a response from the
> >destination vCPU.
> >
> >To mitigate this issue, this patch extends the target of the PV TLB flush
> >to include vCPUs that are halting to wait on the PV qspinlock. Since the
> >PV qspinlock waiters voluntarily yield before being preempted by KVM,
> >their state does not get preempted, and the current PV TLB flush overlooks
> >them. This change allows vCPUs to bypass waiting for PV qspinlock waiters
> >during TLB shootdowns.
> 
> This doesn't seem to be a KVM-specific problem; other hypervisors should
> have the same problem. So, I think we can implement a more generic solution
> w/o involving the hypervisor. e.g., the guest can track which vCPUs are
> waiting on PV qspinlock, delay TLB flush on them and have those vCPUs
> perform TLB flush after they complete their wait (e.g., right after the
> halt() in kvm_wait()).

I don't think that works though.  E.g. what if the vCPU takes an NMI (guest
NMI) while waiting on the spinlock, and the NMI handler accesses the virtual
address that was supposed to be flushed?  The PV approach works because the
hypervisor can guarantee the flush will occur before the vCPU can run *any*
code.
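For comparison, a purely guest-side variant of the scheme described above
might look roughly like the sketch below. The names are hypothetical and the
memory ordering between publishing the waiting state and re-checking the lock
word is omitted; the NMI objection applies exactly here, because nothing
prevents an NMI handler from touching the stale mapping between halt() and
the deferred flush.

/*
 * Guest-only sketch (hypothetical, for discussion): waiters advertise
 * themselves, remote flushers skip them but leave a deferred-flush marker,
 * and the waiter flushes itself once the wait completes.  Ordering and
 * race handling are intentionally omitted.
 */
static DEFINE_PER_CPU(bool, pv_waiting);
static DEFINE_PER_CPU(bool, deferred_flush);

static void guest_only_wait(u8 *ptr, u8 val)
{
        this_cpu_write(pv_waiting, true);
        if (READ_ONCE(*ptr) == val)
                halt();
        this_cpu_write(pv_waiting, false);

        if (this_cpu_read(deferred_flush)) {
                this_cpu_write(deferred_flush, false);
                flush_tlb_local();      /* or a ranged flush, if recorded */
        }
}

static void guest_only_skip_waiter(int cpu, struct cpumask *flushmask)
{
        /* Caller sends flush IPIs only to CPUs still left in flushmask. */
        if (per_cpu(pv_waiting, cpu)) {
                per_cpu(deferred_flush, cpu) = true;
                __cpumask_clear_cpu(cpu, flushmask);
        }
}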
On Mon, Jan 20, 2025, Kenta Ishiguro wrote:
> Thank you for your comments.
> I understood the scenario of why my previous change was unsafe.
> 
> Also, I like the concept of KVM_VCPU_IN_HOST for tracking whether a vCPU is
> scheduled out because it can be helpful to understand the vCPU's situation
> from guests.
> 
> I have tested the attached changes, but I found that the performance
> improvement was somewhat limited. The boundary-checking code prevents
> KVM from setting KVM_VCPU_IN_HOST and KVM_VCPU_PREEMPTED, which may be
> contributing to this limitation. I think this is a conservative
> approach to avoid using stale TLB entries.
> I referred to this patch:
> https://lore.kernel.org/lkml/20220614021116.1101331-1-sashal@kernel.org/
> Since PV waiting causes the most significant overhead, is it possible to
> allow guests to perform PV flush if vCPUs are PV waiting and scheduled out?

Hmm, yes?  The "right now kvm_vcpu_check_block() is doing memory accesses"
issue was mostly addressed by commit 26844fee6ade ("KVM: x86: never write to
memory from kvm_vcpu_check_block()"):

  It would in principle be okay to report the vCPU as preempted also if it is
  sleeping in kvm_vcpu_block(): a TLB flush IPI will incur the vmentry/vmexit
  overhead unnecessarily, and optimistic spinning is also unlikely to succeed.
  However, leave it for later because right now kvm_vcpu_check_block() is
  doing memory accesses.  Even though the TLB flush issue only applies to
  virtual memory address, it's very much preferrable to be conservative.

I say mostly because there are technically reads to guest memory via cached
objects, specifically vmx_has_nested_events()'s use of the nested Posted
Interrupt Descriptor (PID).  But a vCPU's PID is referenced directly via
physical address in the VMCS, so there are no TLB flushing concerns on that
front.

I think this would be safe/correct.  Key word "think".

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5193c3dfbce1..691cf4cfe5d4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2350,16 +2350,6 @@ static inline bool kvm_irq_is_postable(struct kvm_lapic_irq *irq)
                            irq->delivery_mode == APIC_DM_LOWEST);
 }
 
-static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
-{
-        kvm_x86_call(vcpu_blocking)(vcpu);
-}
-
-static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu)
-{
-        kvm_x86_call(vcpu_unblocking)(vcpu);
-}
-
 static inline int kvm_cpu_get_apicid(int mps_cpu)
 {
 #ifdef CONFIG_X86_LOCAL_APIC
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5f3ad13a8ac7..4f8d0b000e93 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11170,6 +11170,18 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
         return kvm_vcpu_running(vcpu) || kvm_vcpu_has_events(vcpu);
 }
 
+void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
+{
+        vcpu->arch.at_instruction_boundary = true;
+        kvm_x86_call(vcpu_blocking)(vcpu);
+}
+
+void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu)
+{
+        kvm_x86_call(vcpu_unblocking)(vcpu);
+        vcpu->arch.at_instruction_boundary = false;
+}
+
 /* Called within kvm->srcu read side. */
 static inline int vcpu_block(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index a1efa7907a0b..db26e167a707 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -70,6 +70,7 @@ struct kvm_steal_time {
 
 #define KVM_VCPU_PREEMPTED          (1 << 0)
 #define KVM_VCPU_FLUSH_TLB          (1 << 1)
+#define KVM_VCPU_IN_PVWAIT          (1 << 2)
 
 #define KVM_CLOCK_PAIRING_WALLCLOCK 0
 struct kvm_clock_pairing {
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 21e9e4845354..f17057b7d263 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -668,7 +668,8 @@ static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
                  */
                 src = &per_cpu(steal_time, cpu);
                 state = READ_ONCE(src->preempted);
-                if ((state & KVM_VCPU_PREEMPTED)) {
+                if ((state & KVM_VCPU_PREEMPTED) ||
+                    (state & KVM_VCPU_IN_PVWAIT)) {
                         if (try_cmpxchg(&src->preempted, &state,
                                         state | KVM_VCPU_FLUSH_TLB))
                                 __cpumask_clear_cpu(cpu, flushmask);
@@ -1045,6 +1046,9 @@ static void kvm_kick_cpu(int cpu)
 
 static void kvm_wait(u8 *ptr, u8 val)
 {
+        u8 state;
+        struct kvm_steal_time *src;
+
         if (in_nmi())
                 return;
 
@@ -1054,8 +1058,13 @@ static void kvm_wait(u8 *ptr, u8 val)
          * in irq spinlock slowpath and no spurious interrupt occur to save us.
          */
         if (irqs_disabled()) {
-                if (READ_ONCE(*ptr) == val)
+                if (READ_ONCE(*ptr) == val) {
+                        src = this_cpu_ptr(&steal_time);
+                        state = READ_ONCE(src->preempted);
+                        try_cmpxchg(&src->preempted, &state,
+                                    state | KVM_VCPU_IN_PVWAIT);
                         halt();
+                }
         } else {
                 local_irq_disable();
In oversubscribed environments, the latency of flushing the remote TLB can
become significant when the destination virtual CPU (vCPU) is the waiter
of a para-virtualized queued spinlock that halts with interrupts disabled.
This occurs because the waiter does not respond to remote function call
requests until it releases the spinlock. As a result, the source vCPU
wastes CPU time performing busy-waiting for a response from the
destination vCPU.

To mitigate this issue, this patch extends the target of the PV TLB flush
to include vCPUs that are halting to wait on the PV qspinlock. Since the
PV qspinlock waiters voluntarily yield before being preempted by KVM,
their state does not get preempted, and the current PV TLB flush overlooks
them. This change allows vCPUs to bypass waiting for PV qspinlock waiters
during TLB shootdowns.

This enhancement improves the throughput of the ebizzy workload, which
intensively causes spinlock contention and TLB shootdowns, in
oversubscribed environments.

The following experimental setup was used to evaluate the performance
impact:

  Host Machine: Dell R760, Intel(R) Xeon(R) Platinum 8468 (48C/96T), 256GB memory
  VM0: ebizzy -M, 96 vCPUs, 32GB memory
  VM1: busy-loop, 96 vCPUs, 32GB memory
  Experiments Conducted: 10 iterations

Results:
  - Without Patch: 7702.4 records/second (standard deviation: 295.5)
  - With Patch:    9110.9 records/second (standard deviation: 528.6)

Signed-off-by: Kenta Ishiguro <kentaishiguro@sslab.ics.keio.ac.jp>
---
 arch/x86/include/uapi/asm/kvm_para.h |  1 +
 arch/x86/kernel/kvm.c                | 13 +++++++++++--
 2 files changed, 12 insertions(+), 2 deletions(-)