Message ID | 1617011036-11734-1-git-send-email-wanpengli@tencent.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM: X86: Properly account for guest CPU time when considering context tracking | expand |
+Thomas On Mon, Mar 29, 2021, Wanpeng Li wrote: > From: Wanpeng Li <wanpengli@tencent.com> > > The bugzilla https://bugzilla.kernel.org/show_bug.cgi?id=209831 > reported that the guest time remains 0 when running a while true > loop in the guest. > > The commit 87fa7f3e98a131 ("x86/kvm: Move context tracking where it > belongs") moves guest_exit_irqoff() close to vmexit breaks the > tick-based time accouting when the ticks that happen after IRQs are > disabled are incorrectly accounted to the host/system time. This is > because we exit the guest state too early. > > vtime-based time accounting is tied to context tracking, keep the > guest_exit_irqoff() around vmexit code when both vtime-based time > accounting and specific cpu is context tracking mode active. > Otherwise, leave guest_exit_irqoff() after handle_exit_irqoff() > and explicit IRQ window for tick-based time accouting. > > Fixes: 87fa7f3e98a131 ("x86/kvm: Move context tracking where it belongs") > Cc: Sean Christopherson <seanjc@google.com> > Signed-off-by: Wanpeng Li <wanpengli@tencent.com> > --- > arch/x86/kvm/svm/svm.c | 3 ++- > arch/x86/kvm/vmx/vmx.c | 3 ++- > arch/x86/kvm/x86.c | 2 ++ > 3 files changed, 6 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c > index 58a45bb..55fb5ce 100644 > --- a/arch/x86/kvm/svm/svm.c > +++ b/arch/x86/kvm/svm/svm.c > @@ -3812,7 +3812,8 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu, > * into world and some more. > */ > lockdep_hardirqs_off(CALLER_ADDR0); > - guest_exit_irqoff(); > + if (vtime_accounting_enabled_this_cpu()) > + guest_exit_irqoff(); > > instrumentation_begin(); > trace_hardirqs_off_finish(); > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index 32cf828..85695b3 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -6689,7 +6689,8 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu, > * into world and some more. > */ > lockdep_hardirqs_off(CALLER_ADDR0); > - guest_exit_irqoff(); > + if (vtime_accounting_enabled_this_cpu()) > + guest_exit_irqoff(); This looks ok, as CONFIG_CONTEXT_TRACKING and CONFIG_VIRT_CPU_ACCOUNTING_GEN are selected by CONFIG_NO_HZ_FULL=y, and can't be enabled independently, e.g. the rcu_user_exit() call won't be delayed because it will never be called in the !vtime case. But it still feels wrong poking into those details, e.g. it'll be weird and/or wrong guest_exit_irqoff() gains stuff that isn't vtime specific. Maybe that will never happen though? And of course, my hack alternative also pokes into the details[*]. Thomas, do you have an input on the least awful way to handle this? My horrible hack was to force PF_VCPU around the window where KVM handles IRQs after guest exit. diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d9f931c63293..6ddf341cd755 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9118,6 +9118,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) vcpu->mode = OUTSIDE_GUEST_MODE; smp_wmb(); + /* + * Temporarily pretend this task is running a vCPU when potentially + * processing an IRQ exit, including the below opening of an IRQ + * window. Tick-based accounting of guest time relies on PF_VCPU + * being set when the tick IRQ handler runs. + */ + current->flags |= PF_VCPU; static_call(kvm_x86_handle_exit_irqoff)(vcpu); /* @@ -9132,6 +9139,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) ++vcpu->stat.exits; local_irq_disable(); kvm_after_interrupt(vcpu); + current->flags &= ~PF_VCPU; if (lapic_in_kernel(vcpu)) { s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta; [*]https://lkml.kernel.org/r/20210206004218.312023-1-seanjc@google.com > instrumentation_begin(); > trace_hardirqs_off_finish(); > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index fe806e8..234c8b3 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -9185,6 +9185,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > ++vcpu->stat.exits; > local_irq_disable(); > kvm_after_interrupt(vcpu); > + if (!vtime_accounting_enabled_this_cpu()) > + guest_exit_irqoff(); > > if (lapic_in_kernel(vcpu)) { > s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta; > -- > 2.7.4 >
On Tue, 30 Mar 2021 at 01:15, Sean Christopherson <seanjc@google.com> wrote: > > +Thomas > > On Mon, Mar 29, 2021, Wanpeng Li wrote: > > From: Wanpeng Li <wanpengli@tencent.com> > > > > The bugzilla https://bugzilla.kernel.org/show_bug.cgi?id=209831 > > reported that the guest time remains 0 when running a while true > > loop in the guest. > > > > The commit 87fa7f3e98a131 ("x86/kvm: Move context tracking where it > > belongs") moves guest_exit_irqoff() close to vmexit breaks the > > tick-based time accouting when the ticks that happen after IRQs are > > disabled are incorrectly accounted to the host/system time. This is > > because we exit the guest state too early. > > > > vtime-based time accounting is tied to context tracking, keep the > > guest_exit_irqoff() around vmexit code when both vtime-based time > > accounting and specific cpu is context tracking mode active. > > Otherwise, leave guest_exit_irqoff() after handle_exit_irqoff() > > and explicit IRQ window for tick-based time accouting. > > > > Fixes: 87fa7f3e98a131 ("x86/kvm: Move context tracking where it belongs") > > Cc: Sean Christopherson <seanjc@google.com> > > Signed-off-by: Wanpeng Li <wanpengli@tencent.com> > > --- > > arch/x86/kvm/svm/svm.c | 3 ++- > > arch/x86/kvm/vmx/vmx.c | 3 ++- > > arch/x86/kvm/x86.c | 2 ++ > > 3 files changed, 6 insertions(+), 2 deletions(-) > > > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c > > index 58a45bb..55fb5ce 100644 > > --- a/arch/x86/kvm/svm/svm.c > > +++ b/arch/x86/kvm/svm/svm.c > > @@ -3812,7 +3812,8 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu, > > * into world and some more. > > */ > > lockdep_hardirqs_off(CALLER_ADDR0); > > - guest_exit_irqoff(); > > + if (vtime_accounting_enabled_this_cpu()) > > + guest_exit_irqoff(); > > > > instrumentation_begin(); > > trace_hardirqs_off_finish(); > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > > index 32cf828..85695b3 100644 > > --- a/arch/x86/kvm/vmx/vmx.c > > +++ b/arch/x86/kvm/vmx/vmx.c > > @@ -6689,7 +6689,8 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu, > > * into world and some more. > > */ > > lockdep_hardirqs_off(CALLER_ADDR0); > > - guest_exit_irqoff(); > > + if (vtime_accounting_enabled_this_cpu()) > > + guest_exit_irqoff(); > > This looks ok, as CONFIG_CONTEXT_TRACKING and CONFIG_VIRT_CPU_ACCOUNTING_GEN are > selected by CONFIG_NO_HZ_FULL=y, and can't be enabled independently, e.g. the > rcu_user_exit() call won't be delayed because it will never be called in the > !vtime case. But it still feels wrong poking into those details, e.g. it'll > be weird and/or wrong guest_exit_irqoff() gains stuff that isn't vtime specific. Could you elaborate what's the meaning of "it'll be weird and/or wrong guest_exit_irqoff() gains stuff that isn't vtime specific."? Wanpeng
On Tue, Mar 30, 2021, Wanpeng Li wrote: > On Tue, 30 Mar 2021 at 01:15, Sean Christopherson <seanjc@google.com> wrote: > > > > +Thomas > > > > On Mon, Mar 29, 2021, Wanpeng Li wrote: > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > > > index 32cf828..85695b3 100644 > > > --- a/arch/x86/kvm/vmx/vmx.c > > > +++ b/arch/x86/kvm/vmx/vmx.c > > > @@ -6689,7 +6689,8 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu, > > > * into world and some more. > > > */ > > > lockdep_hardirqs_off(CALLER_ADDR0); > > > - guest_exit_irqoff(); > > > + if (vtime_accounting_enabled_this_cpu()) > > > + guest_exit_irqoff(); > > > > This looks ok, as CONFIG_CONTEXT_TRACKING and CONFIG_VIRT_CPU_ACCOUNTING_GEN are > > selected by CONFIG_NO_HZ_FULL=y, and can't be enabled independently, e.g. the > > rcu_user_exit() call won't be delayed because it will never be called in the > > !vtime case. But it still feels wrong poking into those details, e.g. it'll > > be weird and/or wrong guest_exit_irqoff() gains stuff that isn't vtime specific. > > Could you elaborate what's the meaning of "it'll be weird and/or wrong > guest_exit_irqoff() gains stuff that isn't vtime specific."? For example, if RCU logic is added to guest_exit_irqoff() that is needed irrespective of vtime, then KVM will end up with different RCU logic depending on whether or not vtime is enabled. RCU is just an example. My point is that it doesn't seem impossible that there would be something in the future that wants to tap into the guest->host transition. Maybe that never happens and the vtime check is perfectly ok, but for me, the name guest_exit_irqoff() doesn't sound like something that should hinge on time accounting being enabled.
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 58a45bb..55fb5ce 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3812,7 +3812,8 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu, * into world and some more. */ lockdep_hardirqs_off(CALLER_ADDR0); - guest_exit_irqoff(); + if (vtime_accounting_enabled_this_cpu()) + guest_exit_irqoff(); instrumentation_begin(); trace_hardirqs_off_finish(); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 32cf828..85695b3 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6689,7 +6689,8 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu, * into world and some more. */ lockdep_hardirqs_off(CALLER_ADDR0); - guest_exit_irqoff(); + if (vtime_accounting_enabled_this_cpu()) + guest_exit_irqoff(); instrumentation_begin(); trace_hardirqs_off_finish(); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fe806e8..234c8b3 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9185,6 +9185,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) ++vcpu->stat.exits; local_irq_disable(); kvm_after_interrupt(vcpu); + if (!vtime_accounting_enabled_this_cpu()) + guest_exit_irqoff(); if (lapic_in_kernel(vcpu)) { s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;