diff mbox series

KVM: X86: Properly account for guest CPU time when considering context tracking

Message ID 1617011036-11734-1-git-send-email-wanpengli@tencent.com (mailing list archive)
State New, archived
Headers show
Series KVM: X86: Properly account for guest CPU time when considering context tracking | expand

Commit Message

Wanpeng Li March 29, 2021, 9:43 a.m. UTC
From: Wanpeng Li <wanpengli@tencent.com>

The bugzilla https://bugzilla.kernel.org/show_bug.cgi?id=209831 
reported that the guest time remains 0 when running a while true 
loop in the guest.

The commit 87fa7f3e98a131 ("x86/kvm: Move context tracking where it 
belongs") moves guest_exit_irqoff() close to vmexit breaks the 
tick-based time accouting when the ticks that happen after IRQs are 
disabled are incorrectly accounted to the host/system time. This is 
because we exit the guest state too early.

vtime-based time accounting is tied to context tracking, keep the 
guest_exit_irqoff() around vmexit code when both vtime-based time 
accounting and specific cpu is context tracking mode active. 
Otherwise, leave guest_exit_irqoff() after handle_exit_irqoff() 
and explicit IRQ window for tick-based time accouting.

Fixes: 87fa7f3e98a131 ("x86/kvm: Move context tracking where it belongs")
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 arch/x86/kvm/svm/svm.c | 3 ++-
 arch/x86/kvm/vmx/vmx.c | 3 ++-
 arch/x86/kvm/x86.c     | 2 ++
 3 files changed, 6 insertions(+), 2 deletions(-)

Comments

Sean Christopherson March 29, 2021, 5:15 p.m. UTC | #1
+Thomas

On Mon, Mar 29, 2021, Wanpeng Li wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
> 
> The bugzilla https://bugzilla.kernel.org/show_bug.cgi?id=209831 
> reported that the guest time remains 0 when running a while true 
> loop in the guest.
> 
> The commit 87fa7f3e98a131 ("x86/kvm: Move context tracking where it 
> belongs") moves guest_exit_irqoff() close to vmexit breaks the 
> tick-based time accouting when the ticks that happen after IRQs are 
> disabled are incorrectly accounted to the host/system time. This is 
> because we exit the guest state too early.
> 
> vtime-based time accounting is tied to context tracking, keep the 
> guest_exit_irqoff() around vmexit code when both vtime-based time 
> accounting and specific cpu is context tracking mode active. 
> Otherwise, leave guest_exit_irqoff() after handle_exit_irqoff() 
> and explicit IRQ window for tick-based time accouting.
> 
> Fixes: 87fa7f3e98a131 ("x86/kvm: Move context tracking where it belongs")
> Cc: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
> ---
>  arch/x86/kvm/svm/svm.c | 3 ++-
>  arch/x86/kvm/vmx/vmx.c | 3 ++-
>  arch/x86/kvm/x86.c     | 2 ++
>  3 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 58a45bb..55fb5ce 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3812,7 +3812,8 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu,
>  	 * into world and some more.
>  	 */
>  	lockdep_hardirqs_off(CALLER_ADDR0);
> -	guest_exit_irqoff();
> +	if (vtime_accounting_enabled_this_cpu())
> +		guest_exit_irqoff();
>  
>  	instrumentation_begin();
>  	trace_hardirqs_off_finish();
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 32cf828..85695b3 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6689,7 +6689,8 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
>  	 * into world and some more.
>  	 */
>  	lockdep_hardirqs_off(CALLER_ADDR0);
> -	guest_exit_irqoff();
> +	if (vtime_accounting_enabled_this_cpu())
> +		guest_exit_irqoff();

This looks ok, as CONFIG_CONTEXT_TRACKING and CONFIG_VIRT_CPU_ACCOUNTING_GEN are
selected by CONFIG_NO_HZ_FULL=y, and can't be enabled independently, e.g. the
rcu_user_exit() call won't be delayed because it will never be called in the
!vtime case.  But it still feels wrong poking into those details, e.g. it'll
be weird and/or wrong guest_exit_irqoff() gains stuff that isn't vtime specific.
Maybe that will never happen though?  And of course, my hack alternative also
pokes into the details[*].

Thomas, do you have an input on the least awful way to handle this?  My horrible
hack was to force PF_VCPU around the window where KVM handles IRQs after guest
exit.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d9f931c63293..6ddf341cd755 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9118,6 +9118,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	vcpu->mode = OUTSIDE_GUEST_MODE;
 	smp_wmb();

+	/*
+	 * Temporarily pretend this task is running a vCPU when potentially
+	 * processing an IRQ exit, including the below opening of an IRQ
+	 * window.  Tick-based accounting of guest time relies on PF_VCPU
+	 * being set when the tick IRQ handler runs.
+	 */
+	current->flags |= PF_VCPU;
 	static_call(kvm_x86_handle_exit_irqoff)(vcpu);

 	/*
@@ -9132,6 +9139,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	++vcpu->stat.exits;
 	local_irq_disable();
 	kvm_after_interrupt(vcpu);
+	current->flags &= ~PF_VCPU;

 	if (lapic_in_kernel(vcpu)) {
 		s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;

[*]https://lkml.kernel.org/r/20210206004218.312023-1-seanjc@google.com

>  	instrumentation_begin();
>  	trace_hardirqs_off_finish();
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index fe806e8..234c8b3 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9185,6 +9185,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  	++vcpu->stat.exits;
>  	local_irq_disable();
>  	kvm_after_interrupt(vcpu);
> +	if (!vtime_accounting_enabled_this_cpu())
> +		guest_exit_irqoff();
>  
>  	if (lapic_in_kernel(vcpu)) {
>  		s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;
> -- 
> 2.7.4
>
Wanpeng Li March 30, 2021, 1:33 a.m. UTC | #2
On Tue, 30 Mar 2021 at 01:15, Sean Christopherson <seanjc@google.com> wrote:
>
> +Thomas
>
> On Mon, Mar 29, 2021, Wanpeng Li wrote:
> > From: Wanpeng Li <wanpengli@tencent.com>
> >
> > The bugzilla https://bugzilla.kernel.org/show_bug.cgi?id=209831
> > reported that the guest time remains 0 when running a while true
> > loop in the guest.
> >
> > The commit 87fa7f3e98a131 ("x86/kvm: Move context tracking where it
> > belongs") moves guest_exit_irqoff() close to vmexit breaks the
> > tick-based time accouting when the ticks that happen after IRQs are
> > disabled are incorrectly accounted to the host/system time. This is
> > because we exit the guest state too early.
> >
> > vtime-based time accounting is tied to context tracking, keep the
> > guest_exit_irqoff() around vmexit code when both vtime-based time
> > accounting and specific cpu is context tracking mode active.
> > Otherwise, leave guest_exit_irqoff() after handle_exit_irqoff()
> > and explicit IRQ window for tick-based time accouting.
> >
> > Fixes: 87fa7f3e98a131 ("x86/kvm: Move context tracking where it belongs")
> > Cc: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
> > ---
> >  arch/x86/kvm/svm/svm.c | 3 ++-
> >  arch/x86/kvm/vmx/vmx.c | 3 ++-
> >  arch/x86/kvm/x86.c     | 2 ++
> >  3 files changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 58a45bb..55fb5ce 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -3812,7 +3812,8 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu,
> >        * into world and some more.
> >        */
> >       lockdep_hardirqs_off(CALLER_ADDR0);
> > -     guest_exit_irqoff();
> > +     if (vtime_accounting_enabled_this_cpu())
> > +             guest_exit_irqoff();
> >
> >       instrumentation_begin();
> >       trace_hardirqs_off_finish();
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 32cf828..85695b3 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -6689,7 +6689,8 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
> >        * into world and some more.
> >        */
> >       lockdep_hardirqs_off(CALLER_ADDR0);
> > -     guest_exit_irqoff();
> > +     if (vtime_accounting_enabled_this_cpu())
> > +             guest_exit_irqoff();
>
> This looks ok, as CONFIG_CONTEXT_TRACKING and CONFIG_VIRT_CPU_ACCOUNTING_GEN are
> selected by CONFIG_NO_HZ_FULL=y, and can't be enabled independently, e.g. the
> rcu_user_exit() call won't be delayed because it will never be called in the
> !vtime case.  But it still feels wrong poking into those details, e.g. it'll
> be weird and/or wrong guest_exit_irqoff() gains stuff that isn't vtime specific.

Could you elaborate what's the meaning of "it'll be weird and/or wrong
guest_exit_irqoff() gains stuff that isn't vtime specific."?

    Wanpeng
Sean Christopherson April 6, 2021, 10:16 p.m. UTC | #3
On Tue, Mar 30, 2021, Wanpeng Li wrote:
> On Tue, 30 Mar 2021 at 01:15, Sean Christopherson <seanjc@google.com> wrote:
> >
> > +Thomas
> >
> > On Mon, Mar 29, 2021, Wanpeng Li wrote:
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index 32cf828..85695b3 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -6689,7 +6689,8 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
> > >        * into world and some more.
> > >        */
> > >       lockdep_hardirqs_off(CALLER_ADDR0);
> > > -     guest_exit_irqoff();
> > > +     if (vtime_accounting_enabled_this_cpu())
> > > +             guest_exit_irqoff();
> >
> > This looks ok, as CONFIG_CONTEXT_TRACKING and CONFIG_VIRT_CPU_ACCOUNTING_GEN are
> > selected by CONFIG_NO_HZ_FULL=y, and can't be enabled independently, e.g. the
> > rcu_user_exit() call won't be delayed because it will never be called in the
> > !vtime case.  But it still feels wrong poking into those details, e.g. it'll
> > be weird and/or wrong guest_exit_irqoff() gains stuff that isn't vtime specific.
> 
> Could you elaborate what's the meaning of "it'll be weird and/or wrong
> guest_exit_irqoff() gains stuff that isn't vtime specific."?

For example, if RCU logic is added to guest_exit_irqoff() that is needed
irrespective of vtime, then KVM will end up with different RCU logic depending
on whether or not vtime is enabled.  RCU is just an example.  My point is that
it doesn't seem impossible that there would be something in the future that
wants to tap into the guest->host transition.

Maybe that never happens and the vtime check is perfectly ok, but for me, the
name guest_exit_irqoff() doesn't sound like something that should hinge on
time accounting being enabled.
diff mbox series

Patch

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 58a45bb..55fb5ce 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3812,7 +3812,8 @@  static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 	 * into world and some more.
 	 */
 	lockdep_hardirqs_off(CALLER_ADDR0);
-	guest_exit_irqoff();
+	if (vtime_accounting_enabled_this_cpu())
+		guest_exit_irqoff();
 
 	instrumentation_begin();
 	trace_hardirqs_off_finish();
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 32cf828..85695b3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6689,7 +6689,8 @@  static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 	 * into world and some more.
 	 */
 	lockdep_hardirqs_off(CALLER_ADDR0);
-	guest_exit_irqoff();
+	if (vtime_accounting_enabled_this_cpu())
+		guest_exit_irqoff();
 
 	instrumentation_begin();
 	trace_hardirqs_off_finish();
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fe806e8..234c8b3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9185,6 +9185,8 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	++vcpu->stat.exits;
 	local_irq_disable();
 	kvm_after_interrupt(vcpu);
+	if (!vtime_accounting_enabled_this_cpu())
+		guest_exit_irqoff();
 
 	if (lapic_in_kernel(vcpu)) {
 		s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;