diff mbox

[5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes

Message ID 20141008100619.GA20422@potion.brq.redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Radim Krčmář Oct. 8, 2014, 10:06 a.m. UTC
2014-10-07 12:35+0300, Nadav Amit:
> Thanks for reviewing this patch and the rest of the gang.

Happy to do so, I've learned a lot.

> On Oct 6, 2014, at 11:57 PM, Radim Kr?má? <rkrcmar@redhat.com> wrote:
> > 2014-10-03 01:10+0300, Nadav Amit:
> >> Setting the TSC deadline MSR that are initiated by the host (using ioctl's) may
> >> cause superfluous interrupt.  This occurs in the following case:
> >> 
> >> 1. A TSC deadline timer interrupt is pending.
> >> 2. TSC deadline was still not cleared (which happens during vcpu_run).
> >> 3. Userspace uses KVM_GET_MSRS/KVM_SET_MSRS to load the same deadline msr.
> >> 
> >> To solve this situation, ignore host initiated TSC deadline writes that do not
> >> change the deadline value.
> > 
> > I find this change slightly dubious …
> Why? I see similar handling of MSR_TSC_ADJUST.

In other modes, we don't inject pending timer when writing to
APIC_TMICT.  (Which, sadly, is inconsistent with APIC_LVTT.)

Adding a workaround is usually worse than removing the reason ...

> > - why does the userspace do that?
> It seems qemu’s kvm_cpu_exec does so when it calls kvm_arch_put_registers.
> It is pretty much done after every exit to userspace.

Thanks, it really doesn't do much checking of what is needed.

> > - why is host_initiated required?
> Since if the guest writes to the MSR, it means it wants to rearm the TSC deadline. Even if the deadline passed, interrupt should be triggered.

MSR isn't 0, so the deadline hasn't passed for the guest yet.

> If the guest writes the same value on the deadline MSR twice, it might expect two interrupts.

When guest writes to it without getting an interrupt first, it might
expect just one.  (Which it better IMO.)

> > 
> > Thanks.
> > 
> > It seems like an performance improvement, so why shouldn't return when
> >  'data <= tscdeadline && pending()'
> > or even
> >  'data <= now() && pending()'
> > 
> > (Sorry, I ran out of time to search for answers today.)
> The bug I encountered is not a performance issue, but a superfluous interrupt (functional bug).

True.

> As I said, the guest may write a new deadline MSR value which is in the past and expect an interrupt.

And it would get one from the currently pending timer.

What about the following patch?
(The introduced else branch could use some abstractions.)

--8<---
KVM: x86: fix deadline tsc interrupt injection

The check in kvm_set_lapic_tscdeadline_msr() was trying to prevent a
situation where we lose a pending deadline timer in a MSR write.
Losing it is fine, because it effectively occurs before the timer fired,
so we should be able to cancel or postpone it.

Another problem comes from interaction with QEMU, or other userspace
that can set deadline MSR without a good reason, when timer is already
pending:  one guest's deadline request results in more than one
interrupt because one is injected immediately on MSR write from
userspace and one through hrtimer later.

The solution is to remove the injection when replacing a pending timer
and to improve the usual QEMU path, we inject without a hrtimer when the
deadline has already passed.

Signed-off-by: Radim Kr?má? <rkrcmar@redhat.com>
Reported-by: Nadav Amit <namit@cs.technion.ac.il>
---
 arch/x86/kvm/lapic.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Paolo Bonzini Oct. 8, 2014, 10:07 a.m. UTC | #1
Il 08/10/2014 12:06, Radim Kr?má? ha scritto:
> > > - why is host_initiated required?
> 
> > Since if the guest writes to the MSR, it means it wants to rearm the TSC deadline. Even if the deadline passed, interrupt should be triggered.
> 
> MSR isn't 0, so the deadline hasn't passed for the guest yet.
> 
> > If the guest writes the same value on the deadline MSR twice, it might expect two interrupts.
> 
> When guest writes to it without getting an interrupt first, it might
> expect just one.  (Which it better IMO.)

Indeed that was my doubt as well.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nadav Amit Oct. 10, 2014, 1:55 a.m. UTC | #2
On Oct 8, 2014, at 1:06 PM, Radim Kr?má? <rkrcmar@redhat.com> wrote:

> 
> And it would get one from the currently pending timer.
> 
> What about the following patch?
> (The introduced else branch could use some abstractions.)
> 
> --8<---
> KVM: x86: fix deadline tsc interrupt injection
> 
> The check in kvm_set_lapic_tscdeadline_msr() was trying to prevent a
> situation where we lose a pending deadline timer in a MSR write.
> Losing it is fine, because it effectively occurs before the timer fired,
> so we should be able to cancel or postpone it.
> 
> Another problem comes from interaction with QEMU, or other userspace
> that can set deadline MSR without a good reason, when timer is already
> pending:  one guest's deadline request results in more than one
> interrupt because one is injected immediately on MSR write from
> userspace and one through hrtimer later.
> 
> The solution is to remove the injection when replacing a pending timer
> and to improve the usual QEMU path, we inject without a hrtimer when the
> deadline has already passed.
> 
> Signed-off-by: Radim Kr?má? <rkrcmar@redhat.com>
> Reported-by: Nadav Amit <namit@cs.technion.ac.il>
> ---
> arch/x86/kvm/lapic.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index b8345dd..51428dd 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1096,9 +1096,12 @@ static void start_apic_timer(struct kvm_lapic *apic)
>  		if (likely(tscdeadline > guest_tsc)) {
>  			ns = (tscdeadline - guest_tsc) * 1000000ULL;
>  			do_div(ns, this_tsc_khz);
> +			hrtimer_start(&apic->lapic_timer.timer,
> +				ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
> +		} else {
> +			atomic_inc(&ktimer->pending);
> +			kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
>  		}
> -		hrtimer_start(&apic->lapic_timer.timer,
> -			ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
> 
>  		local_irq_restore(flags);
>  	}
> @@ -1355,9 +1358,6 @@ void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data)
>  		return;
> 
>  	hrtimer_cancel(&apic->lapic_timer.timer);
> -	/* Inject here so clearing tscdeadline won't override new value */
> -	if (apic_has_pending_timer(vcpu))
> -		kvm_inject_apic_timer_irqs(vcpu);
>  	apic->lapic_timer.tscdeadline = data;
>  	start_apic_timer(apic);
> }

Perhaps I am missing something, but I don’t see how it solves the problem I encountered.
Recall the scenario:
1. A TSC deadline timer interrupt is pending.
2. TSC deadline was still not cleared (which happens during vcpu_run).
3. Userspace uses KVM_GET_MSRS/KVM_SET_MSRS to load the same deadline msr.

It appears that in such scenario the guest would still get spurious interrupt for no reason, as ktimer->pending may already be increased in apic_timer_fn.

Second, I think that the solution I proposed would perform better. Currently, there are many unnecessary cancellations and setups of the timer. This solution does not resolve this problem.

Last, I think that having less interrupts on deadline changes is not completely according to the SDM which says: "If software disarms the timer or postpones the deadline, race conditions may result in the delivery of a spurious timer interrupt.” It never says interrupts may be lost if you reprogram the deadline before you check it expired.

Thanks,
Nadav

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini Oct. 10, 2014, 2:02 p.m. UTC | #3
Il 08/10/2014 12:06, Radim Kr?má? ha scritto:
> KVM: x86: fix deadline tsc interrupt injection
> 
> The check in kvm_set_lapic_tscdeadline_msr() was trying to prevent a
> situation where we lose a pending deadline timer in a MSR write.
> Losing it is fine, because it effectively occurs before the timer fired,
> so we should be able to cancel or postpone it.
> 
> Another problem comes from interaction with QEMU, or other userspace
> that can set deadline MSR without a good reason, when timer is already
> pending:  one guest's deadline request results in more than one
> interrupt because one is injected immediately on MSR write from
> userspace and one through hrtimer later.
> 
> The solution is to remove the injection when replacing a pending timer
> and to improve the usual QEMU path, we inject without a hrtimer when the
> deadline has already passed.
> 
> Signed-off-by: Radim Kr?má? <rkrcmar@redhat.com>
> Reported-by: Nadav Amit <namit@cs.technion.ac.il>
> ---
>  arch/x86/kvm/lapic.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index b8345dd..51428dd 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1096,9 +1096,12 @@ static void start_apic_timer(struct kvm_lapic *apic)
>  		if (likely(tscdeadline > guest_tsc)) {
>  			ns = (tscdeadline - guest_tsc) * 1000000ULL;
>  			do_div(ns, this_tsc_khz);
> +			hrtimer_start(&apic->lapic_timer.timer,
> +				ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
> +		} else {
> +			atomic_inc(&ktimer->pending);
> +			kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
>  		}
> -		hrtimer_start(&apic->lapic_timer.timer,
> -			ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
>  
>  		local_irq_restore(flags);
>  	}
> @@ -1355,9 +1358,6 @@ void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data)
>  		return;
>  
>  	hrtimer_cancel(&apic->lapic_timer.timer);
> -	/* Inject here so clearing tscdeadline won't override new value */
> -	if (apic_has_pending_timer(vcpu))
> -		kvm_inject_apic_timer_irqs(vcpu);
>  	apic->lapic_timer.tscdeadline = data;
>  	start_apic_timer(apic);
>  }

Radim, the patch looks good but please extract this code:

        /*
         * There is a race window between reading and incrementing, but we do
         * not care about potentially losing timer events in the !reinject
         * case anyway. Note: KVM_REQ_PENDING_TIMER is implicitly checked
         * in vcpu_enter_guest.
         */
        if (!atomic_read(&ktimer->pending)) {
                atomic_inc(&ktimer->pending);
                /* FIXME: this code should not know anything about vcpus */
                kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
        }

        if (waitqueue_active(q))
                wake_up_interruptible(q);

to a new "static void apic_timer_expired(struct kvm_lapic *apic)" function,
and call it from both apic_timer_fn and start_apic_timer.

Also, we should not need to do wake_up_interruptible() unless we have
changed ktimer->pending from zero to non-zero.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index b8345dd..51428dd 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1096,9 +1096,12 @@  static void start_apic_timer(struct kvm_lapic *apic)
 		if (likely(tscdeadline > guest_tsc)) {
 			ns = (tscdeadline - guest_tsc) * 1000000ULL;
 			do_div(ns, this_tsc_khz);
+			hrtimer_start(&apic->lapic_timer.timer,
+				ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
+		} else {
+			atomic_inc(&ktimer->pending);
+			kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
 		}
-		hrtimer_start(&apic->lapic_timer.timer,
-			ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
 
 		local_irq_restore(flags);
 	}
@@ -1355,9 +1358,6 @@  void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data)
 		return;
 
 	hrtimer_cancel(&apic->lapic_timer.timer);
-	/* Inject here so clearing tscdeadline won't override new value */
-	if (apic_has_pending_timer(vcpu))
-		kvm_inject_apic_timer_irqs(vcpu);
 	apic->lapic_timer.tscdeadline = data;
 	start_apic_timer(apic);
 }