Fix almost infinite loop in APIC
diff mbox

Message ID 200901091434.39200.sheng@linux.intel.com
State Accepted, archived
Headers show

Commit Message

Sheng Yang Jan. 9, 2009, 6:34 a.m. UTC
On Friday 09 January 2009 00:36:06 Alexander Graf wrote:
> While booting Linux in VMware ESX, I encountered a strange effect
> in the in-kernel lapic implementation: time went backwards!
>
> While this should never occur, because of that the while loop that
> is done after the invalid calculations caused my host system to hang.
>
> In order to make debugging easier, let's replace this as suggested
> with a modulo function and not run into the danger of looping forever.
>
> To replace the nice hint this bug gave me that the values are broken,
> I added a printk message so people encountering this can at least
> see that something is fishy.
>
> Of course, the real issue needs to be fixed as well! I'm open to ideas
> why now < last_update!
>
> (Thanks to Kevin for his help in debugging this)
>
> Signed-off-by: Alexander Graf <agraf@suse.de>
> Signed-off-by: Kevin Wolf <kwolf@suse.de>
> ---

Hi Alexander

I'm a little suspect here:

> 	if (unlikely(ktime_to_ns(now) <=
> 		ktime_to_ns(apic->timer.last_update))) {
> 		/* Wrap around */
> 		passed = ktime_add(( {
> 				    (ktime_t) {
> 				    .tv64 = KTIME_MAX -
> 				    (apic->timer.last_update).tv64}; }
> 				   ), now);
> 		apic_debug("time elapsed\n");
> 	} else
> 		passed = ktime_sub(now, apic->timer.last_update);

And now apic timer base is hr_timer with CLOCK_MONOTONIC, and get_time() is 
really ktime_get() which is almost impossible to wrap around. If it's 
overflow, at least we need a warning. I think this piece of code due to clock 
source change.

So I doubt: due to some reason, now <= apic->timer.last_update, which cause a 
big wrap around operation.

And the most suspect:

> void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec)
> {
> 	struct kvm_lapic *apic = vcpu->arch.apic;
>
> 	if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
> 		apic->timer.last_update = ktime_add_ns(
> 				apic->timer.last_update,
> 				apic->timer.period);
> }

Not sure what's happening, have you tried this? (In fact, I am little willing 
to replace all apic->timer.dev.base->get_time() with more explicit ktime_get() 
as in pit.)

--
 int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)

Comments

Alexander Graf Jan. 9, 2009, 10:49 a.m. UTC | #1
On 09.01.2009, at 07:34, Sheng Yang <sheng@linux.intel.com> wrote:

> On Friday 09 January 2009 00:36:06 Alexander Graf wrote:
>> While booting Linux in VMware ESX, I encountered a strange effect
>> in the in-kernel lapic implementation: time went backwards!
>>
>> While this should never occur, because of that the while loop that
>> is done after the invalid calculations caused my host system to hang.
>>
>> In order to make debugging easier, let's replace this as suggested
>> with a modulo function and not run into the danger of looping  
>> forever.
>>
>> To replace the nice hint this bug gave me that the values are broken,
>> I added a printk message so people encountering this can at least
>> see that something is fishy.
>>
>> Of course, the real issue needs to be fixed as well! I'm open to  
>> ideas
>> why now < last_update!
>>
>> (Thanks to Kevin for his help in debugging this)
>>
>> Signed-off-by: Alexander Graf <agraf@suse.de>
>> Signed-off-by: Kevin Wolf <kwolf@suse.de>
>> ---
>
> Hi Alexander
>
> I'm a little suspect here:
>
>>    if (unlikely(ktime_to_ns(now) <=
>>        ktime_to_ns(apic->timer.last_update))) {
>>        /* Wrap around */
>>        passed = ktime_add(( {
>>                    (ktime_t) {
>>                    .tv64 = KTIME_MAX -
>>                    (apic->timer.last_update).tv64}; }
>>                   ), now);
>>        apic_debug("time elapsed\n");
>>    } else
>>        passed = ktime_sub(now, apic->timer.last_update);
>
> And now apic timer base is hr_timer with CLOCK_MONOTONIC, and  
> get_time() is
> really ktime_get() which is almost impossible to wrap around. If it's
> overflow, at least we need a warning. I think this piece of code due  
> to clock
> source change.
>
> So I doubt: due to some reason, now <= apic->timer.last_update,  
> which cause a
> big wrap around operation.

Well, I put a printk in that 'time elapsed' place and it happened at  
the same time my other printk fired with now < last_update. I can't  
explain why this is happening, but I do see it.

>
>
> And the most suspect:
>
>> void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec)
>> {
>>    struct kvm_lapic *apic = vcpu->arch.apic;
>>
>>    if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
>>        apic->timer.last_update = ktime_add_ns(
>>                apic->timer.last_update,
>>                apic->timer.period);
>> }
>
> Not sure what's happening, have you tried this? (In fact, I am  
> little willing
> to replace all apic->timer.dev.base->get_time() with more explicit  
> ktime_get()
> as in pit.)


I can give that patch a try :)

Alex

>
>
> --
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index afac68c..414e7e0 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1115,9 +1115,7 @@ void kvm_apic_timer_intr_post(struct kvm_vcpu  
> *vcpu, int
> vec)
>        struct kvm_lapic *apic = vcpu->arch.apic;
>
>        if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
> -               apic->timer.last_update = ktime_add_ns(
> -                               apic->timer.last_update,
> -                               apic->timer.period);
> +               apic->timer.last_update = apic->timer.dev.base- 
> >get_time();
> }
>
> int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
>
>
> -- 
> regards
> Yang, Sheng
>
>
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Graf Jan. 9, 2009, 12:57 p.m. UTC | #2
Sheng Yang wrote:
> On Friday 09 January 2009 00:36:06 Alexander Graf wrote:
>   
>> While booting Linux in VMware ESX, I encountered a strange effect
>> in the in-kernel lapic implementation: time went backwards!
>>
>> While this should never occur, because of that the while loop that
>> is done after the invalid calculations caused my host system to hang.
>>
>> In order to make debugging easier, let's replace this as suggested
>> with a modulo function and not run into the danger of looping forever.
>>
>> To replace the nice hint this bug gave me that the values are broken,
>> I added a printk message so people encountering this can at least
>> see that something is fishy.
>>
>> Of course, the real issue needs to be fixed as well! I'm open to ideas
>> why now < last_update!
>>
>> (Thanks to Kevin for his help in debugging this)
>>
>> Signed-off-by: Alexander Graf <agraf@suse.de>
>> Signed-off-by: Kevin Wolf <kwolf@suse.de>
>> ---
>>     
>
> Hi Alexander
>
> I'm a little suspect here:
>
>   
>> 	if (unlikely(ktime_to_ns(now) <=
>> 		ktime_to_ns(apic->timer.last_update))) {
>> 		/* Wrap around */
>> 		passed = ktime_add(( {
>> 				    (ktime_t) {
>> 				    .tv64 = KTIME_MAX -
>> 				    (apic->timer.last_update).tv64}; }
>> 				   ), now);
>> 		apic_debug("time elapsed\n");
>> 	} else
>> 		passed = ktime_sub(now, apic->timer.last_update);
>>     
>
> And now apic timer base is hr_timer with CLOCK_MONOTONIC, and get_time() is 
> really ktime_get() which is almost impossible to wrap around. If it's 
> overflow, at least we need a warning. I think this piece of code due to clock 
> source change.
>
> So I doubt: due to some reason, now <= apic->timer.last_update, which cause a 
> big wrap around operation.
>
> And the most suspect:
>
>   
>> void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec)
>> {
>> 	struct kvm_lapic *apic = vcpu->arch.apic;
>>
>> 	if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
>> 		apic->timer.last_update = ktime_add_ns(
>> 				apic->timer.last_update,
>> 				apic->timer.period);
>> }
>>     
>
> Not sure what's happening, have you tried this? (In fact, I am little willing 
> to replace all apic->timer.dev.base->get_time() with more explicit ktime_get() 
> as in pit.)
>   

Yes, this code is the culprit. Using that patch of yours now is always >
last_update. I'd still like to see that while loop go away ;-).

Alex

> --
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index afac68c..414e7e0 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1115,9 +1115,7 @@ void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int 
> vec)
>         struct kvm_lapic *apic = vcpu->arch.apic;
>
>         if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
> -               apic->timer.last_update = ktime_add_ns(
> -                               apic->timer.last_update,
> -                               apic->timer.period);
> +               apic->timer.last_update = apic->timer.dev.base->get_time();
>  }
>
>  int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
>
>
>   

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sheng Yang Jan. 10, 2009, 11:21 a.m. UTC | #3
On Fri, Jan 09, 2009 at 01:57:30PM +0100, Alexander Graf wrote:
> Sheng Yang wrote:
> > On Friday 09 January 2009 00:36:06 Alexander Graf wrote:
> >   
> >> While booting Linux in VMware ESX, I encountered a strange effect
> >> in the in-kernel lapic implementation: time went backwards!
> >>
> >> While this should never occur, because of that the while loop that
> >> is done after the invalid calculations caused my host system to hang.
> >>
> >> In order to make debugging easier, let's replace this as suggested
> >> with a modulo function and not run into the danger of looping forever.
> >>
> >> To replace the nice hint this bug gave me that the values are broken,
> >> I added a printk message so people encountering this can at least
> >> see that something is fishy.
> >>
> >> Of course, the real issue needs to be fixed as well! I'm open to ideas
> >> why now < last_update!
> >>
> >> (Thanks to Kevin for his help in debugging this)
> >>
> >> Signed-off-by: Alexander Graf <agraf@suse.de>
> >> Signed-off-by: Kevin Wolf <kwolf@suse.de>
> >> ---
> >>     
> >
> > Hi Alexander
> >
> > I'm a little suspect here:
> >
> >   
> >> 	if (unlikely(ktime_to_ns(now) <=
> >> 		ktime_to_ns(apic->timer.last_update))) {
> >> 		/* Wrap around */
> >> 		passed = ktime_add(( {
> >> 				    (ktime_t) {
> >> 				    .tv64 = KTIME_MAX -
> >> 				    (apic->timer.last_update).tv64}; }
> >> 				   ), now);
> >> 		apic_debug("time elapsed\n");
> >> 	} else
> >> 		passed = ktime_sub(now, apic->timer.last_update);
> >>     
> >
> > And now apic timer base is hr_timer with CLOCK_MONOTONIC, and get_time() is 
> > really ktime_get() which is almost impossible to wrap around. If it's 
> > overflow, at least we need a warning. I think this piece of code due to clock 
> > source change.
> >
> > So I doubt: due to some reason, now <= apic->timer.last_update, which cause a 
> > big wrap around operation.
> >
> > And the most suspect:
> >
> >   
> >> void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec)
> >> {
> >> 	struct kvm_lapic *apic = vcpu->arch.apic;
> >>
> >> 	if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
> >> 		apic->timer.last_update = ktime_add_ns(
> >> 				apic->timer.last_update,
> >> 				apic->timer.period);
> >> }
> >>     
> >
> > Not sure what's happening, have you tried this? (In fact, I am little willing 
> > to replace all apic->timer.dev.base->get_time() with more explicit ktime_get() 
> > as in pit.)
> >   
> 
> Yes, this code is the culprit. Using that patch of yours now is always >
> last_update. I'd still like to see that while loop go away ;-).

Glad it works. :)

I think it's a little similar to the PIT problem that Marcelo fixed, but I
am not sure about the reason. Anyone got idea?

Anyway, I would like to post a patch later.
Sheng Yang Jan. 10, 2009, 11:25 a.m. UTC | #4
On Fri, Jan 09, 2009 at 01:57:30PM +0100, Alexander Graf wrote:
> Sheng Yang wrote:
> > On Friday 09 January 2009 00:36:06 Alexander Graf wrote:
> >   
> >> While booting Linux in VMware ESX, I encountered a strange effect
> >> in the in-kernel lapic implementation: time went backwards!
> >>
> >> While this should never occur, because of that the while loop that
> >> is done after the invalid calculations caused my host system to hang.
> >>
> >> In order to make debugging easier, let's replace this as suggested
> >> with a modulo function and not run into the danger of looping forever.
> >>
> >> To replace the nice hint this bug gave me that the values are broken,
> >> I added a printk message so people encountering this can at least
> >> see that something is fishy.
> >>
> >> Of course, the real issue needs to be fixed as well! I'm open to ideas
> >> why now < last_update!
> >>
> >> (Thanks to Kevin for his help in debugging this)
> >>
> >> Signed-off-by: Alexander Graf <agraf@suse.de>
> >> Signed-off-by: Kevin Wolf <kwolf@suse.de>
> >> ---
> >>     
> >
> > Hi Alexander
> >
> > I'm a little suspect here:
> >
> >   
> >> 	if (unlikely(ktime_to_ns(now) <=
> >> 		ktime_to_ns(apic->timer.last_update))) {
> >> 		/* Wrap around */
> >> 		passed = ktime_add(( {
> >> 				    (ktime_t) {
> >> 				    .tv64 = KTIME_MAX -
> >> 				    (apic->timer.last_update).tv64}; }
> >> 				   ), now);
> >> 		apic_debug("time elapsed\n");
> >> 	} else
> >> 		passed = ktime_sub(now, apic->timer.last_update);
> >>     
> >
> > And now apic timer base is hr_timer with CLOCK_MONOTONIC, and get_time() is 
> > really ktime_get() which is almost impossible to wrap around. If it's 
> > overflow, at least we need a warning. I think this piece of code due to clock 
> > source change.
> >
> > So I doubt: due to some reason, now <= apic->timer.last_update, which cause a 
> > big wrap around operation.
> >
> > And the most suspect:
> >
> >   
> >> void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec)
> >> {
> >> 	struct kvm_lapic *apic = vcpu->arch.apic;
> >>
> >> 	if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
> >> 		apic->timer.last_update = ktime_add_ns(
> >> 				apic->timer.last_update,
> >> 				apic->timer.period);
> >> }
> >>     
> >
> > Not sure what's happening, have you tried this? (In fact, I am little willing 
> > to replace all apic->timer.dev.base->get_time() with more explicit ktime_get() 
> > as in pit.)
> >   
> 
> Yes, this code is the culprit. Using that patch of yours now is always >
> last_update. I'd still like to see that while loop go away ;-).

Oh, for the loop, let's ask Avi. :)
Sheng Yang Jan. 10, 2009, 11:28 a.m. UTC | #5
On Fri, Jan 09, 2009 at 01:57:30PM +0100, Alexander Graf wrote:
> Sheng Yang wrote:
> > On Friday 09 January 2009 00:36:06 Alexander Graf wrote:
> >   
> >> While booting Linux in VMware ESX, I encountered a strange effect
> >> in the in-kernel lapic implementation: time went backwards!
> >>
> >> While this should never occur, because of that the while loop that
> >> is done after the invalid calculations caused my host system to hang.
> >>
> >> In order to make debugging easier, let's replace this as suggested
> >> with a modulo function and not run into the danger of looping forever.
> >>
> >> To replace the nice hint this bug gave me that the values are broken,
> >> I added a printk message so people encountering this can at least
> >> see that something is fishy.
> >>
> >> Of course, the real issue needs to be fixed as well! I'm open to ideas
> >> why now < last_update!
> >>
> >> (Thanks to Kevin for his help in debugging this)
> >>
> >> Signed-off-by: Alexander Graf <agraf@suse.de>
> >> Signed-off-by: Kevin Wolf <kwolf@suse.de>
> >> ---
> >>     
> >
> > Hi Alexander
> >
> > I'm a little suspect here:
> >
> >   
> >> 	if (unlikely(ktime_to_ns(now) <=
> >> 		ktime_to_ns(apic->timer.last_update))) {
> >> 		/* Wrap around */
> >> 		passed = ktime_add(( {
> >> 				    (ktime_t) {
> >> 				    .tv64 = KTIME_MAX -
> >> 				    (apic->timer.last_update).tv64}; }
> >> 				   ), now);
> >> 		apic_debug("time elapsed\n");
> >> 	} else
> >> 		passed = ktime_sub(now, apic->timer.last_update);
> >>     
> >
> > And now apic timer base is hr_timer with CLOCK_MONOTONIC, and get_time() is 
> > really ktime_get() which is almost impossible to wrap around. If it's 
> > overflow, at least we need a warning. I think this piece of code due to clock 
> > source change.
> >
> > So I doubt: due to some reason, now <= apic->timer.last_update, which cause a 
> > big wrap around operation.
> >
> > And the most suspect:
> >
> >   
> >> void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec)
> >> {
> >> 	struct kvm_lapic *apic = vcpu->arch.apic;
> >>
> >> 	if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
> >> 		apic->timer.last_update = ktime_add_ns(
> >> 				apic->timer.last_update,
> >> 				apic->timer.period);
> >> }
> >>     
> >
> > Not sure what's happening, have you tried this? (In fact, I am little willing 
> > to replace all apic->timer.dev.base->get_time() with more explicit ktime_get() 
> > as in pit.)
> >   
> 
> Yes, this code is the culprit. Using that patch of yours now is always >
> last_update. I'd still like to see that while loop go away ;-).

(I would say all in one mail, but it always comes a little later... :) )

I think at least part of comment for the while loop should be kept...
Marcelo Tosatti Jan. 11, 2009, 4:55 a.m. UTC | #6
On Sat, Jan 10, 2009 at 07:21:13PM +0800, Sheng Yang wrote:
> On Fri, Jan 09, 2009 at 01:57:30PM +0100, Alexander Graf wrote:
> > Sheng Yang wrote:
> > > On Friday 09 January 2009 00:36:06 Alexander Graf wrote:
> > >   
> > >> While booting Linux in VMware ESX, I encountered a strange effect
> > >> in the in-kernel lapic implementation: time went backwards!
> > >>
> > >> While this should never occur, because of that the while loop that
> > >> is done after the invalid calculations caused my host system to hang.
> > >>
> > >> In order to make debugging easier, let's replace this as suggested
> > >> with a modulo function and not run into the danger of looping forever.
> > >>
> > >> To replace the nice hint this bug gave me that the values are broken,
> > >> I added a printk message so people encountering this can at least
> > >> see that something is fishy.
> > >>
> > >> Of course, the real issue needs to be fixed as well! I'm open to ideas
> > >> why now < last_update!
> > >>
> > >> (Thanks to Kevin for his help in debugging this)
> > >>
> > >> Signed-off-by: Alexander Graf <agraf@suse.de>
> > >> Signed-off-by: Kevin Wolf <kwolf@suse.de>
> > >> ---
> > >>     
> > >
> > > Hi Alexander
> > >
> > > I'm a little suspect here:
> > >
> > >   
> > >> 	if (unlikely(ktime_to_ns(now) <=
> > >> 		ktime_to_ns(apic->timer.last_update))) {
> > >> 		/* Wrap around */
> > >> 		passed = ktime_add(( {
> > >> 				    (ktime_t) {
> > >> 				    .tv64 = KTIME_MAX -
> > >> 				    (apic->timer.last_update).tv64}; }
> > >> 				   ), now);
> > >> 		apic_debug("time elapsed\n");
> > >> 	} else
> > >> 		passed = ktime_sub(now, apic->timer.last_update);
> > >>     
> > >
> > > And now apic timer base is hr_timer with CLOCK_MONOTONIC, and get_time() is 
> > > really ktime_get() which is almost impossible to wrap around. If it's 
> > > overflow, at least we need a warning. I think this piece of code due to clock 
> > > source change.
> > >
> > > So I doubt: due to some reason, now <= apic->timer.last_update, which cause a 
> > > big wrap around operation.
> > >
> > > And the most suspect:
> > >
> > >   
> > >> void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec)
> > >> {
> > >> 	struct kvm_lapic *apic = vcpu->arch.apic;
> > >>
> > >> 	if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
> > >> 		apic->timer.last_update = ktime_add_ns(
> > >> 				apic->timer.last_update,
> > >> 				apic->timer.period);
> > >> }
> > >>     
> > >
> > > Not sure what's happening, have you tried this? (In fact, I am little willing 
> > > to replace all apic->timer.dev.base->get_time() with more explicit ktime_get() 
> > > as in pit.)
> > >   
> > 
> > Yes, this code is the culprit. Using that patch of yours now is always >
> > last_update. I'd still like to see that while loop go away ;-).

Sheng,

Will separate the discussions on items:


1) On the current code, last_update is assigned by kvm_apic_timer_intr_post
as the expiration time of the next interrupt, while apic_get_tmcct
interprets it as "countdown start":

    passed = ktime_sub(now, apic->timer.last_update);

This explains why your first patch against kvm_apic_timer_intr_post
fixed the bug Alexander was experiencing.

BTW, yes, the "overflow" handling is not necessary as you noted.

2) The current count calculation is not right (apart from any eventual
bugs), or I don't understand it. It uses the interrupt injection
time as "countdown start" (or rearm in Linux terminology), in
kvm_apic_timer_intr_post.

But the host hrtimer starts counting once rearmed (assuming periodic
configuration). So the inaccuracy of APIC_TMCCT depends on this delay
between hrtimer rearm and interrupt injection.

What is this scheme supposed to achieve? It seems much simpler, and
more accurate, to return scaled hrtimer_get_remaining on APIC_TMCCT
emulation.

3) And then there's interrupt reinjection. Once interrupts accumulate,
and we reinject, current count reads become completly bogus. This is
the reason for time drift on RHEL4-style kernels with "clock=tsc". The
algorithm calculates the interrupt handling delay by using the PIT
count (equivalent to APIC_TMCCT). But PIT count emulation uses the
hrtimer expiration time, which has nothing to do with the accumulated
interrupts.

So what I propose is to first switch lapic current count emulation to a
straightforward scaled hrtimer_get_remaining equivalent.

For the reinjection case, maintain an average of the delay between host
interrupt and interrupt injection (this can be generic, so both PIT
and LAPIC timer can use it). Return that scaled average on APIC_TMCCT
emulation whenever pending > 2.

What you think?

And as Avi noted, there should be a distinction between accumulated
interrupts, we only want to reinject the ones caused by host load (and
there are some interesting optimizations that can be done later, such
as stopping hrtimer from ticking until guest acks one interrupt, and
calculating the missed ones). But better start with the basic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sheng Yang Jan. 13, 2009, 7:47 a.m. UTC | #7
On Sunday 11 January 2009 12:55:58 Marcelo Tosatti wrote:
> On Sat, Jan 10, 2009 at 07:21:13PM +0800, Sheng Yang wrote:
> > On Fri, Jan 09, 2009 at 01:57:30PM +0100, Alexander Graf wrote:
> > > Sheng Yang wrote:
> > > > On Friday 09 January 2009 00:36:06 Alexander Graf wrote:
> > > >> While booting Linux in VMware ESX, I encountered a strange effect
> > > >> in the in-kernel lapic implementation: time went backwards!
> > > >>
> > > >> ---
> > > >
> > > > Hi Alexander
> > > >
> > > > I'm a little suspect here:
> > > >> 	if (unlikely(ktime_to_ns(now) <=
> > > >> 		ktime_to_ns(apic->timer.last_update))) {
> > > >> 		/* Wrap around */
> > > >> 		passed = ktime_add(( {
> > > >> 				    (ktime_t) {
> > > >> 				    .tv64 = KTIME_MAX -
> > > >> 				    (apic->timer.last_update).tv64}; }
> > > >> 				   ), now);
> > > >> 		apic_debug("time elapsed\n");
> > > >> 	} else
> > > >> 		passed = ktime_sub(now, apic->timer.last_update);
> > > >
> > > > And now apic timer base is hr_timer with CLOCK_MONOTONIC, and
> > > > get_time() is really ktime_get() which is almost impossible to wrap
> > > > around. If it's overflow, at least we need a warning. I think this
> > > > piece of code due to clock source change.
> > > >
> > > > So I doubt: due to some reason, now <= apic->timer.last_update, which
> > > > cause a big wrap around operation.
> > > >
> > > > And the most suspect:
> > > >> void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec)
> > > >> {
> > > >> 	struct kvm_lapic *apic = vcpu->arch.apic;
> > > >>
> > > >> 	if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
> > > >> 		apic->timer.last_update = ktime_add_ns(
> > > >> 				apic->timer.last_update,
> > > >> 				apic->timer.period);
> > > >> }
> > > >
> > > > Not sure what's happening, have you tried this? (In fact, I am little
> > > > willing to replace all apic->timer.dev.base->get_time() with more
> > > > explicit ktime_get() as in pit.)
> > >
> > > Yes, this code is the culprit. Using that patch of yours now is always
> > > > last_update. I'd still like to see that while loop go away ;-).
>
> Sheng,
>
> Will separate the discussions on items:
>
>
> 1) On the current code, last_update is assigned by kvm_apic_timer_intr_post
> as the expiration time of the next interrupt, while apic_get_tmcct
> interprets it as "countdown start":
>
>     passed = ktime_sub(now, apic->timer.last_update);
>
> This explains why your first patch against kvm_apic_timer_intr_post
> fixed the bug Alexander was experiencing.
>
> BTW, yes, the "overflow" handling is not necessary as you noted.
>
> 2) The current count calculation is not right (apart from any eventual
> bugs), or I don't understand it. It uses the interrupt injection
> time as "countdown start" (or rearm in Linux terminology), in
> kvm_apic_timer_intr_post.
>
> But the host hrtimer starts counting once rearmed (assuming periodic
> configuration). So the inaccuracy of APIC_TMCCT depends on this delay
> between hrtimer rearm and interrupt injection.
>
> What is this scheme supposed to achieve? It seems much simpler, and
> more accurate, to return scaled hrtimer_get_remaining on APIC_TMCCT
> emulation.

Marcelo,

Sorry to late reply...

After rechecked the code, got some thoughts. last_update can indicated rearm 
time as following:

The original apic_timer_intr_post() got:

>         if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
>                 apic->timer.last_update = ktime_add_ns(
>                                 apic->timer.last_update,
>                                 apic->timer.period);

I think the purpose is let guest see a reasonable TMCCT. Which means:

1. Timer start to fire at start_apic_timer(). last_update=now(). 

2. Every one LAPIC timer interrupt was injected into the guest, so the time at 
least elapsed timer.period(the first alarm set at "period" later), then 
last_update is increased by period. Notice last_update's base is set before 
timer fire, so it's not indicated the next one, but *the time this interrupt 
should be injected*(time can be delayed)... So last_update = n*period + base.

3. If there is pending interrupt, last_update won't be updated, so we have 
this:

>                         while (counter_passed > tmcct)
>                                 counter_passed -= tmcct;
>                         tmcct -= counter_passed;
(notice tmcct is TMICT here.)

And the returned tmcct value indicated (counter_passed % tmict), a offset 
regardless of pending interrupt numbers.

I think now the overflow seems OK, but I am not sure why last_update can be 
bigger than ktime_get()? Maybe due to vcpu migration? Suspect some racy or 
boundary condition existed...

And base on this, I don't think my quick fix is correct...

>
> 3) And then there's interrupt reinjection. Once interrupts accumulate,
> and we reinject, current count reads become completly bogus. This is
> the reason for time drift on RHEL4-style kernels with "clock=tsc". The
> algorithm calculates the interrupt handling delay by using the PIT
> count (equivalent to APIC_TMCCT). But PIT count emulation uses the
> hrtimer expiration time, which has nothing to do with the accumulated
> interrupts.
>
> So what I propose is to first switch lapic current count emulation to a
> straightforward scaled hrtimer_get_remaining equivalent.
>
> For the reinjection case, maintain an average of the delay between host
> interrupt and interrupt injection (this can be generic, so both PIT
> and LAPIC timer can use it). Return that scaled average on APIC_TMCCT
> emulation whenever pending > 2.
>
> What you think?

1. If we simply use ktime_get() to update last_update, the interval can't be 
same between different last_update. I think we may got some problem here, but 
not for sure. For example, three delayed interrupt was injected one after one, 
last_update would show very little interval, and APIC_TMCCT may got trouble.  
Maintain an average of the delay seems a little tricky here, and I am not sure 
if it would help the problem...  Average is average at most, not the real 
affection at the time...

2. For the current mechanism, the interval is the same, so last_update always 
equal to n*period + base. If there are more than one pending interrupts, TMCCT 
should also can return the relative correct value.

So I am leaning toward to fix current problem, though I haven't find the root 
cause yet...

> And as Avi noted, there should be a distinction between accumulated
> interrupts, we only want to reinject the ones caused by host load (and
> there are some interesting optimizations that can be done later, such
> as stopping hrtimer from ticking until guest acks one interrupt, and
> calculating the missed ones). But better start with the basic.

Agree...
Marcelo Tosatti Jan. 13, 2009, 10:01 p.m. UTC | #8
On Tue, Jan 13, 2009 at 03:47:49PM +0800, Sheng Yang wrote:
> On Sunday 11 January 2009 12:55:58 Marcelo Tosatti wrote:
> The original apic_timer_intr_post() got:
> 
> >         if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
> >                 apic->timer.last_update = ktime_add_ns(
> >                                 apic->timer.last_update,
> >                                 apic->timer.period);
> 
> I think the purpose is let guest see a reasonable TMCCT. Which means:
> 
> 1. Timer start to fire at start_apic_timer(). last_update=now(). 
> 
> 2. Every one LAPIC timer interrupt was injected into the guest, so the time at 
> least elapsed timer.period(the first alarm set at "period" later), then 
> last_update is increased by period. Notice last_update's base is set before 
> timer fire, so it's not indicated the next one, but *the time this interrupt 
> should be injected*(time can be delayed)... So last_update = n*period + base.

Right.

> 3. If there is pending interrupt, last_update won't be updated, so we have 
> this:
> 
> >                         while (counter_passed > tmcct)
> >                                 counter_passed -= tmcct;
> >                         tmcct -= counter_passed;
> (notice tmcct is TMICT here.)
> 
> And the returned tmcct value indicated (counter_passed % tmict), a offset 
> regardless of pending interrupt numbers.

Right. The problem is that for accumulated interrupts, the guest will
receive the interrupt as fast as the qemu process can be scheduled (as
long as its not masked, of course). There could be higher priority
vectors in the mix, but thats not the common case.

So calculating the offset using last interrupt injection is not very 
reasonable in this case.

> I think now the overflow seems OK, but I am not sure why last_update can be 
> bigger than ktime_get()? Maybe due to vcpu migration? Suspect some racy or 
> boundary condition existed...

Yep, it seems to be some small inaccuracy in the accounting that
causes it. 

> And base on this, I don't think my quick fix is correct...
> 
> >
> > 3) And then there's interrupt reinjection. Once interrupts accumulate,
> > and we reinject, current count reads become completly bogus. This is
> > the reason for time drift on RHEL4-style kernels with "clock=tsc". The
> > algorithm calculates the interrupt handling delay by using the PIT
> > count (equivalent to APIC_TMCCT). But PIT count emulation uses the
> > hrtimer expiration time, which has nothing to do with the accumulated
> > interrupts.
> >
> > So what I propose is to first switch lapic current count emulation to a
> > straightforward scaled hrtimer_get_remaining equivalent.
> >
> > For the reinjection case, maintain an average of the delay between host
> > interrupt and interrupt injection (this can be generic, so both PIT
> > and LAPIC timer can use it). Return that scaled average on APIC_TMCCT
> > emulation whenever pending > 2.
> >
> > What you think?
> 
> 1. If we simply use ktime_get() to update last_update, the interval can't be 
> same between different last_update. I think we may got some problem here, but 
> not for sure. For example, three delayed interrupt was injected one after one, 
> last_update would show very little interval, and APIC_TMCCT may got trouble.  

I mean get rid of last_update and use hrtimer_get_remaining on
apic_get_tmcct for no pending interrupts. And estimated delay between
timer fire and injection if there are pending interrupts.

> Maintain an average of the delay seems a little tricky here, and I am not sure 
> if it would help the problem...  Average is average at most, not the real 
> affection at the time...

Yes. Probably using the last delay between timer fire and injection is
more accurate than an average.

> 
> 2. For the current mechanism, the interval is the same, so last_update always 
> equal to n*period + base. If there are more than one pending interrupts, TMCCT 
> should also can return the relative correct value.
> 
> So I am leaning toward to fix current problem, though I haven't find the root 
> cause yet...

I think it can be simpler, without the need to deal with overflow at
all. 

Understanding the present root cause is important though.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sheng Yang Jan. 14, 2009, 9:17 a.m. UTC | #9
On Wednesday 14 January 2009 06:01:17 Marcelo Tosatti wrote:
> On Tue, Jan 13, 2009 at 03:47:49PM +0800, Sheng Yang wrote:
> > On Sunday 11 January 2009 12:55:58 Marcelo Tosatti wrote:
> >
> > The original apic_timer_intr_post() got:
> > >         if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
> > >                 apic->timer.last_update = ktime_add_ns(
> > >                                 apic->timer.last_update,
> > >                                 apic->timer.period);
> >
> > I think the purpose is let guest see a reasonable TMCCT. Which means:
> >
> > 1. Timer start to fire at start_apic_timer(). last_update=now().
> >
> > 2. Every one LAPIC timer interrupt was injected into the guest, so the
> > time at least elapsed timer.period(the first alarm set at "period"
> > later), then last_update is increased by period. Notice last_update's
> > base is set before timer fire, so it's not indicated the next one, but
> > *the time this interrupt should be injected*(time can be delayed)... So
> > last_update = n*period + base.
>
> Right.
>
> > 3. If there is pending interrupt, last_update won't be updated, so we
> > have
> >
> > this:
> > >                         while (counter_passed > tmcct)
> > >                                 counter_passed -= tmcct;
> > >                         tmcct -= counter_passed;
> >
> > (notice tmcct is TMICT here.)
> >
> > And the returned tmcct value indicated (counter_passed % tmict), a offset
> > regardless of pending interrupt numbers.
>
> Right. The problem is that for accumulated interrupts, the guest will
> receive the interrupt as fast as the qemu process can be scheduled (as
> long as its not masked, of course). There could be higher priority
> vectors in the mix, but thats not the common case.
>
> So calculating the offset using last interrupt injection is not very
> reasonable in this case.

Um... I think it's not easy to find a complete reasonable result here, I 
haven't look into how OS us TMCCT, maybe it can help us to find a more 
reasonable result here.

> > I think now the overflow seems OK, but I am not sure why last_update can
> > be bigger than ktime_get()? Maybe due to vcpu migration? Suspect some
> > racy or boundary condition existed...
>
> Yep, it seems to be some small inaccuracy in the accounting that
> causes it.
>
> > And base on this, I don't think my quick fix is correct...
> >
> > > 3) And then there's interrupt reinjection. Once interrupts accumulate,
> > > and we reinject, current count reads become completly bogus. This is
> > > the reason for time drift on RHEL4-style kernels with "clock=tsc". The
> > > algorithm calculates the interrupt handling delay by using the PIT
> > > count (equivalent to APIC_TMCCT). But PIT count emulation uses the
> > > hrtimer expiration time, which has nothing to do with the accumulated
> > > interrupts.
> > >
> > > So what I propose is to first switch lapic current count emulation to a
> > > straightforward scaled hrtimer_get_remaining equivalent.
> > >
> > > For the reinjection case, maintain an average of the delay between host
> > > interrupt and interrupt injection (this can be generic, so both PIT
> > > and LAPIC timer can use it). Return that scaled average on APIC_TMCCT
> > > emulation whenever pending > 2.
> > >
> > > What you think?
> >
> > 1. If we simply use ktime_get() to update last_update, the interval can't
> > be same between different last_update. I think we may got some problem
> > here, but not for sure. For example, three delayed interrupt was injected
> > one after one, last_update would show very little interval, and
> > APIC_TMCCT may got trouble.
>
> I mean get rid of last_update and use hrtimer_get_remaining on
> apic_get_tmcct for no pending interrupts. And estimated delay between
> timer fire and injection if there are pending interrupts.
>
> > Maintain an average of the delay seems a little tricky here, and I am not
> > sure if it would help the problem...  Average is average at most, not the
> > real affection at the time...
>
> Yes. Probably using the last delay between timer fire and injection is
> more accurate than an average.
>
> > 2. For the current mechanism, the interval is the same, so last_update
> > always equal to n*period + base. If there are more than one pending
> > interrupts, TMCCT should also can return the relative correct value.
> >
> > So I am leaning toward to fix current problem, though I haven't find the
> > root cause yet...
>
> I think it can be simpler, without the need to deal with overflow at
> all.
>
> Understanding the present root cause is important though.

Yes, but still don't have many clues now... I would look through the code and 
purpose some patches for Alex to test...

Patch
diff mbox

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index afac68c..414e7e0 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1115,9 +1115,7 @@  void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int 
vec)
        struct kvm_lapic *apic = vcpu->arch.apic;

        if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec)
-               apic->timer.last_update = ktime_add_ns(
-                               apic->timer.last_update,
-                               apic->timer.period);
+               apic->timer.last_update = apic->timer.dev.base->get_time();
 }