Message ID | 200901091434.39200.sheng@linux.intel.com (mailing list archive) |
---|---|
State | Accepted, archived |
Headers | show |
On 09.01.2009, at 07:34, Sheng Yang <sheng@linux.intel.com> wrote: > On Friday 09 January 2009 00:36:06 Alexander Graf wrote: >> While booting Linux in VMware ESX, I encountered a strange effect >> in the in-kernel lapic implementation: time went backwards! >> >> While this should never occur, because of that the while loop that >> is done after the invalid calculations caused my host system to hang. >> >> In order to make debugging easier, let's replace this as suggested >> with a modulo function and not run into the danger of looping >> forever. >> >> To replace the nice hint this bug gave me that the values are broken, >> I added a printk message so people encountering this can at least >> see that something is fishy. >> >> Of course, the real issue needs to be fixed as well! I'm open to >> ideas >> why now < last_update! >> >> (Thanks to Kevin for his help in debugging this) >> >> Signed-off-by: Alexander Graf <agraf@suse.de> >> Signed-off-by: Kevin Wolf <kwolf@suse.de> >> --- > > Hi Alexander > > I'm a little suspect here: > >> if (unlikely(ktime_to_ns(now) <= >> ktime_to_ns(apic->timer.last_update))) { >> /* Wrap around */ >> passed = ktime_add(( { >> (ktime_t) { >> .tv64 = KTIME_MAX - >> (apic->timer.last_update).tv64}; } >> ), now); >> apic_debug("time elapsed\n"); >> } else >> passed = ktime_sub(now, apic->timer.last_update); > > And now apic timer base is hr_timer with CLOCK_MONOTONIC, and > get_time() is > really ktime_get() which is almost impossible to wrap around. If it's > overflow, at least we need a warning. I think this piece of code due > to clock > source change. > > So I doubt: due to some reason, now <= apic->timer.last_update, > which cause a > big wrap around operation. Well, I put a printk in that 'time elapsed' place and it happened at the same time my other printk fired with now < last_update. I can't explain why this is happening, but I do see it. > > > And the most suspect: > >> void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec) >> { >> struct kvm_lapic *apic = vcpu->arch.apic; >> >> if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec) >> apic->timer.last_update = ktime_add_ns( >> apic->timer.last_update, >> apic->timer.period); >> } > > Not sure what's happening, have you tried this? (In fact, I am > little willing > to replace all apic->timer.dev.base->get_time() with more explicit > ktime_get() > as in pit.) I can give that patch a try :) Alex > > > -- > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c > index afac68c..414e7e0 100644 > --- a/arch/x86/kvm/lapic.c > +++ b/arch/x86/kvm/lapic.c > @@ -1115,9 +1115,7 @@ void kvm_apic_timer_intr_post(struct kvm_vcpu > *vcpu, int > vec) > struct kvm_lapic *apic = vcpu->arch.apic; > > if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec) > - apic->timer.last_update = ktime_add_ns( > - apic->timer.last_update, > - apic->timer.period); > + apic->timer.last_update = apic->timer.dev.base- > >get_time(); > } > > int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu) > > > -- > regards > Yang, Sheng > > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Sheng Yang wrote: > On Friday 09 January 2009 00:36:06 Alexander Graf wrote: > >> While booting Linux in VMware ESX, I encountered a strange effect >> in the in-kernel lapic implementation: time went backwards! >> >> While this should never occur, because of that the while loop that >> is done after the invalid calculations caused my host system to hang. >> >> In order to make debugging easier, let's replace this as suggested >> with a modulo function and not run into the danger of looping forever. >> >> To replace the nice hint this bug gave me that the values are broken, >> I added a printk message so people encountering this can at least >> see that something is fishy. >> >> Of course, the real issue needs to be fixed as well! I'm open to ideas >> why now < last_update! >> >> (Thanks to Kevin for his help in debugging this) >> >> Signed-off-by: Alexander Graf <agraf@suse.de> >> Signed-off-by: Kevin Wolf <kwolf@suse.de> >> --- >> > > Hi Alexander > > I'm a little suspect here: > > >> if (unlikely(ktime_to_ns(now) <= >> ktime_to_ns(apic->timer.last_update))) { >> /* Wrap around */ >> passed = ktime_add(( { >> (ktime_t) { >> .tv64 = KTIME_MAX - >> (apic->timer.last_update).tv64}; } >> ), now); >> apic_debug("time elapsed\n"); >> } else >> passed = ktime_sub(now, apic->timer.last_update); >> > > And now apic timer base is hr_timer with CLOCK_MONOTONIC, and get_time() is > really ktime_get() which is almost impossible to wrap around. If it's > overflow, at least we need a warning. I think this piece of code due to clock > source change. > > So I doubt: due to some reason, now <= apic->timer.last_update, which cause a > big wrap around operation. > > And the most suspect: > > >> void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec) >> { >> struct kvm_lapic *apic = vcpu->arch.apic; >> >> if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec) >> apic->timer.last_update = ktime_add_ns( >> apic->timer.last_update, >> apic->timer.period); >> } >> > > Not sure what's happening, have you tried this? (In fact, I am little willing > to replace all apic->timer.dev.base->get_time() with more explicit ktime_get() > as in pit.) > Yes, this code is the culprit. Using that patch of yours now is always > last_update. I'd still like to see that while loop go away ;-). Alex > -- > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c > index afac68c..414e7e0 100644 > --- a/arch/x86/kvm/lapic.c > +++ b/arch/x86/kvm/lapic.c > @@ -1115,9 +1115,7 @@ void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int > vec) > struct kvm_lapic *apic = vcpu->arch.apic; > > if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec) > - apic->timer.last_update = ktime_add_ns( > - apic->timer.last_update, > - apic->timer.period); > + apic->timer.last_update = apic->timer.dev.base->get_time(); > } > > int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu) > > > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jan 09, 2009 at 01:57:30PM +0100, Alexander Graf wrote: > Sheng Yang wrote: > > On Friday 09 January 2009 00:36:06 Alexander Graf wrote: > > > >> While booting Linux in VMware ESX, I encountered a strange effect > >> in the in-kernel lapic implementation: time went backwards! > >> > >> While this should never occur, because of that the while loop that > >> is done after the invalid calculations caused my host system to hang. > >> > >> In order to make debugging easier, let's replace this as suggested > >> with a modulo function and not run into the danger of looping forever. > >> > >> To replace the nice hint this bug gave me that the values are broken, > >> I added a printk message so people encountering this can at least > >> see that something is fishy. > >> > >> Of course, the real issue needs to be fixed as well! I'm open to ideas > >> why now < last_update! > >> > >> (Thanks to Kevin for his help in debugging this) > >> > >> Signed-off-by: Alexander Graf <agraf@suse.de> > >> Signed-off-by: Kevin Wolf <kwolf@suse.de> > >> --- > >> > > > > Hi Alexander > > > > I'm a little suspect here: > > > > > >> if (unlikely(ktime_to_ns(now) <= > >> ktime_to_ns(apic->timer.last_update))) { > >> /* Wrap around */ > >> passed = ktime_add(( { > >> (ktime_t) { > >> .tv64 = KTIME_MAX - > >> (apic->timer.last_update).tv64}; } > >> ), now); > >> apic_debug("time elapsed\n"); > >> } else > >> passed = ktime_sub(now, apic->timer.last_update); > >> > > > > And now apic timer base is hr_timer with CLOCK_MONOTONIC, and get_time() is > > really ktime_get() which is almost impossible to wrap around. If it's > > overflow, at least we need a warning. I think this piece of code due to clock > > source change. > > > > So I doubt: due to some reason, now <= apic->timer.last_update, which cause a > > big wrap around operation. > > > > And the most suspect: > > > > > >> void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec) > >> { > >> struct kvm_lapic *apic = vcpu->arch.apic; > >> > >> if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec) > >> apic->timer.last_update = ktime_add_ns( > >> apic->timer.last_update, > >> apic->timer.period); > >> } > >> > > > > Not sure what's happening, have you tried this? (In fact, I am little willing > > to replace all apic->timer.dev.base->get_time() with more explicit ktime_get() > > as in pit.) > > > > Yes, this code is the culprit. Using that patch of yours now is always > > last_update. I'd still like to see that while loop go away ;-). Glad it works. :) I think it's a little similar to the PIT problem that Marcelo fixed, but I am not sure about the reason. Anyone got idea? Anyway, I would like to post a patch later.
On Fri, Jan 09, 2009 at 01:57:30PM +0100, Alexander Graf wrote: > Sheng Yang wrote: > > On Friday 09 January 2009 00:36:06 Alexander Graf wrote: > > > >> While booting Linux in VMware ESX, I encountered a strange effect > >> in the in-kernel lapic implementation: time went backwards! > >> > >> While this should never occur, because of that the while loop that > >> is done after the invalid calculations caused my host system to hang. > >> > >> In order to make debugging easier, let's replace this as suggested > >> with a modulo function and not run into the danger of looping forever. > >> > >> To replace the nice hint this bug gave me that the values are broken, > >> I added a printk message so people encountering this can at least > >> see that something is fishy. > >> > >> Of course, the real issue needs to be fixed as well! I'm open to ideas > >> why now < last_update! > >> > >> (Thanks to Kevin for his help in debugging this) > >> > >> Signed-off-by: Alexander Graf <agraf@suse.de> > >> Signed-off-by: Kevin Wolf <kwolf@suse.de> > >> --- > >> > > > > Hi Alexander > > > > I'm a little suspect here: > > > > > >> if (unlikely(ktime_to_ns(now) <= > >> ktime_to_ns(apic->timer.last_update))) { > >> /* Wrap around */ > >> passed = ktime_add(( { > >> (ktime_t) { > >> .tv64 = KTIME_MAX - > >> (apic->timer.last_update).tv64}; } > >> ), now); > >> apic_debug("time elapsed\n"); > >> } else > >> passed = ktime_sub(now, apic->timer.last_update); > >> > > > > And now apic timer base is hr_timer with CLOCK_MONOTONIC, and get_time() is > > really ktime_get() which is almost impossible to wrap around. If it's > > overflow, at least we need a warning. I think this piece of code due to clock > > source change. > > > > So I doubt: due to some reason, now <= apic->timer.last_update, which cause a > > big wrap around operation. > > > > And the most suspect: > > > > > >> void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec) > >> { > >> struct kvm_lapic *apic = vcpu->arch.apic; > >> > >> if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec) > >> apic->timer.last_update = ktime_add_ns( > >> apic->timer.last_update, > >> apic->timer.period); > >> } > >> > > > > Not sure what's happening, have you tried this? (In fact, I am little willing > > to replace all apic->timer.dev.base->get_time() with more explicit ktime_get() > > as in pit.) > > > > Yes, this code is the culprit. Using that patch of yours now is always > > last_update. I'd still like to see that while loop go away ;-). Oh, for the loop, let's ask Avi. :)
On Fri, Jan 09, 2009 at 01:57:30PM +0100, Alexander Graf wrote: > Sheng Yang wrote: > > On Friday 09 January 2009 00:36:06 Alexander Graf wrote: > > > >> While booting Linux in VMware ESX, I encountered a strange effect > >> in the in-kernel lapic implementation: time went backwards! > >> > >> While this should never occur, because of that the while loop that > >> is done after the invalid calculations caused my host system to hang. > >> > >> In order to make debugging easier, let's replace this as suggested > >> with a modulo function and not run into the danger of looping forever. > >> > >> To replace the nice hint this bug gave me that the values are broken, > >> I added a printk message so people encountering this can at least > >> see that something is fishy. > >> > >> Of course, the real issue needs to be fixed as well! I'm open to ideas > >> why now < last_update! > >> > >> (Thanks to Kevin for his help in debugging this) > >> > >> Signed-off-by: Alexander Graf <agraf@suse.de> > >> Signed-off-by: Kevin Wolf <kwolf@suse.de> > >> --- > >> > > > > Hi Alexander > > > > I'm a little suspect here: > > > > > >> if (unlikely(ktime_to_ns(now) <= > >> ktime_to_ns(apic->timer.last_update))) { > >> /* Wrap around */ > >> passed = ktime_add(( { > >> (ktime_t) { > >> .tv64 = KTIME_MAX - > >> (apic->timer.last_update).tv64}; } > >> ), now); > >> apic_debug("time elapsed\n"); > >> } else > >> passed = ktime_sub(now, apic->timer.last_update); > >> > > > > And now apic timer base is hr_timer with CLOCK_MONOTONIC, and get_time() is > > really ktime_get() which is almost impossible to wrap around. If it's > > overflow, at least we need a warning. I think this piece of code due to clock > > source change. > > > > So I doubt: due to some reason, now <= apic->timer.last_update, which cause a > > big wrap around operation. > > > > And the most suspect: > > > > > >> void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec) > >> { > >> struct kvm_lapic *apic = vcpu->arch.apic; > >> > >> if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec) > >> apic->timer.last_update = ktime_add_ns( > >> apic->timer.last_update, > >> apic->timer.period); > >> } > >> > > > > Not sure what's happening, have you tried this? (In fact, I am little willing > > to replace all apic->timer.dev.base->get_time() with more explicit ktime_get() > > as in pit.) > > > > Yes, this code is the culprit. Using that patch of yours now is always > > last_update. I'd still like to see that while loop go away ;-). (I would say all in one mail, but it always comes a little later... :) ) I think at least part of comment for the while loop should be kept...
On Sat, Jan 10, 2009 at 07:21:13PM +0800, Sheng Yang wrote: > On Fri, Jan 09, 2009 at 01:57:30PM +0100, Alexander Graf wrote: > > Sheng Yang wrote: > > > On Friday 09 January 2009 00:36:06 Alexander Graf wrote: > > > > > >> While booting Linux in VMware ESX, I encountered a strange effect > > >> in the in-kernel lapic implementation: time went backwards! > > >> > > >> While this should never occur, because of that the while loop that > > >> is done after the invalid calculations caused my host system to hang. > > >> > > >> In order to make debugging easier, let's replace this as suggested > > >> with a modulo function and not run into the danger of looping forever. > > >> > > >> To replace the nice hint this bug gave me that the values are broken, > > >> I added a printk message so people encountering this can at least > > >> see that something is fishy. > > >> > > >> Of course, the real issue needs to be fixed as well! I'm open to ideas > > >> why now < last_update! > > >> > > >> (Thanks to Kevin for his help in debugging this) > > >> > > >> Signed-off-by: Alexander Graf <agraf@suse.de> > > >> Signed-off-by: Kevin Wolf <kwolf@suse.de> > > >> --- > > >> > > > > > > Hi Alexander > > > > > > I'm a little suspect here: > > > > > > > > >> if (unlikely(ktime_to_ns(now) <= > > >> ktime_to_ns(apic->timer.last_update))) { > > >> /* Wrap around */ > > >> passed = ktime_add(( { > > >> (ktime_t) { > > >> .tv64 = KTIME_MAX - > > >> (apic->timer.last_update).tv64}; } > > >> ), now); > > >> apic_debug("time elapsed\n"); > > >> } else > > >> passed = ktime_sub(now, apic->timer.last_update); > > >> > > > > > > And now apic timer base is hr_timer with CLOCK_MONOTONIC, and get_time() is > > > really ktime_get() which is almost impossible to wrap around. If it's > > > overflow, at least we need a warning. I think this piece of code due to clock > > > source change. > > > > > > So I doubt: due to some reason, now <= apic->timer.last_update, which cause a > > > big wrap around operation. > > > > > > And the most suspect: > > > > > > > > >> void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec) > > >> { > > >> struct kvm_lapic *apic = vcpu->arch.apic; > > >> > > >> if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec) > > >> apic->timer.last_update = ktime_add_ns( > > >> apic->timer.last_update, > > >> apic->timer.period); > > >> } > > >> > > > > > > Not sure what's happening, have you tried this? (In fact, I am little willing > > > to replace all apic->timer.dev.base->get_time() with more explicit ktime_get() > > > as in pit.) > > > > > > > Yes, this code is the culprit. Using that patch of yours now is always > > > last_update. I'd still like to see that while loop go away ;-). Sheng, Will separate the discussions on items: 1) On the current code, last_update is assigned by kvm_apic_timer_intr_post as the expiration time of the next interrupt, while apic_get_tmcct interprets it as "countdown start": passed = ktime_sub(now, apic->timer.last_update); This explains why your first patch against kvm_apic_timer_intr_post fixed the bug Alexander was experiencing. BTW, yes, the "overflow" handling is not necessary as you noted. 2) The current count calculation is not right (apart from any eventual bugs), or I don't understand it. It uses the interrupt injection time as "countdown start" (or rearm in Linux terminology), in kvm_apic_timer_intr_post. But the host hrtimer starts counting once rearmed (assuming periodic configuration). So the inaccuracy of APIC_TMCCT depends on this delay between hrtimer rearm and interrupt injection. What is this scheme supposed to achieve? It seems much simpler, and more accurate, to return scaled hrtimer_get_remaining on APIC_TMCCT emulation. 3) And then there's interrupt reinjection. Once interrupts accumulate, and we reinject, current count reads become completly bogus. This is the reason for time drift on RHEL4-style kernels with "clock=tsc". The algorithm calculates the interrupt handling delay by using the PIT count (equivalent to APIC_TMCCT). But PIT count emulation uses the hrtimer expiration time, which has nothing to do with the accumulated interrupts. So what I propose is to first switch lapic current count emulation to a straightforward scaled hrtimer_get_remaining equivalent. For the reinjection case, maintain an average of the delay between host interrupt and interrupt injection (this can be generic, so both PIT and LAPIC timer can use it). Return that scaled average on APIC_TMCCT emulation whenever pending > 2. What you think? And as Avi noted, there should be a distinction between accumulated interrupts, we only want to reinject the ones caused by host load (and there are some interesting optimizations that can be done later, such as stopping hrtimer from ticking until guest acks one interrupt, and calculating the missed ones). But better start with the basic. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sunday 11 January 2009 12:55:58 Marcelo Tosatti wrote: > On Sat, Jan 10, 2009 at 07:21:13PM +0800, Sheng Yang wrote: > > On Fri, Jan 09, 2009 at 01:57:30PM +0100, Alexander Graf wrote: > > > Sheng Yang wrote: > > > > On Friday 09 January 2009 00:36:06 Alexander Graf wrote: > > > >> While booting Linux in VMware ESX, I encountered a strange effect > > > >> in the in-kernel lapic implementation: time went backwards! > > > >> > > > >> --- > > > > > > > > Hi Alexander > > > > > > > > I'm a little suspect here: > > > >> if (unlikely(ktime_to_ns(now) <= > > > >> ktime_to_ns(apic->timer.last_update))) { > > > >> /* Wrap around */ > > > >> passed = ktime_add(( { > > > >> (ktime_t) { > > > >> .tv64 = KTIME_MAX - > > > >> (apic->timer.last_update).tv64}; } > > > >> ), now); > > > >> apic_debug("time elapsed\n"); > > > >> } else > > > >> passed = ktime_sub(now, apic->timer.last_update); > > > > > > > > And now apic timer base is hr_timer with CLOCK_MONOTONIC, and > > > > get_time() is really ktime_get() which is almost impossible to wrap > > > > around. If it's overflow, at least we need a warning. I think this > > > > piece of code due to clock source change. > > > > > > > > So I doubt: due to some reason, now <= apic->timer.last_update, which > > > > cause a big wrap around operation. > > > > > > > > And the most suspect: > > > >> void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec) > > > >> { > > > >> struct kvm_lapic *apic = vcpu->arch.apic; > > > >> > > > >> if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec) > > > >> apic->timer.last_update = ktime_add_ns( > > > >> apic->timer.last_update, > > > >> apic->timer.period); > > > >> } > > > > > > > > Not sure what's happening, have you tried this? (In fact, I am little > > > > willing to replace all apic->timer.dev.base->get_time() with more > > > > explicit ktime_get() as in pit.) > > > > > > Yes, this code is the culprit. Using that patch of yours now is always > > > > last_update. I'd still like to see that while loop go away ;-). > > Sheng, > > Will separate the discussions on items: > > > 1) On the current code, last_update is assigned by kvm_apic_timer_intr_post > as the expiration time of the next interrupt, while apic_get_tmcct > interprets it as "countdown start": > > passed = ktime_sub(now, apic->timer.last_update); > > This explains why your first patch against kvm_apic_timer_intr_post > fixed the bug Alexander was experiencing. > > BTW, yes, the "overflow" handling is not necessary as you noted. > > 2) The current count calculation is not right (apart from any eventual > bugs), or I don't understand it. It uses the interrupt injection > time as "countdown start" (or rearm in Linux terminology), in > kvm_apic_timer_intr_post. > > But the host hrtimer starts counting once rearmed (assuming periodic > configuration). So the inaccuracy of APIC_TMCCT depends on this delay > between hrtimer rearm and interrupt injection. > > What is this scheme supposed to achieve? It seems much simpler, and > more accurate, to return scaled hrtimer_get_remaining on APIC_TMCCT > emulation. Marcelo, Sorry to late reply... After rechecked the code, got some thoughts. last_update can indicated rearm time as following: The original apic_timer_intr_post() got: > if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec) > apic->timer.last_update = ktime_add_ns( > apic->timer.last_update, > apic->timer.period); I think the purpose is let guest see a reasonable TMCCT. Which means: 1. Timer start to fire at start_apic_timer(). last_update=now(). 2. Every one LAPIC timer interrupt was injected into the guest, so the time at least elapsed timer.period(the first alarm set at "period" later), then last_update is increased by period. Notice last_update's base is set before timer fire, so it's not indicated the next one, but *the time this interrupt should be injected*(time can be delayed)... So last_update = n*period + base. 3. If there is pending interrupt, last_update won't be updated, so we have this: > while (counter_passed > tmcct) > counter_passed -= tmcct; > tmcct -= counter_passed; (notice tmcct is TMICT here.) And the returned tmcct value indicated (counter_passed % tmict), a offset regardless of pending interrupt numbers. I think now the overflow seems OK, but I am not sure why last_update can be bigger than ktime_get()? Maybe due to vcpu migration? Suspect some racy or boundary condition existed... And base on this, I don't think my quick fix is correct... > > 3) And then there's interrupt reinjection. Once interrupts accumulate, > and we reinject, current count reads become completly bogus. This is > the reason for time drift on RHEL4-style kernels with "clock=tsc". The > algorithm calculates the interrupt handling delay by using the PIT > count (equivalent to APIC_TMCCT). But PIT count emulation uses the > hrtimer expiration time, which has nothing to do with the accumulated > interrupts. > > So what I propose is to first switch lapic current count emulation to a > straightforward scaled hrtimer_get_remaining equivalent. > > For the reinjection case, maintain an average of the delay between host > interrupt and interrupt injection (this can be generic, so both PIT > and LAPIC timer can use it). Return that scaled average on APIC_TMCCT > emulation whenever pending > 2. > > What you think? 1. If we simply use ktime_get() to update last_update, the interval can't be same between different last_update. I think we may got some problem here, but not for sure. For example, three delayed interrupt was injected one after one, last_update would show very little interval, and APIC_TMCCT may got trouble. Maintain an average of the delay seems a little tricky here, and I am not sure if it would help the problem... Average is average at most, not the real affection at the time... 2. For the current mechanism, the interval is the same, so last_update always equal to n*period + base. If there are more than one pending interrupts, TMCCT should also can return the relative correct value. So I am leaning toward to fix current problem, though I haven't find the root cause yet... > And as Avi noted, there should be a distinction between accumulated > interrupts, we only want to reinject the ones caused by host load (and > there are some interesting optimizations that can be done later, such > as stopping hrtimer from ticking until guest acks one interrupt, and > calculating the missed ones). But better start with the basic. Agree...
On Tue, Jan 13, 2009 at 03:47:49PM +0800, Sheng Yang wrote: > On Sunday 11 January 2009 12:55:58 Marcelo Tosatti wrote: > The original apic_timer_intr_post() got: > > > if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec) > > apic->timer.last_update = ktime_add_ns( > > apic->timer.last_update, > > apic->timer.period); > > I think the purpose is let guest see a reasonable TMCCT. Which means: > > 1. Timer start to fire at start_apic_timer(). last_update=now(). > > 2. Every one LAPIC timer interrupt was injected into the guest, so the time at > least elapsed timer.period(the first alarm set at "period" later), then > last_update is increased by period. Notice last_update's base is set before > timer fire, so it's not indicated the next one, but *the time this interrupt > should be injected*(time can be delayed)... So last_update = n*period + base. Right. > 3. If there is pending interrupt, last_update won't be updated, so we have > this: > > > while (counter_passed > tmcct) > > counter_passed -= tmcct; > > tmcct -= counter_passed; > (notice tmcct is TMICT here.) > > And the returned tmcct value indicated (counter_passed % tmict), a offset > regardless of pending interrupt numbers. Right. The problem is that for accumulated interrupts, the guest will receive the interrupt as fast as the qemu process can be scheduled (as long as its not masked, of course). There could be higher priority vectors in the mix, but thats not the common case. So calculating the offset using last interrupt injection is not very reasonable in this case. > I think now the overflow seems OK, but I am not sure why last_update can be > bigger than ktime_get()? Maybe due to vcpu migration? Suspect some racy or > boundary condition existed... Yep, it seems to be some small inaccuracy in the accounting that causes it. > And base on this, I don't think my quick fix is correct... > > > > > 3) And then there's interrupt reinjection. Once interrupts accumulate, > > and we reinject, current count reads become completly bogus. This is > > the reason for time drift on RHEL4-style kernels with "clock=tsc". The > > algorithm calculates the interrupt handling delay by using the PIT > > count (equivalent to APIC_TMCCT). But PIT count emulation uses the > > hrtimer expiration time, which has nothing to do with the accumulated > > interrupts. > > > > So what I propose is to first switch lapic current count emulation to a > > straightforward scaled hrtimer_get_remaining equivalent. > > > > For the reinjection case, maintain an average of the delay between host > > interrupt and interrupt injection (this can be generic, so both PIT > > and LAPIC timer can use it). Return that scaled average on APIC_TMCCT > > emulation whenever pending > 2. > > > > What you think? > > 1. If we simply use ktime_get() to update last_update, the interval can't be > same between different last_update. I think we may got some problem here, but > not for sure. For example, three delayed interrupt was injected one after one, > last_update would show very little interval, and APIC_TMCCT may got trouble. I mean get rid of last_update and use hrtimer_get_remaining on apic_get_tmcct for no pending interrupts. And estimated delay between timer fire and injection if there are pending interrupts. > Maintain an average of the delay seems a little tricky here, and I am not sure > if it would help the problem... Average is average at most, not the real > affection at the time... Yes. Probably using the last delay between timer fire and injection is more accurate than an average. > > 2. For the current mechanism, the interval is the same, so last_update always > equal to n*period + base. If there are more than one pending interrupts, TMCCT > should also can return the relative correct value. > > So I am leaning toward to fix current problem, though I haven't find the root > cause yet... I think it can be simpler, without the need to deal with overflow at all. Understanding the present root cause is important though. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wednesday 14 January 2009 06:01:17 Marcelo Tosatti wrote: > On Tue, Jan 13, 2009 at 03:47:49PM +0800, Sheng Yang wrote: > > On Sunday 11 January 2009 12:55:58 Marcelo Tosatti wrote: > > > > The original apic_timer_intr_post() got: > > > if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec) > > > apic->timer.last_update = ktime_add_ns( > > > apic->timer.last_update, > > > apic->timer.period); > > > > I think the purpose is let guest see a reasonable TMCCT. Which means: > > > > 1. Timer start to fire at start_apic_timer(). last_update=now(). > > > > 2. Every one LAPIC timer interrupt was injected into the guest, so the > > time at least elapsed timer.period(the first alarm set at "period" > > later), then last_update is increased by period. Notice last_update's > > base is set before timer fire, so it's not indicated the next one, but > > *the time this interrupt should be injected*(time can be delayed)... So > > last_update = n*period + base. > > Right. > > > 3. If there is pending interrupt, last_update won't be updated, so we > > have > > > > this: > > > while (counter_passed > tmcct) > > > counter_passed -= tmcct; > > > tmcct -= counter_passed; > > > > (notice tmcct is TMICT here.) > > > > And the returned tmcct value indicated (counter_passed % tmict), a offset > > regardless of pending interrupt numbers. > > Right. The problem is that for accumulated interrupts, the guest will > receive the interrupt as fast as the qemu process can be scheduled (as > long as its not masked, of course). There could be higher priority > vectors in the mix, but thats not the common case. > > So calculating the offset using last interrupt injection is not very > reasonable in this case. Um... I think it's not easy to find a complete reasonable result here, I haven't look into how OS us TMCCT, maybe it can help us to find a more reasonable result here. > > I think now the overflow seems OK, but I am not sure why last_update can > > be bigger than ktime_get()? Maybe due to vcpu migration? Suspect some > > racy or boundary condition existed... > > Yep, it seems to be some small inaccuracy in the accounting that > causes it. > > > And base on this, I don't think my quick fix is correct... > > > > > 3) And then there's interrupt reinjection. Once interrupts accumulate, > > > and we reinject, current count reads become completly bogus. This is > > > the reason for time drift on RHEL4-style kernels with "clock=tsc". The > > > algorithm calculates the interrupt handling delay by using the PIT > > > count (equivalent to APIC_TMCCT). But PIT count emulation uses the > > > hrtimer expiration time, which has nothing to do with the accumulated > > > interrupts. > > > > > > So what I propose is to first switch lapic current count emulation to a > > > straightforward scaled hrtimer_get_remaining equivalent. > > > > > > For the reinjection case, maintain an average of the delay between host > > > interrupt and interrupt injection (this can be generic, so both PIT > > > and LAPIC timer can use it). Return that scaled average on APIC_TMCCT > > > emulation whenever pending > 2. > > > > > > What you think? > > > > 1. If we simply use ktime_get() to update last_update, the interval can't > > be same between different last_update. I think we may got some problem > > here, but not for sure. For example, three delayed interrupt was injected > > one after one, last_update would show very little interval, and > > APIC_TMCCT may got trouble. > > I mean get rid of last_update and use hrtimer_get_remaining on > apic_get_tmcct for no pending interrupts. And estimated delay between > timer fire and injection if there are pending interrupts. > > > Maintain an average of the delay seems a little tricky here, and I am not > > sure if it would help the problem... Average is average at most, not the > > real affection at the time... > > Yes. Probably using the last delay between timer fire and injection is > more accurate than an average. > > > 2. For the current mechanism, the interval is the same, so last_update > > always equal to n*period + base. If there are more than one pending > > interrupts, TMCCT should also can return the relative correct value. > > > > So I am leaning toward to fix current problem, though I haven't find the > > root cause yet... > > I think it can be simpler, without the need to deal with overflow at > all. > > Understanding the present root cause is important though. Yes, but still don't have many clues now... I would look through the code and purpose some patches for Alex to test...
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index afac68c..414e7e0 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1115,9 +1115,7 @@ void kvm_apic_timer_intr_post(struct kvm_vcpu *vcpu, int vec) struct kvm_lapic *apic = vcpu->arch.apic; if (apic && apic_lvt_vector(apic, APIC_LVTT) == vec) - apic->timer.last_update = ktime_add_ns( - apic->timer.last_update, - apic->timer.period); + apic->timer.last_update = apic->timer.dev.base->get_time(); }