Message ID | 20211009021236.4122790-12-seanjc@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | KVM: Halt-polling and x86 APICv overhaul |
On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote: > Invoke the arch hooks for block+unblock if and only if KVM actually > attempts to block the vCPU. The only non-nop implementation is on x86, > specifically SVM's AVIC, and there is no need to put the AVIC prior to > halt-polling as KVM x86's kvm_vcpu_has_events() will scour the full vIRR > to find pending IRQs regardless of whether the AVIC is loaded/"running". > > The primary motivation is to allow future cleanup to split out "block" > from "halt", but this is also likely a small performance boost on x86 SVM > when halt-polling is successful. > > Adjust the post-block path to update "cur" after unblocking, i.e. include > AVIC load time in halt_wait_ns and halt_wait_hist, so that the behavior > is consistent. Moving just the pre-block arch hook would result in only > the AVIC put latency being included in the halt_wait stats. There is no > obvious evidence that one way or the other is correct, so just ensure KVM > is consistent. > > Note, x86 has two separate paths for handling APICv with respect to vCPU > blocking. VMX uses hooks in x86's vcpu_block(), while SVM uses the arch > hooks in kvm_vcpu_block(). Prior to this path, the two paths were more > or less functionally identical. That is very much not the case after > this patch, as the hooks used by VMX _must_ fire before halt-polling. > x86's entire mess will be cleaned up in future patches. > > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > virt/kvm/kvm_main.c | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index f90b3ed05628..227f6bbe0716 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -3235,8 +3235,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu) > bool waited = false; > u64 block_ns; > > - kvm_arch_vcpu_blocking(vcpu); > - > start = cur = poll_end = ktime_get(); > if (do_halt_poll) { > ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns); > @@ -3253,6 +3251,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu) > } while (kvm_vcpu_can_poll(cur, stop)); > } > > + kvm_arch_vcpu_blocking(vcpu); > > prepare_to_rcuwait(wait); > for (;;) { > @@ -3265,6 +3264,9 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu) > schedule(); > } > finish_rcuwait(wait); > + > + kvm_arch_vcpu_unblocking(vcpu); > + > cur = ktime_get(); > if (waited) { > vcpu->stat.generic.halt_wait_ns += > @@ -3273,7 +3275,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu) > ktime_to_ns(cur) - ktime_to_ns(poll_end)); > } > out: > - kvm_arch_vcpu_unblocking(vcpu); > block_ns = ktime_to_ns(cur) - ktime_to_ns(start); > > /* Makes sense. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Best regards, Maxim Levitsky
On Wed, 2021-10-27 at 16:40 +0300, Maxim Levitsky wrote: > On Fri, 2021-10-08 at 19:12 -0700, Sean Christopherson wrote: > > Invoke the arch hooks for block+unblock if and only if KVM actually > > attempts to block the vCPU. The only non-nop implementation is on x86, > > specifically SVM's AVIC, and there is no need to put the AVIC prior to > > halt-polling as KVM x86's kvm_vcpu_has_events() will scour the full vIRR > > to find pending IRQs regardless of whether the AVIC is loaded/"running". > > > > The primary motivation is to allow future cleanup to split out "block" > > from "halt", but this is also likely a small performance boost on x86 SVM > > when halt-polling is successful. > > > > Adjust the post-block path to update "cur" after unblocking, i.e. include > > AVIC load time in halt_wait_ns and halt_wait_hist, so that the behavior > > is consistent. Moving just the pre-block arch hook would result in only > > the AVIC put latency being included in the halt_wait stats. There is no > > obvious evidence that one way or the other is correct, so just ensure KVM > > is consistent. > > > > Note, x86 has two separate paths for handling APICv with respect to vCPU > > blocking. VMX uses hooks in x86's vcpu_block(), while SVM uses the arch > > hooks in kvm_vcpu_block(). Prior to this path, the two paths were more > > or less functionally identical. That is very much not the case after > > this patch, as the hooks used by VMX _must_ fire before halt-polling. > > x86's entire mess will be cleaned up in future patches. > > > > Signed-off-by: Sean Christopherson <seanjc@google.com> > > --- > > virt/kvm/kvm_main.c | 7 ++++--- > > 1 file changed, 4 insertions(+), 3 deletions(-) > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > index f90b3ed05628..227f6bbe0716 100644 > > --- a/virt/kvm/kvm_main.c > > +++ b/virt/kvm/kvm_main.c > > @@ -3235,8 +3235,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu) > > bool waited = false; > > u64 block_ns; > > > > - kvm_arch_vcpu_blocking(vcpu); > > - > > start = cur = poll_end = ktime_get(); > > if (do_halt_poll) { > > ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns); > > @@ -3253,6 +3251,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu) > > } while (kvm_vcpu_can_poll(cur, stop)); > > } > > > > + kvm_arch_vcpu_blocking(vcpu); > > > > prepare_to_rcuwait(wait); > > for (;;) { > > @@ -3265,6 +3264,9 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu) > > schedule(); > > } > > finish_rcuwait(wait); > > + > > + kvm_arch_vcpu_unblocking(vcpu); > > + > > cur = ktime_get(); > > if (waited) { > > vcpu->stat.generic.halt_wait_ns += > > @@ -3273,7 +3275,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu) > > ktime_to_ns(cur) - ktime_to_ns(poll_end)); > > } > > out: > > - kvm_arch_vcpu_unblocking(vcpu); > > block_ns = ktime_to_ns(cur) - ktime_to_ns(start); > > > > /* > > Makes sense. > > Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> > > Best regards, > Maxim Levitsky So... Last week I decided to study a bit how AVIC behaves when vCPUs are not 100% running (aka no cpu_pm=on), to mostly understand their so-called 'GA log' thing. (This thing is that when you tell the IOMMU that a vCPU is not running, the IOMMU starts logging all incoming passed-through interrupts to a ring buffer, and raises its own interrupt, which’s handler is supposed to wake up the VM's vCPU.) 
That led to me discovering that AMD's IOMMU is totally busted after a
suspend/resume cycle, fixing which took me a few days (and most of the time I
worried that it's some sort of a BIOS bug which nobody would fix, as the IOMMU
interrupt delivery was totally busted after resume; sometimes even a power
cycle didn't help to revive it - phew...). Luckily I did fix it, and the
patches are waiting for review upstream
(https://www.spinics.net/lists/kernel/msg4153488.html).

Another thing I discovered is that this patch series totally breaks my VMs
without cpu_pm=on. The whole series (I didn't yet bisect it) makes even my
fedora32 VM very laggy, almost unusable, and it only has one passed-through
device, a NIC.

If I apply though only the patch series up to this patch, my fedora VM seems
to work fine, but my windows VM still locks up hard when I run 'LatencyTop'
in it, which doesn't happen without this patch.

So far the symptoms I see are that on VCPU 0, ISR has a quite high interrupt
(0xe1 the last time I saw it), TPR and PPR are 0xe0 (although I have seen TPR
have different values), and IRR has plenty of interrupts with lower priority.
The VM seems to be stuck in this case, as if its EOI got lost or something is
preventing the IRQ handler from issuing EOI.

LatencyTop does install some form of a kernel driver which likely does meddle
with interrupts (maybe it sends lots of self IPIs?).

100% reproducible as soon as I start monitoring with LatencyTop.

Without this patch it works (or if halt polling is disabled), but I still did
manage to lock up the VM a few times, after a lot of random clicking/starting
up various apps while LatencyTop was running, etc.; but in this case, when I
dump the local APIC via qemu's hmp interface, the VM instantly revives, which
might be either the same bug amplified by this patch, or something else.
That was tested on the pure 5.15.0 kernel without any patches.

It is possible that this is a bug in LatencyTop that just got exposed by
different timing.

The windows VM does have a GPU and a few USB controllers passed to it, and
without them, in pure VM mode as I call it, LatencyTop seems to work.

Tomorrow I'll give it a more formal investigation.

Best regards,
	Maxim Levitsky
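For context, the "GA log" wakeup path described above is the notifier KVM registers with the AMD IOMMU driver (via amd_iommu_register_ga_log_notifier()). A simplified sketch of the 5.15-era handler in arch/x86/kvm/svm/avic.c, with tracing and error handling trimmed, looks roughly like this; treat it as an illustration of the mechanism rather than the exact upstream code:

/*
 * GA-log notifier sketch (simplified from the 5.15-era arch/x86/kvm/svm/avic.c).
 * The IOMMU raises its GA-log interrupt when a posted interrupt arrives for a
 * vCPU whose IsRunning bit is clear; this callback maps the GA tag back to the
 * VM/vCPU and wakes it, since the IOMMU has already set the bit in the vIRR.
 */
static int avic_ga_log_notifier(u32 ga_tag)
{
	struct kvm_svm *kvm_svm;
	struct kvm_vcpu *vcpu = NULL;
	u32 vm_id = AVIC_GATAG_TO_VMID(ga_tag);
	u32 vcpu_id = AVIC_GATAG_TO_VCPUID(ga_tag);
	unsigned long flags;

	spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
	hash_for_each_possible(svm_vm_data_hash, kvm_svm, hnode, vm_id) {
		if (kvm_svm->avic_vm_id != vm_id)
			continue;
		vcpu = kvm_get_vcpu_by_id(&kvm_svm->kvm, vcpu_id);
		break;
	}
	spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags);

	/* The IOMMU has already set the IRR bit; just wake the target vCPU. */
	if (vcpu)
		kvm_vcpu_wake_up(vcpu);

	return 0;
}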
On Mon, Nov 29, 2021, Maxim Levitsky wrote:
> (This thing is that when you tell the IOMMU that a vCPU is not running,
> Another thing I discovered that this patch series totally breaks my VMs,
> without cpu_pm=on The whole series (I didn't yet bisect it) makes even my
> fedora32 VM be very laggy, almost unusable, and it only has one
> passed-through device, a nic).

Grrrr, the complete lack of comments in the KVM code and the separate paths
for VMX vs SVM when handling HLT with APICv make this all way more difficult
to understand than it should be.

The hangs are likely due to:

  KVM: SVM: Unconditionally mark AVIC as running on vCPU load (with APICv)

If a posted interrupt arrives after KVM has done its final search through the
vIRR, but before avic_update_iommu_vcpu_affinity() is called, the posted
interrupt will be set in the vIRR without triggering a host IRQ to wake the
vCPU via the GA log.

I.e. KVM is missing an equivalent to VMX's posted interrupt check for an
outstanding notification after switching to the wakeup vector.

For now, the least awful approach is sadly to keep the vcpu_(un)blocking()
hooks. Unlike VMX's PI support, there's no fast check for an interrupt being
posted (KVM would have to rewalk the vIRR), no easy way to signal the current
CPU to do wakeup (I don't think KVM even has access to the IRQ used by the
owning IOMMU), and there's no simplification of load/put code.

If the scheduler were changed to support waking in the sched_out path, then
I'd be more inclined to handle this in avic_vcpu_put() by rewalking the vIRR
one final time, but for now it's not worth it.

> If I apply though only the patch series up to this patch, my fedora VM seems
> to work fine, but my windows VM still locks up hard when I run 'LatencyTop'
> in it, which doesn't happen without this patch.

By "run 'LatencyTop' in it", do you mean running something in the Windows
guest? The only search results I can find for LatencyTop are Linux specific.

> So far the symptoms I see is that on VCPU 0, ISR has quite high interrupt
> (0xe1 last time I seen it), TPR and PPR are 0xe0 (although I have seen TPR to
> have different values), and IRR has plenty of interrupts with lower priority.
> The VM seems to be stuck in this case. As if its EOI got lost or something is
> preventing the IRQ handler from issuing EOI.
>
> LatencyTop does install some form of a kernel driver which likely does meddle
> with interrupts (maybe it sends lots of self IPIs?).
>
> 100% reproducible as soon as I start monitoring with LatencyTop.
>
> Without this patch it works (or if disabling halt polling),

Huh. I assume everything works if you disable halt polling _without_ this
patch applied?

If so, that implies that successful halt polling without mucking with vCPU
IOMMU affinity is somehow problematic. I can't think of any relevant side
effects other than timing.
On 11/29/21 18:25, Sean Christopherson wrote:
> If a posted interrupt arrives after KVM has done its final search through the vIRR,
> but before avic_update_iommu_vcpu_affinity() is called, the posted interrupt will
> be set in the vIRR without triggering a host IRQ to wake the vCPU via the GA log.
>
> I.e. KVM is missing an equivalent to VMX's posted interrupt check for an outstanding
> notification after switching to the wakeup vector.

BTW Maxim reported that it can break even without assigned devices.

> For now, the least awful approach is sadly to keep the vcpu_(un)blocking() hooks.

I agree that the hooks cannot be dropped, but the bug is reproducible with
this patch, where the hooks are still there.

With the hooks in place, you have:

    kvm_vcpu_blocking(vcpu)
      avic_set_running(vcpu, false)
        avic_vcpu_put(vcpu)
          avic_update_iommu_vcpu_affinity()
          WRITE_ONCE(...) // clear IS_RUNNING bit

    set_current_state()
    smp_mb()
    kvm_vcpu_check_block()
      return kvm_arch_vcpu_runnable() || ...
        return kvm_vcpu_has_events() || ...
          return kvm_cpu_has_interrupt() || ...
            return kvm_apic_has_interrupt() || ...
              return apic_has_interrupt_for_ppr()
                apic_find_highest_irr()
                  scan vIRR

This covers the barrier between the write of is_running and the read of vIRR,
and the other side should be correct as well. In particular, reads of
is_running always come after an atomic write to vIRR, and hence after an
implicit full memory barrier. svm_deliver_avic_intr() has an
smp_mb__after_atomic() after writing IRR; avic_kick_target_vcpus() even has an
explicit barrier in srcu_read_lock(), between the microcode's write to vIRR
and its own call to avic_vcpu_is_running().

Still it does seem to be a race that happens when IS_RUNNING=true but
vcpu->mode == OUTSIDE_GUEST_MODE. This patch makes the race easier to trigger
because it moves IS_RUNNING=false later.

Paolo
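For reference, the delivery side Paolo is describing looks roughly like this in the 5.15-era arch/x86/kvm/svm/avic.c (a simplified sketch from memory; the doorbell MSR name and minor details may differ):

/*
 * Sketch of the AVIC delivery path (5.15-era, simplified). The vIRR write is
 * an atomic op followed by smp_mb__after_atomic(), so the IsRunning read below
 * cannot be ordered before the vIRR update -- the "other side" of the barrier
 * pairing laid out above.
 */
int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
{
	if (!vcpu->arch.apicv_active)
		return -1;

	/* Post the interrupt in the vIRR of the vAPIC backing page. */
	kvm_lapic_set_irr(vec, vcpu->arch.apic);
	smp_mb__after_atomic();

	if (avic_vcpu_is_running(vcpu)) {
		/* Target is in guest mode: ring the doorbell on its pCPU. */
		int cpu = vcpu->cpu;

		if (cpu != get_cpu())
			wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpu));
		put_cpu();
	} else {
		/* Target is (or is about to be) blocked: wake it up. */
		kvm_vcpu_wake_up(vcpu);
	}

	return 0;
}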
On 11/29/21 18:25, Sean Christopherson wrote: >> If I apply though only the patch series up to this patch, my fedora VM seems >> to work fine, but my windows VM still locks up hard when I run 'LatencyTop' >> in it, which doesn't happen without this patch. > > Buy "run 'LatencyTop' in it", do you mean running something in the Windows guest? > The only search results I can find for LatencyTop are Linux specific. I think it's LatencyMon, https://www.resplendence.com/latencymon. Paolo
On Mon, Nov 29, 2021, Paolo Bonzini wrote: > On 11/29/21 18:25, Sean Christopherson wrote: > > If a posted interrupt arrives after KVM has done its final search through the vIRR, > > but before avic_update_iommu_vcpu_affinity() is called, the posted interrupt will > > be set in the vIRR without triggering a host IRQ to wake the vCPU via the GA log. > > > > I.e. KVM is missing an equivalent to VMX's posted interrupt check for an outstanding > > notification after switching to the wakeup vector. > > BTW Maxim reported that it can break even without assigned devices. > > > For now, the least awful approach is sadly to keep the vcpu_(un)blocking() hooks. > > I agree that the hooks cannot be dropped but the bug is reproducible with > this patch, where the hooks are still there. ... > Still it does seem to be a race that happens when IS_RUNNING=true but > vcpu->mode == OUTSIDE_GUEST_MODE. This patch makes the race easier to > trigger because it moves IS_RUNNING=false later. Oh! Any chance the bug only repros with preemption enabled? That would explain why I don't see problems, I'm pretty sure I've only run AVIC with a PREEMPT=n. svm_vcpu_{un}blocking() are called with preemption enabled, and avic_set_running() passes in vcpu->cpu. If the vCPU is preempted and scheduled in on a different CPU, avic_vcpu_load() will overwrite the vCPU's entry with the wrong CPU info.
On 11/29/21 19:55, Sean Christopherson wrote: >> Still it does seem to be a race that happens when IS_RUNNING=true but >> vcpu->mode == OUTSIDE_GUEST_MODE. This patch makes the race easier to >> trigger because it moves IS_RUNNING=false later. > > Oh! Any chance the bug only repros with preemption enabled? That would explain > why I don't see problems, I'm pretty sure I've only run AVIC with a PREEMPT=n. Me too. > svm_vcpu_{un}blocking() are called with preemption enabled, and avic_set_running() > passes in vcpu->cpu. If the vCPU is preempted and scheduled in on a different CPU, > avic_vcpu_load() will overwrite the vCPU's entry with the wrong CPU info. That would make a lot of sense. avic_vcpu_load() can handle svm->avic_is_running = false, but avic_set_running still needs its body wrapped by preempt_disable/preempt_enable. Fedora's kernel is CONFIG_PREEMPT_VOLUNTARY, but I know Maxim uses his own build so it would not surprise me if he used CONFIG_PREEMPT=y. Paolo
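A minimal sketch of the wrapping Paolo suggests, based on the 5.15-era avic_set_running()/avic_vcpu_load() signatures (not the literal fix that eventually landed upstream): get_cpu() disables preemption, so the vCPU can no longer migrate between reading vcpu->cpu and updating the physical ID table entry.

/*
 * Sketch only: wrap avic_set_running() so the CPU cannot change under us.
 * Based on the 5.15-era arch/x86/kvm/svm/avic.c; the real upstream fix may
 * differ in detail.
 */
static void avic_set_running(struct kvm_vcpu *vcpu, bool is_run)
{
	struct vcpu_svm *svm = to_svm(vcpu);
	int cpu = get_cpu();		/* disables preemption */

	WARN_ON(cpu != vcpu->cpu);
	svm->avic_is_running = is_run;

	if (is_run)
		avic_vcpu_load(vcpu, cpu);
	else
		avic_vcpu_put(vcpu);

	put_cpu();			/* re-enables preemption */
}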
On Mon, 2021-11-29 at 20:18 +0100, Paolo Bonzini wrote:
> On 11/29/21 19:55, Sean Christopherson wrote:
> > > Still it does seem to be a race that happens when IS_RUNNING=true but
> > > vcpu->mode == OUTSIDE_GUEST_MODE. This patch makes the race easier to
> > > trigger because it moves IS_RUNNING=false later.
> >
> > Oh! Any chance the bug only repros with preemption enabled? That would explain
> > why I don't see problems, I'm pretty sure I've only run AVIC with a PREEMPT=n.
>
> Me too.
>
> > svm_vcpu_{un}blocking() are called with preemption enabled, and avic_set_running()
> > passes in vcpu->cpu. If the vCPU is preempted and scheduled in on a different CPU,
> > avic_vcpu_load() will overwrite the vCPU's entry with the wrong CPU info.
>
> That would make a lot of sense. avic_vcpu_load() can handle
> svm->avic_is_running = false, but avic_set_running still needs its body
> wrapped by preempt_disable/preempt_enable.
>
> Fedora's kernel is CONFIG_PREEMPT_VOLUNTARY, but I know Maxim uses his
> own build so it would not surprise me if he used CONFIG_PREEMPT=y.
>
> Paolo

I will write all the details tomorrow, but I strongly suspect CPU erratum
#1235 (https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf).

Basically what I see is that:

1. vCPU2 disables is_running in the AVIC physical ID cache
2. vCPU2 checks that IRR is empty, and it is
3. vCPU2 does schedule();

and it keeps on sleeping forever. If I kick it via a signal (like just doing
the 'info registers' qemu hmp command, or just stop/cont on the same hmp
interface), the vCPU wakes up and notices that IRR suddenly is not empty, and
the VM comes back to life (and then hangs after a while again with the same
problem...).

As far as I see in the traces, the bit in IRR came from another vCPU which
didn't respect the is_running bit and didn't get an AVIC_INCOMPLETE_IPI
VMexit. I can't 100% prove it yet, but everything in the trace shows this.

About the rest of the environment, currently I reproduce this in a VM which
has no PCI passed-through devices at all, just AVIC. (I wasn't able to
reproduce it before just because I forgot to enable AVIC in this
configuration.)

So I also agree that Sean's patch is not to blame here, it just made the
window between setting is_running and getting to sleep shorter and made it
less likely that other vCPUs will pick up the is_running change. (I suspect
that they pick it up on the next vmrun, and otherwise the value is somehow
cached wrongly in them.)

A very performance-killing workaround of kicking all vCPUs when one of them
enters vcpu_block does seem to work for me, but it skews all the timing off
so I can't prove it.

That is all, I will write more detailed info, including some traces I have.

I do use windows 10 with the so-called LatencyMon in it, which shows overall
how much latency hardware interrupts have, which used to be useful for me to
ensure that my VMs are suitable for RT-like latency. (Even before I joined
RedHat, I tuned my VMs as much as I could to make my Rift CV1 VR headset work
well, which needs RT-like latencies. These days VR works fine in my VMs
anyway, but I still keep this tool to keep an eye on it.)

I really need to write a kvm unit test to stress test IPIs, especially this
case; I will do this very soon.

Wei Huang, any info on this would be very helpful.

Maybe putting the avic physical table in UC memory would help?
Maybe ringing doorbells of all other vcpus will help them notice the change?

Best regards,
	Maxim Levitsky
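For illustration, the "kick all vCPUs" workaround mentioned above could be as crude as the following hypothetical helper (not a posted patch): force every other vCPU out of guest mode when one vCPU starts blocking, so stale IsRunning state cannot be relied upon for long.

/*
 * Hypothetical sketch of the performance-killing workaround described above,
 * not a real patch: when a vCPU starts blocking, kick all other vCPUs so they
 * exit guest mode and re-read the physical APIC ID table (and thus the cleared
 * IsRunning bit) before sending further IPIs.
 */
static void avic_kick_all_on_block(struct kvm_vcpu *vcpu)
{
	struct kvm_vcpu *v;
	unsigned int i;

	kvm_for_each_vcpu(i, v, vcpu->kvm) {
		if (v != vcpu)
			kvm_vcpu_kick(v);
	}
}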
On Mon, 2021-11-29 at 18:55 +0100, Paolo Bonzini wrote: > On 11/29/21 18:25, Sean Christopherson wrote: > > > If I apply though only the patch series up to this patch, my fedora VM seems > > > to work fine, but my windows VM still locks up hard when I run 'LatencyTop' > > > in it, which doesn't happen without this patch. > > > > Buy "run 'LatencyTop' in it", do you mean running something in the Windows guest? > > The only search results I can find for LatencyTop are Linux specific. > > I think it's LatencyMon, https://www.resplendence.com/latencymon. > > Paolo > Yes. Best regards, Maxim Levitsky
On Tue, 2021-11-30 at 00:53 +0200, Maxim Levitsky wrote: > On Mon, 2021-11-29 at 20:18 +0100, Paolo Bonzini wrote: > > On 11/29/21 19:55, Sean Christopherson wrote: > > > > Still it does seem to be a race that happens when IS_RUNNING=true but > > > > vcpu->mode == OUTSIDE_GUEST_MODE. This patch makes the race easier to > > > > trigger because it moves IS_RUNNING=false later. > > > > > > Oh! Any chance the bug only repros with preemption enabled? That would explain > > > why I don't see problems, I'm pretty sure I've only run AVIC with a PREEMPT=n. > > > > Me too. > > > > > svm_vcpu_{un}blocking() are called with preemption enabled, and avic_set_running() > > > passes in vcpu->cpu. If the vCPU is preempted and scheduled in on a different CPU, > > > avic_vcpu_load() will overwrite the vCPU's entry with the wrong CPU info. > > > > That would make a lot of sense. avic_vcpu_load() can handle > > svm->avic_is_running = false, but avic_set_running still needs its body > > wrapped by preempt_disable/preempt_enable. > > > > Fedora's kernel is CONFIG_PREEMPT_VOLUNTARY, but I know Maxim uses his > > own build so it would not surprise me if he used CONFIG_PREEMPT=y. > > > > Paolo > > > > I will write ll the details tomorrow but I strongly suspect the CPU errata > https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf > #1235 > > Basically what I see that > > 1. vCPU2 disables is_running in avic physical id cache > 2. vCPU2 checks that IRR is empty and it is > 3. vCPU2 does schedule(); > > and it keeps on sleeping forever. If I kick it via signal > (like just doing 'info registers' qemu hmp command > or just stop/cont on the same hmp interface, the > vCPU wakes up and notices that IRR suddenly is not empty, > and the VM comes back to life (and then hangs after a while again > with the same problem....). > > As far as I see in the traces, the bit in IRR came from > another VCPU who didn't respect the ir_running bit and didn't get > AVIC_INCOMPLETE_IPI VMexit. > I can't 100% prove it yet, but everything in the trace shows this. > > About the rest of the environment, currently I reproduce this in > a VM which has no pci passed through devices at all, just AVIC. > (I wasn't able to reproduce it before just because I forgot to > enable AVIC in this configuration). > > So I also agree that Sean's patch is not to blame here, > it just made the window between setting is_running and getting to sleep > shorter and made it less likely that other vCPUs will pick up the is_running change. > (I suspect that they pick it up on next vmrun, and otherwise the value is somehow > cached wrongfully in them). > > A very performance killing workaround of kicking all vCPUs when one of them enters vcpu_block > does seem to work for me but it skews all the timing off so I can't prove it. > > That is all, I will write more detailed info, including some traces I have. > > I do use windows 10 with so called LatencyMon in it, which shows overall how > much latency hardware interrupts have, which used to be useful for me to > ensure that my VMs are suitable for RT like latency (even before I joined RedHat, > I tuned my VMs as much as I could to make my Rift CV1 VR headset work well which > needs RT like latencies. > > These days VR works fine in my VMs anyway, but I still kept this tool to keep an eye on it). > > I really need to write a kvm unit test to stress test IPIs, especially this case, > I will do this very soon. > > > Wei Huang, any info on this would be very helpful. 
>
> Maybe putting the avic physical table in UC memory would help?
> Maybe ringing doorbells of all other vcpus will help them notice the change?
>
> Best regards,
> Maxim Levitsky

Hi!

I am now almost sure that this is erratum #1235.

I had attached a kvm-unit-test I wrote (patch against master of
https://gitlab.com/kvm-unit-tests/kvm-unit-tests.git/) which is able to
reproduce the issue on a stock 5.15.0 kernel (*no patches applied at all*)
after just a few seconds, if KVM is loaded without halt polling (that is,
halt_poll_ns=0 is used).

Halt polling and/or Sean's patch are not to blame, they just change the
timing. With Sean's patch I don't need to disable halt polling.

I did find a few AVIC inhibition bugs that this test also finds, and to make
it work before I fix them, I added a workaround in the test to not hit them.
I'll send patches to fix those very soon. Note that in the Windows VM there
were no AVIC inhibitions, so those bugs are not relevant.

Wei Huang, do you know if this issue is fixed on Zen3, and if it is fixed on
some Zen2 machines? Any workarounds other than 'don't use avic'?

Best regards,
	Maxim Levitsky
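The attached unit test itself isn't reproduced here, but the shape of such an IPI stress test is roughly the following (an illustrative sketch only, NOT the attached patch; it assumes the usual kvm-unit-tests x86 helpers such as on_cpu_async(), handle_irq(), apic_icr_write() and eoi() as of late 2021, whose exact names and signatures may differ): one vCPU sleeps in HLT with interrupts enabled while another hammers it with fixed-vector IPIs; if an IPI is lost in the blocking window, the ack counter stops advancing and the sender spins forever.

/*
 * Illustrative kvm-unit-tests style sketch -- NOT the test attached to this
 * mail. CPU 1 halts in a loop; CPU 0 sends it IPIs and waits for each ack.
 * A lost IPI (the erratum scenario) shows up as a hang in the wait loop.
 */
#include "libcflat.h"
#include "apic.h"
#include "isr.h"
#include "smp.h"

#define IPI_VECTOR	0x80

static volatile u64 ipis_received;

static void ipi_handler(isr_regs_t *regs)
{
	ipis_received++;
	eoi();
}

static void halting_cpu(void *data)
{
	/* Sleep in HLT with interrupts enabled, forever. */
	for (;;)
		asm volatile("sti; hlt");
}

int main(void)
{
	u64 sent, acked = 0;

	handle_irq(IPI_VECTOR, ipi_handler);
	on_cpu_async(1, halting_cpu, NULL);

	for (sent = 0; sent < 1000000; sent++) {
		/* Fixed IPI to (physical) APIC ID 1. */
		apic_icr_write(APIC_INT_ASSERT | APIC_DEST_PHYSICAL |
			       APIC_DM_FIXED | IPI_VECTOR, 1);
		/* A hang here means the IPI was lost. */
		while (ipis_received == acked)
			;
		acked = ipis_received;
	}

	report(true, "IPI stress loop completed");
	return report_summary();
}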
On Thu, Dec 02, 2021, Maxim Levitsky wrote: > On Tue, 2021-11-30 at 00:53 +0200, Maxim Levitsky wrote: > > On Mon, 2021-11-29 at 20:18 +0100, Paolo Bonzini wrote: > > Basically what I see that > > > > 1. vCPU2 disables is_running in avic physical id cache > > 2. vCPU2 checks that IRR is empty and it is > > 3. vCPU2 does schedule(); > > > > and it keeps on sleeping forever. If I kick it via signal > > (like just doing 'info registers' qemu hmp command > > or just stop/cont on the same hmp interface, the > > vCPU wakes up and notices that IRR suddenly is not empty, > > and the VM comes back to life (and then hangs after a while again > > with the same problem....). > > > > As far as I see in the traces, the bit in IRR came from > > another VCPU who didn't respect the ir_running bit and didn't get > > AVIC_INCOMPLETE_IPI VMexit. > > I can't 100% prove it yet, but everything in the trace shows this. ... > I am now almost sure that this is errata #1235. > > I had attached a kvm-unit-test I wrote (patch against master of > https://gitlab.com/kvm-unit-tests/kvm-unit-tests.git/) which is able to > reproduce the issue on stock 5.15.0 kernel (*no patches applied at all*) > after just few seconds. If kvm is loaded without halt-polling (that is > halt_poll_ns=0 is used). > > Halt polling and/or Sean's patch are not to blame, it just changes timeing. > With Sean's patch I don't need to disable half polling. Hmm, that suggests the bug/erratum is due to the CPU consuming stale data from #4 for the IsRunning check in #5, or retiring uops for the IsRunning check before retiring the vIRR update. It would be helpful if the erratum actually provided info on the "highly specific and detailed set of internal timing conditions". :-/ 4. Lookup the vAPIC backing page address in the Physical APIC table using the guest physical APIC ID as an index into the table. 5. For every valid destination: - Atomically set the appropriate IRR bit in each of the destinations’ vAPIC backing page. - Check the IsRunning status of each destination.
On Mon, 2021-11-29 at 17:25 +0000, Sean Christopherson wrote: > On Mon, Nov 29, 2021, Maxim Levitsky wrote: > > (This thing is that when you tell the IOMMU that a vCPU is not running, > > Another thing I discovered that this patch series totally breaks my VMs, > > without cpu_pm=on The whole series (I didn't yet bisect it) makes even my > > fedora32 VM be very laggy, almost unusable, and it only has one > > passed-through device, a nic). > > Grrrr, the complete lack of comments in the KVM code and the separate paths for > VMX vs SVM when handling HLT with APICv make this all way for difficult to > understand than it should be. > > The hangs are likely due to: > > KVM: SVM: Unconditionally mark AVIC as running on vCPU load (with APICv) > > If a posted interrupt arrives after KVM has done its final search through the vIRR, > but before avic_update_iommu_vcpu_affinity() is called, the posted interrupt will > be set in the vIRR without triggering a host IRQ to wake the vCPU via the GA log. > > I.e. KVM is missing an equivalent to VMX's posted interrupt check for an outstanding > notification after switching to the wakeup vector. > > For now, the least awful approach is sadly to keep the vcpu_(un)blocking() hooks. > Unlike VMX's PI support, there's no fast check for an interrupt being posted (KVM > would have to rewalk the vIRR), no easy to signal the current CPU to do wakeup (I > don't think KVM even has access to the IRQ used by the owning IOMMU), and there's > no simplification of load/put code. I have an idea. Why do we even use/need the GA log? Why not, just disable the 'guest mode' in the iommu and let it sent good old normal interrupt when a vCPU is not running, just like we do when we inhibit the AVIC? GA log makes all devices that share an iommu (there are 4 iommus per package these days, some without useful devices) go through a single (!) msi like interrupt, which is even for some reason implemented by a threaded IRQ in the linux kernel. Best regards, Maxim Levitsky > > If the scheduler were changed to support waking in the sched_out path, then I'd be > more inclined to handle this in avic_vcpu_put() by rewalking the vIRR one final > time, but for now it's not worth it. > > > If I apply though only the patch series up to this patch, my fedora VM seems > > to work fine, but my windows VM still locks up hard when I run 'LatencyTop' > > in it, which doesn't happen without this patch. > > Buy "run 'LatencyTop' in it", do you mean running something in the Windows guest? > The only search results I can find for LatencyTop are Linux specific. > > > So far the symptoms I see is that on VCPU 0, ISR has quite high interrupt > > (0xe1 last time I seen it), TPR and PPR are 0xe0 (although I have seen TPR to > > have different values), and IRR has plenty of interrupts with lower priority. > > The VM seems to be stuck in this case. As if its EOI got lost or something is > > preventing the IRQ handler from issuing EOI. > > > > LatencyTop does install some form of a kernel driver which likely does meddle > > with interrupts (maybe it sends lots of self IPIs?). > > > > 100% reproducible as soon as I start monitoring with LatencyTop. > > > > Without this patch it works (or if disabling halt polling), > > Huh. I assume everything works if you disable halt polling _without_ this patch > applied? > > If so, that implies that successful halt polling without mucking with vCPU IOMMU > affinity is somehow problematic. I can't think of any relevant side effects other > than timing. >
On 12/2/21 03:00, Sean Christopherson wrote:
> Hmm, that suggests the bug/erratum is due to the CPU consuming stale data from #4
> for the IsRunning check in #5, or retiring uops for the IsRunning check before
> retiring the vIRR update.

Yes, this seems to be an error in the implementation of step 5. In assembly,
atomic operations have implicit memory barriers, but who knows what's going on
in microcode. So either it's the former, or something is going on that's
specific to the microcode sequencer, or it's a more mundane implementation bug.

In any case, AVIC is disabled for now and will need a list of models where it
works, so I'll go on and queue the first part of this series.

Paolo

> It would be helpful if the erratum actually provided
> info on the "highly specific and detailed set of internal timing conditions". :-/
>
> 4. Lookup the vAPIC backing page address in the Physical APIC table using the
>    guest physical APIC ID as an index into the table.
> 5. For every valid destination:
>    - Atomically set the appropriate IRR bit in each of the destinations’ vAPIC
>      backing page.
>    - Check the IsRunning status of each destination.
On Thu, 2021-12-02 at 12:20 +0200, Maxim Levitsky wrote:
> On Mon, 2021-11-29 at 17:25 +0000, Sean Christopherson wrote:
> > On Mon, Nov 29, 2021, Maxim Levitsky wrote:
> > > (This thing is that when you tell the IOMMU that a vCPU is not running,
> > > Another thing I discovered that this patch series totally breaks my VMs,
> > > without cpu_pm=on The whole series (I didn't yet bisect it) makes even my
> > > fedora32 VM be very laggy, almost unusable, and it only has one
> > > passed-through device, a nic).
> >
> > Grrrr, the complete lack of comments in the KVM code and the separate paths for
> > VMX vs SVM when handling HLT with APICv make this all way for difficult to
> > understand than it should be.
> >
> > The hangs are likely due to:
> >
> > KVM: SVM: Unconditionally mark AVIC as running on vCPU load (with APICv)
> >
> > If a posted interrupt arrives after KVM has done its final search through the vIRR,
> > but before avic_update_iommu_vcpu_affinity() is called, the posted interrupt will
> > be set in the vIRR without triggering a host IRQ to wake the vCPU via the GA log.
> >
> > I.e. KVM is missing an equivalent to VMX's posted interrupt check for an outstanding
> > notification after switching to the wakeup vector.
> >
> > For now, the least awful approach is sadly to keep the vcpu_(un)blocking() hooks.
> > Unlike VMX's PI support, there's no fast check for an interrupt being posted (KVM
> > would have to rewalk the vIRR), no easy to signal the current CPU to do wakeup (I
> > don't think KVM even has access to the IRQ used by the owning IOMMU), and there's
> > no simplification of load/put code.
>
> I have an idea.
>
> Why do we even use/need the GA log?
> Why not, just disable the 'guest mode' in the iommu and let it sent good old normal interrupt
> when a vCPU is not running, just like we do when we inhibit the AVIC?
>
> GA log makes all devices that share an iommu (there are 4 iommus per package these days,
> some without useful devices) go through a single (!) msi like interrupt,
> which is even for some reason implemented by a threaded IRQ in the linux kernel.

Yep, this gross hack works!

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 958966276d00b8..6136b94f6b5f5e 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -987,8 +987,9 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
 	WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
 
-	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id,
-					svm->avic_is_running);
+
+	svm_set_pi_irte_mode(vcpu, svm->avic_is_running);
+	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
 }
 
 void avic_vcpu_put(struct kvm_vcpu *vcpu)
@@ -997,8 +998,9 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	struct vcpu_svm *svm = to_svm(vcpu);
 
 	entry = READ_ONCE(*(svm->avic_physical_id_cache));
-	if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
-		avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
+	if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK) {
+		svm_set_pi_irte_mode(vcpu, false);
+	}
 
 	entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
 	WRITE_ONCE(*(svm->avic_physical_id_cache), entry);

GA log interrupts are almost gone (there are still a few because
svm_set_pi_irte_mode sets is_running to false), devices work as expected,
sending normal interrupts unless the guest is loaded; then normal interrupts
disappear, as expected.
Best regards, Maxim Levitsky > > Best regards, > Maxim Levitsky > > > If the scheduler were changed to support waking in the sched_out path, then I'd be > > more inclined to handle this in avic_vcpu_put() by rewalking the vIRR one final > > time, but for now it's not worth it. > > > > > If I apply though only the patch series up to this patch, my fedora VM seems > > > to work fine, but my windows VM still locks up hard when I run 'LatencyTop' > > > in it, which doesn't happen without this patch. > > > > Buy "run 'LatencyTop' in it", do you mean running something in the Windows guest? > > The only search results I can find for LatencyTop are Linux specific. > > > > > So far the symptoms I see is that on VCPU 0, ISR has quite high interrupt > > > (0xe1 last time I seen it), TPR and PPR are 0xe0 (although I have seen TPR to > > > have different values), and IRR has plenty of interrupts with lower priority. > > > The VM seems to be stuck in this case. As if its EOI got lost or something is > > > preventing the IRQ handler from issuing EOI. > > > > > > LatencyTop does install some form of a kernel driver which likely does meddle > > > with interrupts (maybe it sends lots of self IPIs?). > > > > > > 100% reproducible as soon as I start monitoring with LatencyTop. > > > > > > Without this patch it works (or if disabling halt polling), > > > > Huh. I assume everything works if you disable halt polling _without_ this patch > > applied? > > > > If so, that implies that successful halt polling without mucking with vCPU IOMMU > > affinity is somehow problematic. I can't think of any relevant side effects other > > than timing. > >
On Mon, 2021-11-29 at 17:25 +0000, Sean Christopherson wrote: > On Mon, Nov 29, 2021, Maxim Levitsky wrote: > > (This thing is that when you tell the IOMMU that a vCPU is not running, > > Another thing I discovered that this patch series totally breaks my VMs, > > without cpu_pm=on The whole series (I didn't yet bisect it) makes even my > > fedora32 VM be very laggy, almost unusable, and it only has one > > passed-through device, a nic). > > Grrrr, the complete lack of comments in the KVM code and the separate paths for > VMX vs SVM when handling HLT with APICv make this all way for difficult to > understand than it should be. > > The hangs are likely due to: > > KVM: SVM: Unconditionally mark AVIC as running on vCPU load (with APICv) Yes, the other hang I told about which makes all my VMs very laggy, almost impossible to use is because of the above patch, but since I reproduced it now again without any passed-through device, I also blame the cpu errata on this. Best regards, Maxim Levitsky > > If a posted interrupt arrives after KVM has done its final search through the vIRR, > but before avic_update_iommu_vcpu_affinity() is called, the posted interrupt will > be set in the vIRR without triggering a host IRQ to wake the vCPU via the GA log. > > I.e. KVM is missing an equivalent to VMX's posted interrupt check for an outstanding > notification after switching to the wakeup vector. > > For now, the least awful approach is sadly to keep the vcpu_(un)blocking() hooks. > Unlike VMX's PI support, there's no fast check for an interrupt being posted (KVM > would have to rewalk the vIRR), no easy to signal the current CPU to do wakeup (I > don't think KVM even has access to the IRQ used by the owning IOMMU), and there's > no simplification of load/put code. > > If the scheduler were changed to support waking in the sched_out path, then I'd be > more inclined to handle this in avic_vcpu_put() by rewalking the vIRR one final > time, but for now it's not worth it. > > > If I apply though only the patch series up to this patch, my fedora VM seems > > to work fine, but my windows VM still locks up hard when I run 'LatencyTop' > > in it, which doesn't happen without this patch. > > Buy "run 'LatencyTop' in it", do you mean running something in the Windows guest? > The only search results I can find for LatencyTop are Linux specific. > > > So far the symptoms I see is that on VCPU 0, ISR has quite high interrupt > > (0xe1 last time I seen it), TPR and PPR are 0xe0 (although I have seen TPR to > > have different values), and IRR has plenty of interrupts with lower priority. > > The VM seems to be stuck in this case. As if its EOI got lost or something is > > preventing the IRQ handler from issuing EOI. > > > > LatencyTop does install some form of a kernel driver which likely does meddle > > with interrupts (maybe it sends lots of self IPIs?). > > > > 100% reproducible as soon as I start monitoring with LatencyTop. > > > > Without this patch it works (or if disabling halt polling), > > Huh. I assume everything works if you disable halt polling _without_ this patch > applied? > > If so, that implies that successful halt polling without mucking with vCPU IOMMU > affinity is somehow problematic. I can't think of any relevant side effects other > than timing. >
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f90b3ed05628..227f6bbe0716 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3235,8 +3235,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 	bool waited = false;
 	u64 block_ns;
 
-	kvm_arch_vcpu_blocking(vcpu);
-
 	start = cur = poll_end = ktime_get();
 	if (do_halt_poll) {
 		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);
@@ -3253,6 +3251,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 		} while (kvm_vcpu_can_poll(cur, stop));
 	}
 
+	kvm_arch_vcpu_blocking(vcpu);
 
 	prepare_to_rcuwait(wait);
 	for (;;) {
@@ -3265,6 +3264,9 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 		schedule();
 	}
 	finish_rcuwait(wait);
+
+	kvm_arch_vcpu_unblocking(vcpu);
+
 	cur = ktime_get();
 	if (waited) {
 		vcpu->stat.generic.halt_wait_ns +=
@@ -3273,7 +3275,6 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 				ktime_to_ns(cur) - ktime_to_ns(poll_end));
 	}
 out:
-	kvm_arch_vcpu_unblocking(vcpu);
 	block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
 
 	/*
Invoke the arch hooks for block+unblock if and only if KVM actually attempts
to block the vCPU. The only non-nop implementation is on x86, specifically
SVM's AVIC, and there is no need to put the AVIC prior to halt-polling as KVM
x86's kvm_vcpu_has_events() will scour the full vIRR to find pending IRQs
regardless of whether the AVIC is loaded/"running".

The primary motivation is to allow future cleanup to split out "block" from
"halt", but this is also likely a small performance boost on x86 SVM when
halt-polling is successful.

Adjust the post-block path to update "cur" after unblocking, i.e. include
AVIC load time in halt_wait_ns and halt_wait_hist, so that the behavior is
consistent. Moving just the pre-block arch hook would result in only the
AVIC put latency being included in the halt_wait stats. There is no obvious
evidence that one way or the other is correct, so just ensure KVM is
consistent.

Note, x86 has two separate paths for handling APICv with respect to vCPU
blocking. VMX uses hooks in x86's vcpu_block(), while SVM uses the arch
hooks in kvm_vcpu_block(). Prior to this patch, the two paths were more or
less functionally identical. That is very much not the case after this
patch, as the hooks used by VMX _must_ fire before halt-polling. x86's
entire mess will be cleaned up in future patches.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/kvm_main.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)