Message ID | Z8ZBzEJ7--VWKdWd@google.com (mailing list archive) |
---|---|
State | New |
Series | QEMU's Hyper-V HV_X64_MSR_EOM is broken with split IRQCHIP |
Sean Christopherson <seanjc@google.com> writes:

> FYI, QEMU's Hyper-V emulation of HV_X64_MSR_EOM has been broken since QEMU commit
> c82d9d43ed ("KVM: Kick resamplefd for split kernel irqchip"), as nothing in KVM
> will forward the EOM notification to userspace.  I have no idea if anything in
> QEMU besides hyperv_testdev.c cares.

The only VMBus device in QEMU besides the testdev seems to be the Hyper-V
ballooning driver, Cc: Maciej to check whether it's a real problem for
it or not.

>
> The bug is reproducible by running the hyperv_connections KVM-Unit-Test with a
> split IRQCHIP.

Thanks, I can reproduce the problem too.

>
> Hacking QEMU and KVM (see KVM commit 654f1f13ea56 ("kvm: Check irqchip mode before
> assign irqfd")) as below gets the test to pass.  Assuming that's not a palatable
> solution, the other options I can think of would be for QEMU to intercept
> HV_X64_MSR_EOM when using a split IRQCHIP, or to modify KVM to do KVM_EXIT_HYPERV_SYNIC
> on writes to HV_X64_MSR_EOM with a split IRQCHIP.

AFAIR, the Hyper-V message interface is a fairly generic communication
mechanism which in theory can be used without interrupts at all: the
corresponding SINT can be masked and the guest can poll for messages,
processing them and then writing to HV_X64_MSR_EOM to trigger delivery
of the next queued message. To support this scenario on the backend, we
need to receive HV_X64_MSR_EOM writes regardless of whether the irqchip
is split or not. (In theory, we can get away without this by just
checking whether pending messages can be delivered upon each vCPU entry,
but this can take an undefined amount of time in some scenarios, so I
guess we're better off with notifications.)

>
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index c65b790433..820bc1692e 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -2261,10 +2261,9 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
>                   * the INTx slow path).
>                   */
>                  kvm_resample_fd_insert(virq, resample);
> -            } else {
> -                irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
> -                irqfd.resamplefd = rfd;
>              }
> +            irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
> +            irqfd.resamplefd = rfd;
>          } else if (!assign) {
>              if (kvm_irqchip_is_split()) {
>                  kvm_resample_fd_remove(virq);
>
> diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
> index 63f66c51975a..0bf85f89eb27 100644
> --- a/arch/x86/kvm/irq.c
> +++ b/arch/x86/kvm/irq.c
> @@ -166,9 +166,7 @@ void __kvm_migrate_timers(struct kvm_vcpu *vcpu)
>
>  bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args)
>  {
> -	bool resample = args->flags & KVM_IRQFD_FLAG_RESAMPLE;
> -
> -	return resample ? irqchip_kernel(kvm) : irqchip_in_kernel(kvm);
> +	return irqchip_in_kernel(kvm);
>  }
>
>  bool kvm_arch_irqchip_in_kernel(struct kvm *kvm)
>
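The polling scenario described above (SINT masked, guest consumes a message, frees the slot, then writes HV_X64_MSR_EOM so the next queued message can be delivered) can be sketched roughly as follows. This is only an illustrative model, not KVM or QEMU code: `struct sint_model`, `host_post()`, `host_on_eom()` and `guest_poll()` are hypothetical names, and the MSR write is represented by a plain function call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model (not KVM/QEMU code): one SynIC message slot plus a
 * small host-side ring of messages waiting for that slot. */
#define MSG_NONE  0u   /* slot is free */
#define QUEUE_LEN 8u

struct sint_model {
    uint32_t slot;              /* message type currently visible to the guest */
    uint32_t queue[QUEUE_LEN];  /* host-side pending messages */
    size_t head, count;
};

/* Host side: move the next queued message into the slot if it is free. */
static bool host_try_deliver(struct sint_model *s)
{
    if (s->slot != MSG_NONE || s->count == 0)
        return false;
    s->slot = s->queue[s->head];
    s->head = (s->head + 1) % QUEUE_LEN;
    s->count--;
    return true;
}

/* Host side: queue a message and attempt immediate delivery. */
static bool host_post(struct sint_model *s, uint32_t msg_type)
{
    assert(msg_type != MSG_NONE);
    if (s->count == QUEUE_LEN)
        return false;
    s->queue[(s->head + s->count) % QUEUE_LEN] = msg_type;
    s->count++;
    host_try_deliver(s);
    return true;
}

/* Host side: what a handler for a guest write to HV_X64_MSR_EOM would do
 * in this model. Without this notification the queue would stall. */
static void host_on_eom(struct sint_model *s)
{
    host_try_deliver(s);
}

/* Guest side, polling mode: consume the message, free the slot, signal EOM. */
static uint32_t guest_poll(struct sint_model *s)
{
    uint32_t msg = s->slot;

    if (msg == MSG_NONE)
        return MSG_NONE;
    s->slot = MSG_NONE;  /* the recipient clears the slot */
    host_on_eom(s);      /* stands in for the write to HV_X64_MSR_EOM */
    return msg;
}
```

In this model no interrupt is involved at any point, which is why the host has to learn about the EOM write by some other means for the second and later messages to arrive.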
On 4.03.2025 13:59, Vitaly Kuznetsov wrote:
> Sean Christopherson <seanjc@google.com> writes:
>
>> FYI, QEMU's Hyper-V emulation of HV_X64_MSR_EOM has been broken since QEMU commit
>> c82d9d43ed ("KVM: Kick resamplefd for split kernel irqchip"), as nothing in KVM
>> will forward the EOM notification to userspace.  I have no idea if anything in
>> QEMU besides hyperv_testdev.c cares.
>
> The only VMBus device in QEMU besides the testdev seems to be Hyper-V
> ballooning driver, Cc: Maciej to check whether it's a real problem for
> it or not.

I just did a quick check on a hv-balloon Windows 2019 setup that I had
on hand, and it seems to work with "kernel-irqchip=split" just as
correctly as without this option (which AFAIK means kernel-irqchip=on
for the q35 machine type).

So at least this Windows version and its VMBus client driver do not seem
to be affected by this issue.

I will try to look at this more closely later, as I am fairly busy right
now with QEMU live migration work before the upcoming code freeze.

Thanks,
Maciej
On Tue, Mar 04, 2025, Vitaly Kuznetsov wrote:
> Sean Christopherson <seanjc@google.com> writes:

[...]

> > Hacking QEMU and KVM (see KVM commit 654f1f13ea56 ("kvm: Check irqchip mode before
> > assign irqfd")) as below gets the test to pass.  Assuming that's not a palatable
> > solution, the other options I can think of would be for QEMU to intercept
> > HV_X64_MSR_EOM when using a split IRQCHIP, or to modify KVM to do KVM_EXIT_HYPERV_SYNIC
> > on writes to HV_X64_MSR_EOM with a split IRQCHIP.
>
> AFAIR, Hyper-V message interface is a fairly generic communication
> mechanism which in theory can be used without interrupts at all: the
> corresponding SINT can be masked and the guest can be polling for
> messages, processing them and then writing to HV_X64_MSR_EOM to trigger
> delivery of the next queued message. To support this scenario on the
> backend, we need to receive HV_X64_MSR_EOM writes regardless of whether
> irqchip is split or not. (In theory, we can get away without this by
> just checking if pending messages can be delivered upon each vCPU entry
> but this can take an undefined amount of time in some scenarios so I
> guess we're better off with notifications).

Before c82d9d43ed ("KVM: Kick resamplefd for split kernel irqchip"), and without
a split IRQCHIP, QEMU gets notified via eventfd.  On writes to HV_X64_MSR_EOM, KVM
invokes irq_acked(), i.e. irqfd_resampler_ack(), for all SINT routes.  The eventfd
signal gets back to sint_ack_handler(), which invokes msg_retry() to re-post the
message.

I.e. trapping HV_X64_MSR_EOM would be a slow path relative to what's there for
in-kernel IRQCHIP.
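The eventfd round trip described above (ack on the kernel side, handler plus retry on the userspace side) can be illustrated with a small standalone sketch built on Linux eventfd(2). It only mimics the shape of the path: `model_signal_ack()` stands in for what the ack notifier would do on an EOM write, `model_handle_ack()` for a handler in the role of sint_ack_handler(), and the `retry` callback for msg_retry(). All three names and signatures are hypothetical, and none of this is the actual KVM or QEMU code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the kernel side: signal the resample eventfd,
 * as an ack notifier would after a guest write to HV_X64_MSR_EOM. */
static int model_signal_ack(int resamplefd)
{
    uint64_t one = 1;

    return write(resamplefd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Hypothetical stand-in for the userspace side: drain the eventfd and run
 * the retry callback once if any ack was pending.
 * Returns 1 if the callback ran, 0 if nothing was pending, -1 on error. */
static int model_handle_ack(int resamplefd, void (*retry)(void *), void *opaque)
{
    uint64_t acks;
    ssize_t n = read(resamplefd, &acks, sizeof(acks));

    if (n < 0)
        return errno == EAGAIN ? 0 : -1;  /* EAGAIN: counter was zero */
    if (n != (ssize_t)sizeof(acks))
        return -1;
    retry(opaque);  /* plays the role of re-posting the pending message */
    return 1;
}

/* Example callback: count how many times a retry was requested. */
static void count_retry(void *opaque)
{
    ++*(int *)opaque;
}
```

With a non-blocking eventfd, several acks that arrive before the handler runs coalesce into a single read, so the retry callback runs once for the batch. The path never requires an exit to userspace on the vCPU thread, which is the sense in which trapping the MSR would be the slower option.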
Sean Christopherson <seanjc@google.com> writes:

> On Tue, Mar 04, 2025, Vitaly Kuznetsov wrote:

[...]

> Before c82d9d43ed ("KVM: Kick resamplefd for split kernel irqchip"), and without
> a split IRQCHIP, QEMU gets notified via eventfd.  On writes to HV_X64_MSR_EOM, KVM
> invokes irq_acked(), i.e. irqfd_resampler_ack(), for all SINT routes.  The eventfd
> signal gets back to sint_ack_handler(), which invokes msg_retry() to re-post the
> message.
>
> I.e. trapping HV_X64_MSR_EOM would be a slow path relative to what's there for
> in-kernel IRQCHIP.

My understanding is that the only type of message which requires fast
processing is STIMER messages, but we don't do stimers in userspace. I
guess it is possible to have a competing 'noisy neighbour' in userspace
draining message slots, but then we are slow anyway.
On Tue, 2025-03-04 at 15:46 +0100, Vitaly Kuznetsov wrote:
> Sean Christopherson <seanjc@google.com> writes:

[...]

> > I.e. trapping HV_X64_MSR_EOM would be a slow path relative to what's there for
> > in-kernel IRQCHIP.
>
> My understanding is that the only type of message which requires fast
> processing is STIMER messages but we don't do stimers in userspace. I
> guess it is possible to have a competing 'noisy neighbour' in userspace
> draining message slots but then we are slow anyway.
>

Hi,

AFAIK, HV_X64_MSR_EOM is only one of the ways for the guest to signal that it
has processed a SYNIC message.

The guest can also signal that it has finished processing a SYNIC message using
HV_X64_MSR_EOI or even by writing to the EOI local APIC register, and I actually
think that the latter is what is used by at least recent Windows.

Now KVM does intercept EOI, and it even "happens" to work with both APICv and AVIC:

APICv has an EOI 'exiting bitmap' and SYNIC interrupts are set there (see
vcpu_load_eoi_exitmap).

AVIC intercepts an EOI write iff the interrupt was level-triggered, and SYNIC
interrupts happen to be indeed level-triggered:

static int synic_set_irq(struct kvm_vcpu_hv_synic *synic, u32 sint)
...
	irq.shorthand = APIC_DEST_SELF;
	irq.dest_mode = APIC_DEST_PHYSICAL;
	irq.delivery_mode = APIC_DM_FIXED;
	irq.vector = vector;
	irq.level = 1;
...

Best regards,
	Maxim Levitsky
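The two interception conditions mentioned above can be written down as a toy decision model. It is only a restatement of the message in code form, under the assumption that the APICv side is a 256-bit per-vector bitmap and that the AVIC side depends solely on the trigger mode. The type and function names are hypothetical and do not correspond to KVM symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a 256-bit EOI-exit bitmap, one bit per vector. */
struct eoi_exit_bitmap {
    uint64_t bits[4];
};

/* Mark a vector (e.g. one used by a SynIC SINT) as requiring an EOI exit. */
static void eoi_bitmap_set(struct eoi_exit_bitmap *b, uint8_t vector)
{
    b->bits[vector / 64] |= UINT64_C(1) << (vector % 64);
}

/* APICv-style rule in this model: the EOI exits to the hypervisor iff the
 * bit for the vector being acknowledged is set. */
static bool apicv_eoi_exits(const struct eoi_exit_bitmap *b, uint8_t vector)
{
    return (b->bits[vector / 64] >> (vector % 64)) & 1;
}

/* AVIC-style rule as stated in the message: the EOI write is intercepted
 * iff the interrupt was level-triggered (SynIC sets irq.level = 1). */
static bool avic_eoi_intercepted(bool level_triggered)
{
    return level_triggered;
}
```

Either way the hypervisor sees the EOI for a SynIC vector, which is why the EOI-based acknowledgement keeps working while the EOM-based one does not.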
Maxim Levitsky <mlevitsk@redhat.com> writes:

> On Tue, 2025-03-04 at 15:46 +0100, Vitaly Kuznetsov wrote:

[...]

> Hi,
>
> AFAIK, HV_X64_MSR_EOM is only one of the ways for the guest to signal that it processed the SYNIC message.
>
> Guest can also signal that it finished processing a SYNIC message using HV_X64_MSR_EOI or even by writing to EOI
> local apic register, and I actually think that the latter is what is used by at least recent Windows.
>

Hyper-V SynIC has two distinct concepts: "messages" and "events". While
events are just flags (like interrupts), messages actually carry
information and the recipient is responsible for clearing the message
slot (there are only 16 of them per vCPU AFAIR). Strictly speaking,
HV_X64_MSR_EOM is optional and the hypervisor may deliver a new message
to an empty slot at any time. It may use EOI as a trigger, but note that
not every message delivery results in an interrupt, as e.g. a SINT can
be configured in 'polling' mode -- and that's when HV_X64_MSR_EOM comes
in handy.

>
> Now KVM does intercept EOI and it even "happens" to work with both APICv and AVIC:
>
> APICv has EOI 'exiting bitmap' and SYNIC interrupts are set there (see vcpu_load_eoi_exitmap).
>
> AVIC intercepts EOI write iff the interrupt was level-triggered and SYNIC interrupts happen
> to be indeed level-triggered:
>
> static int synic_set_irq(struct kvm_vcpu_hv_synic *synic, u32 sint)
> ...
> 	irq.shorthand = APIC_DEST_SELF;
> 	irq.dest_mode = APIC_DEST_PHYSICAL;
> 	irq.delivery_mode = APIC_DM_FIXED;
> 	irq.vector = vector;
> 	irq.level = 1;
> ...
>

Yeah, I think the problem here is specific to HV_X64_MSR_EOM.
Vitaly Kuznetsov <vkuznets@redhat.com> writes:

> Maxim Levitsky <mlevitsk@redhat.com> writes:

[...]

> Hyper-V SynIC has two distinct concepts: "messages" and "events". While
> events are just flags (like interrupts), messages actually carry
> information and the recipient is responsible for clearing message slot
> (there are only 16 of them per vCPU AFAIR). Strictly speaking,
> HV_X64_MSR_EOM is optional and hypervisor may deliver a new message to
> an empty slot at any time. It may use EOI as a trigger but note that
> not every message delivery results in an interrupt as e.g. SINT can be
> configured in 'polling' mode -- and that's when HV_X64_MSR_EOM comes
> in handy.
>

Thinking more about this, I believe we should not be using eventfd for
delivering writes to HV_X64_MSR_EOM to the VMM at all. Namely, writes to
HV_X64_MSR_EOM should not invoke irq_acked(): it should be valid for a
guest to poll for messages by writing to HV_X64_MSR_EOM within the
interrupt handler, drain the queue, and then write to the EOI register.
In this case, the VMM may want to distinguish between EOI and EOM. We
can do a separate eventfd, of course, but I think it's overkill and
'slow' processing will do just fine.
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index c65b790433..820bc1692e 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2261,10 +2261,9 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
                  * the INTx slow path).
                  */
                 kvm_resample_fd_insert(virq, resample);
-            } else {
-                irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
-                irqfd.resamplefd = rfd;
             }
+            irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
+            irqfd.resamplefd = rfd;
         } else if (!assign) {
             if (kvm_irqchip_is_split()) {
                 kvm_resample_fd_remove(virq);
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 63f66c51975a..0bf85f89eb27 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -166,9 +166,7 @@ void __kvm_migrate_timers(struct kvm_vcpu *vcpu)
 
 bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args)
 {
-	bool resample = args->flags & KVM_IRQFD_FLAG_RESAMPLE;
-
-	return resample ? irqchip_kernel(kvm) : irqchip_in_kernel(kvm);
+	return irqchip_in_kernel(kvm);
 }
 
 bool kvm_arch_irqchip_in_kernel(struct kvm *kvm)
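The effect of the kvm_arch_irqfd_allowed() hunk can be seen by modelling the predicate before and after the change. The sketch below is a standalone approximation, not kernel code: the enum and helper names are hypothetical, and it assumes that irqchip_kernel() is true only for a fully in-kernel irqchip while irqchip_in_kernel() is true for both the in-kernel and split modes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the three irqchip modes discussed in the thread. */
enum irqchip_mode { IRQCHIP_NONE, IRQCHIP_KERNEL, IRQCHIP_SPLIT };

/* Assumption: true for a full in-kernel irqchip and for a split irqchip. */
static bool model_irqchip_in_kernel(enum irqchip_mode m)
{
    return m != IRQCHIP_NONE;
}

/* Assumption: true only for a full in-kernel irqchip. */
static bool model_irqchip_kernel(enum irqchip_mode m)
{
    return m == IRQCHIP_KERNEL;
}

/* Predicate before the hack: resample irqfds need a full in-kernel irqchip. */
static bool irqfd_allowed_before(enum irqchip_mode m, bool resample)
{
    return resample ? model_irqchip_kernel(m) : model_irqchip_in_kernel(m);
}

/* Predicate after the hack: the resample flag no longer matters. */
static bool irqfd_allowed_after(enum irqchip_mode m, bool resample)
{
    (void)resample;
    return model_irqchip_in_kernel(m);
}
```

Under these assumptions the only combination whose result changes is a resample irqfd with a split irqchip, which is exactly the case the hyperv_connections test exercises.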