Message ID: 20181220054037.24320-2-peterx@redhat.com
State: New, archived
Series: q35: change defaults for kernel irqchip and IR
On Thu, Dec 20, 2018 at 01:40:35PM +0800, Peter Xu wrote:
> Starting from QEMU 4.0, let's specify "split" as the default value for
> kernel-irqchip.
>
> So for QEMU>=4.0 we'll have: allowed=Y,required=N,split=Y
>    for QEMU<=3.1 we'll have: allowed=Y,required=N,split=N
> (omitting all the "kernel_irqchip_" prefix)
>
> Note that this will let the default q35 machine type to depend on
> Linux version 4.4 or newer because that's where split irqchip is
> introduced in kernel.  But it's fine since we're boosting supported
> Linux version for QEMU 4.0 to around Linux 4.5.  For more information
> please refer to the discussion on AMD's RDTSCP:
>
> https://lore.kernel.org/lkml/20181210181328.GA762@zn.tnic/
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
On Thu, 20 Dec 2018 13:40:35 +0800
Peter Xu <peterx@redhat.com> wrote:

> Starting from QEMU 4.0, let's specify "split" as the default value for
> kernel-irqchip.
[...]

It looks like this broke INTx for vfio-pci, see:

https://bugs.launchpad.net/qemu/+bug/1826422

In my testing it looks like KVM advertises supporting the KVM_IRQFD
resample feature, but vfio never gets the unmask notification, so the
device remains with DisINTx set and no further interrupts are
generated.  Do we expect KVM's IRQFD with resampler to work in the
split IRQ mode?  We can certainly hope that "high performance" devices
use MSI or MSI-X, but this would be quite a performance regression with
split mode if our userspace bypass for INTx goes away.  Thanks,

Alex

PS - the least impact workaround for users is to re-enable kernel mode
irqchip with kernel_irqchip=on or <ioapic driver='kvm'/>.  We can also
force QEMU routing of INTx with the x-no-kvm-intx=on option per
vfio-pci device, but this disables the userspace bypass and brings
along higher interrupt latency and overhead.
On Fri, 26 Apr 2019 13:27:44 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> It looks like this broke INTx for vfio-pci, see:
>
> https://bugs.launchpad.net/qemu/+bug/1826422
[...]

arch/x86/kvm/lapic.c:kvm_ioapic_send_eoi() dumps to userspace before
kvm_ioapic_update_eoi() can handle the irq_ack_notifier_list via
kvm_notify_acked_gsi(), so it looks like KVM really ought to return an
error when trying to register a resample IRQFD when irqchip_split(),
but do we have better options?  EOI handling in QEMU is pretty much
non-existent and even if it was in place, it's a big performance
regression for vfio INTx handling.  Is a split irqchip that retains
performant resampling (EOI) support possible?  Thanks,

Alex
> > In my testing it looks like KVM advertises supporting the KVM_IRQFD
> > resample feature, but vfio never gets the unmask notification, [...]
>
> arch/x86/kvm/lapic.c:kvm_ioapic_send_eoi() dumps to userspace before
> kvm_ioapic_update_eoi() can handle the irq_ack_notifier_list via
> kvm_notify_acked_gsi(),

That wouldn't help because kvm_ioapic_update_eoi would not even be
able to access vcpu->kvm->arch.vioapic (it's NULL).

The following untested patch would signal the resamplefd in
kvm_ioapic_send_eoi, before requesting the exit to userspace.  However
I am not sure how QEMU sets up the VFIO eventfds: if I understand
correctly, when VFIO writes again to the irq eventfd, the interrupt
request would not reach the userspace IOAPIC, but only the in-kernel
LAPIC.  That would be incorrect, and if my understanding is correct we
need to trigger resampling from hw/intc/ioapic.c.  Something like:

1) VFIO uses kvm_irqchip_add_irqfd_notifier_gsi instead of KVM_IRQFD
   directly

2) add a NotifierList in kvm_irqchip_assign_irqfd

3) if kvm_irqchip_is_split(), register a corresponding Notifier in
   ioapic.c which stores the resamplefd as an EventNotifier*

4) ioapic_eoi_broadcast writes to the resamplefd

BTW, how do non-x86 platforms handle intx resampling?
Paolo

---- 8< -----
From: Paolo Bonzini <pbonzini@redhat.com>
Subject: [PATCH WIP] kvm: lapic: run ack notifiers for userspace IOAPIC routes

diff --git a/arch/x86/kvm/ioapic.h b/arch/x86/kvm/ioapic.h
index ea1a4e0297da..be337e06e3fd 100644
--- a/arch/x86/kvm/ioapic.h
+++ b/arch/x86/kvm/ioapic.h
@@ -135,4 +135,6 @@ void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu,
 			   ulong *ioapic_handled_vectors);
 void kvm_scan_ioapic_routes(struct kvm_vcpu *vcpu,
 			    ulong *ioapic_handled_vectors);
+void kvm_notify_userspace_ioapic_routes(struct kvm *kvm, int vector);
+
 #endif
diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
index 3cc3b2d130a0..46ea79a0dd8a 100644
--- a/arch/x86/kvm/irq_comm.c
+++ b/arch/x86/kvm/irq_comm.c
@@ -435,6 +435,34 @@ void kvm_scan_ioapic_routes(struct kvm_vcpu *vcpu,
 	srcu_read_unlock(&kvm->irq_srcu, idx);
 }
 
+void kvm_notify_userspace_ioapic_routes(struct kvm *kvm, int vector)
+{
+	struct kvm_kernel_irq_routing_entry *entry;
+	struct kvm_irq_routing_table *table;
+	u32 i, nr_ioapic_pins;
+	int idx;
+
+	idx = srcu_read_lock(&kvm->irq_srcu);
+	table = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
+	nr_ioapic_pins = min_t(u32, table->nr_rt_entries,
+			       kvm->arch.nr_reserved_ioapic_pins);
+
+	for (i = 0; i < nr_ioapic_pins; i++) {
+		hlist_for_each_entry(entry, &table->map[i], link) {
+			struct kvm_lapic_irq irq;
+
+			if (entry->type != KVM_IRQ_ROUTING_MSI)
+				continue;
+
+			kvm_set_msi_irq(kvm, entry, &irq);
+			if (irq.level && irq.vector == vector)
+				kvm_notify_acked_gsi(kvm, i);
+		}
+	}
+
+	srcu_read_unlock(&kvm->irq_srcu, idx);
+}
+
 void kvm_arch_irq_routing_update(struct kvm *kvm)
 {
 	kvm_hv_irq_routing_update(kvm);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 9f089e2e09d0..8f8e76d77925 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1142,6 +1142,7 @@ static bool kvm_ioapic_handles_vector(struct kvm_lapic *apic, int vector)
 
 static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
 {
+	struct kvm_vcpu *vcpu = apic->vcpu;
 	int trigger_mode;
 
 	/* Eoi the ioapic only if the ioapic doesn't own the vector. */
@@ -1149,9 +1150,10 @@ static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
 		return;
 
 	/* Request a KVM exit to inform the userspace IOAPIC. */
-	if (irqchip_split(apic->vcpu->kvm)) {
-		apic->vcpu->arch.pending_ioapic_eoi = vector;
-		kvm_make_request(KVM_REQ_IOAPIC_EOI_EXIT, apic->vcpu);
+	if (irqchip_split(vcpu->kvm)) {
+		kvm_notify_userspace_ioapic_routes(vcpu->kvm, vector);
+		vcpu->arch.pending_ioapic_eoi = vector;
+		kvm_make_request(KVM_REQ_IOAPIC_EOI_EXIT, vcpu);
 		return;
 	}
 
@@ -1160,7 +1162,7 @@ static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
 	else
 		trigger_mode = IOAPIC_EDGE_TRIG;
 
-	kvm_ioapic_update_eoi(apic->vcpu, vector, trigger_mode);
+	kvm_ioapic_update_eoi(vcpu, vector, trigger_mode);
 }
 
 static int apic_set_eoi(struct kvm_lapic *apic)

> so it looks like KVM really ought to return an
> error when trying to register a resample IRQFD when irqchip_split(),
> but do we have better options?  EOI handling in QEMU is pretty much
> non-existent and even if it was in place, it's a big performance
> regression for vfio INTx handling.  Is a split irqchip that retains
> performant resampling (EOI) support possible?  Thanks,
On 27/04/19 07:29, Paolo Bonzini wrote:
> The following untested patch would signal the resamplefd in
> kvm_ioapic_send_eoi, before requesting the exit to userspace.  However
> I am not sure how QEMU sets up the VFIO eventfds: if I understand
> correctly, when VFIO writes again to the irq eventfd, the interrupt
> request would not reach the userspace IOAPIC, but only the in-kernel
> LAPIC.  That would be incorrect, and if my understanding is correct we
> need to trigger resampling from hw/intc/ioapic.c.

Actually it's worse: because you're bypassing the IOAPIC when raising
the irq, the IOAPIC's remote_irr for example will not be set.  So split
irqchip currently must disable the intx fast path completely.

I guess we could also reimplement irqfd and resamplefd in the userspace
IOAPIC, and run the listener in a separate thread (using "-object
iothread" on the command line and AioContext in the code).
Paolo
On Sat, 27 Apr 2019 10:09:51 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> Actually it's worse: because you're bypassing the IOAPIC when raising
> the irq, the IOAPIC's remote_irr for example will not be set.  So
> split irqchip currently must disable the intx fast path completely.
>
> I guess we could also reimplement irqfd and resamplefd in the
> userspace IOAPIC, and run the listener in a separate thread (using
> "-object iothread" on the command line and AioContext in the code).

This sounds like a performance regression vs KVM irqchip any way we
slice it.  Was this change a mistake?

Without KVM support, the universal support in QEMU kicks in, where
device mmaps are disabled when an INTx occurs, forcing trapped access
to the device, and we assume that the next access is in response to an
interrupt and trigger our own internal EOI and re-enable mmaps.  A
timer acts as a catch-all.  Needless to say, this is functional but not
fast.  It would be a massive performance regression for devices
depending on INTx and previously using the KVM bypass to switch to
this.  INTx is largely considered a legacy interrupt, so non-x86 archs
don't encounter it as often; S390 even explicitly disables INTx
support.  ARM and POWER likely just don't see a lot of these devices,
but nearly all devices (except SR-IOV VFs) on x86 expect an INTx
fallback mode and some drivers may run the device in INTx for
compatibility.  This split irqchip change was likely fine for
"enterprise" users concerned only with modern high speed devices, but
very much not for device assignment used for compatibility use cases or
commodity hardware users.

What's a good 4.0.1 strategy to resolve this?  Re-instate KVM irqchip
as the Q35 default?  I can't see that simply switching to current QEMU
handling is a viable option for performance.  What about 4.1?  We could
certainly improve EOI support in QEMU, there's essentially no support
currently, but it seems like an uphill battle for an iothread based
userspace ioapic to ever compare to KVM handling.  Thanks,

Alex
On Mon, Apr 29, 2019 at 08:21:06AM -0600, Alex Williamson wrote:
> What's a good 4.0.1 strategy to resolve this?  Re-instate KVM irqchip
> as the Q35 default?  I can't see that simply switching to current QEMU
> handling is a viable option for performance?  What about 4.1?
[...]

irqchip=split and irqchip=kernel aren't guest ABI compatible, are
they?  That would make it impossible to fix this in pc-q35-4.0 for a
4.0.1 update.
On Mon, 29 Apr 2019 11:55:56 -0300
Eduardo Habkost <ehabkost@redhat.com> wrote:

> irqchip=split and irqchip=kernel aren't guest ABI compatible, are
> they?  That would make it impossible to fix this in pc-q35-4.0 for a
> 4.0.1 update.

I suppose it would require a pc-q35-4.0.1 machine type :-\  Thanks,

Alex
On Mon, 29 Apr 2019 08:21:06 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> What's a good 4.0.1 strategy to resolve this?  Re-instate KVM irqchip
> as the Q35 default?
[...]

Poking at this a bit, we can add kvm_irqchip_is_split() to the set of
things we test for in hw/vfio/pci.c:vfio_intx_enable_kvm() to avoid the
KVM INTx bypass when using a split IRQ chip.  This at least avoids paths
that cannot work currently.  We'll fall back to vfio's universal EOI
detection of toggling direct-mapped MemoryRegions, which is enough for
simple devices like NICs.

However, it's barely functional with an NVIDIA GeForce card assigned to
a Windows VM; it only takes a graphics test program to send it over the
edge and trigger a TDR.  Even with TDR disabled, the VM will hang.  I
also played with ioapic_eoi_broadcast() calling directly into vfio code
to trigger the EOI (without the mmap toggling); it's even worse, Windows
can't even get to the desktop before it hangs.  So while I was initially
impressed that netperf TCP_RR results for a gigabit NIC were not that
different between in-kernel ioapic and split irqchip, the graphics
results have me again wondering why we made this change and how
userspace handling can get us back to a functional level.

The only way I can get the GPU/Windows configuration usable is to
assert the IRQ, immediately de-assert, and unmask the device all from
vfio_intx_interrupt().  An interrupt intensive graphics benchmark runs
at ~80% of KVM irqchip with about 10% more CPU load with this
experiment (but it actually runs!).  Potentially some devices could
mimic INTx using MSI, like legacy KVM device assignment used to do with
this mode, eliminating the unmask ioctl, but even the legacy driver
noted compatibility issues with that mode and neither is a good
reproduction of how INTx is supposed to work.

Any other insights appreciated, and I really would like to understand
what we've gained with split irqchip and whether it's worth this.
Thanks,

Alex
On 01/05/19 01:01, Alex Williamson wrote:
> Poking at this a bit, we can add kvm_irqchip_is_split() to the set of
> things we test for in hw/vfio/pci.c:vfio_intx_enable_kvm() to avoid
> the KVM INTx bypass when using split IRQ chip.

Yes, this should be enough.

> The only way I can get the GPU/Windows configuration usable is to
> assert the IRQ, immediately de-assert, and unmask the device all from
> vfio_intx_interrupt().  An interrupt intensive graphics benchmark runs
> at ~80% of KVM irqchip with about 10% more CPU load with this
> experiment (but it actually runs!).

Nice.  If you can do it directly from hw/vfio there may be no need to
do more changes to the IOAPIC, at least not immediately.  But that is
not a good emulation of INTx, is it?  IIUC, it relies on the
level-triggered interrupt never being masked in the IOAPIC.

> Any other insights appreciated, and I really would like to understand
> what we've gained with split irqchip and whether it's worth this.

We've gained guest interrupt remapping support, we are not relying on
newer kernel versions, and the attack surface from the guest to the
hypervisor is smaller.

Paolo
On Wed, 1 May 2019 01:09:48 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 01/05/19 01:01, Alex Williamson wrote:
> > Poking at this a bit, we can add kvm_irqchip_is_split() to the set of
> > things we test for in hw/vfio/pci.c:vfio_intx_enable_kvm() to avoid the
> > KVM INTx bypass when using split IRQ chip.
>
> Yes, this should be enough.
>
> > The only way I can get the GPU/Windows configuration usable is to
> > assert the IRQ, immediately de-assert, and unmask the device all from
> > vfio_intx_interrupt().  An interrupt intensive graphics benchmark runs
> > at ~80% of KVM irqchip with about 10% more CPU load with this
> > experiment (but it actually runs!).
>
> Nice.  If you can do it directly from hw/vfio there may be no need to
> make more changes to the IOAPIC, at least not immediately.  But that is
> not a good emulation of INTx, is it?  IIUC, it relies on the
> level-triggered interrupt never being masked in the IOAPIC.

Yeah, that's the problem, or one of the problems: in addition to the
lower performance and increased overhead, it's a rather poor emulation
of INTx. It matches pci_irq_pulse(), whose use has been discouraged.
Anything but kernel irqchip makes me nervous for INTx, but emulating it
with a different physical IRQ type definitely more so.

> > Any other insights appreciated, and I really would like to understand
> > what we've gained with split irqchip and whether it's worth this.
>
> We've gained guest interrupt remapping support, we are not relying on
> newer kernel versions, and the attack surface from the guest to the
> hypervisor is smaller.

Interrupt remapping is really only necessary with high vCPU counts or
maybe nested device assignment; it seems doubtful that has a large
user base.
I did re-watch the video of Steve Rutherford's talk from a couple of
years ago, and he does reiterate that pulling the ioapic out of KVM has
drawbacks for general-purpose VMs, which I would have thought precluded
it from becoming the default for the base QEMU machine type. Getting a
GeForce card to run with MSI requires (repeated) regedit'ing on Windows
or module options on Linux. The audio device can require the same.
Otherwise we're probably mostly looking at old devices assigned for
compatibility using INTx; NICs or USB devices would of course more
likely use MSI/X for anything reasonably modern. Thanks,

Alex
On Mon, Apr 29, 2019 at 09:22:12AM -0600, Alex Williamson wrote:
[...]
> > > What's a good 4.0.1 strategy to resolve this?  Re-instate KVM irqchip
> > > as the Q35 default?  I can't see that simply switching to current QEMU
> > > handling is a viable option for performance?  What about 4.1?  We could
> > > certainly improve EOI support in QEMU, there's essentially no support
> > > currently, but it seems like an uphill battle for an iothread based
> > > userspace ioapic to ever compare to KVM handling?  Thanks,
> >
> > irqchip=split and irqchip=kernel aren't guest ABI compatible, are
> > they?  That would make it impossible to fix this in pc-q35-4.0
> > for a 4.0.1 update.
>
> I suppose it would require a pc-q35-4.0.1 machine type :-\  Thanks,

I wonder if it's possible to untangle this and make the irqchip
option stop affecting guest ABI on 4.1+ machine-types? This way
QEMU could choose smarter defaults in the future without the
compatibility code hassle.
On Mon, Apr 29, 2019 at 11:55:56AM -0300, Eduardo Habkost wrote:
> irqchip=split and irqchip=kernel aren't guest ABI compatible, are
> they?

Can you remind me why they aren't?
On Fri, May 03, 2019 at 04:00:33PM -0400, Michael S. Tsirkin wrote:
> On Mon, Apr 29, 2019 at 11:55:56AM -0300, Eduardo Habkost wrote:
> > irqchip=split and irqchip=kernel aren't guest ABI compatible, are
> > they?
>
> Can you remind me why they aren't?

We have the code introduced by patch 3/3 from this series, but I don't
know if it's the only difference:

hw/i386/x86-iommu.c=static void x86_iommu_realize(DeviceState *dev, Error **errp)
[...]
hw/i386/x86-iommu.c:    bool irq_all_kernel = kvm_irqchip_in_kernel() && !kvm_irqchip_is_split();
[...]
hw/i386/x86-iommu.c-    /* If the user didn't specify IR, choose a default value for it */
hw/i386/x86-iommu.c-    if (x86_iommu->intr_supported == ON_OFF_AUTO_AUTO) {
hw/i386/x86-iommu.c-        x86_iommu->intr_supported = irq_all_kernel ?
hw/i386/x86-iommu.c-            ON_OFF_AUTO_OFF : ON_OFF_AUTO_ON;
hw/i386/x86-iommu.c-    }
On Fri, May 03, 2019 at 03:42:06PM -0300, Eduardo Habkost wrote:
> On Mon, Apr 29, 2019 at 09:22:12AM -0600, Alex Williamson wrote:
> [...]
> > > > What's a good 4.0.1 strategy to resolve this?  Re-instate KVM irqchip
> > > > as the Q35 default?  I can't see that simply switching to current QEMU
> > > > handling is a viable option for performance?  What about 4.1?  We could
> > > > certainly improve EOI support in QEMU, there's essentially no support
> > > > currently, but it seems like an uphill battle for an iothread based
> > > > userspace ioapic to ever compare to KVM handling?  Thanks,
> > >
> > > irqchip=split and irqchip=kernel aren't guest ABI compatible, are
> > > they?  That would make it impossible to fix this in pc-q35-4.0
> > > for a 4.0.1 update.
> >
> > I suppose it would require a pc-q35-4.0.1 machine type :-\  Thanks,
>
> I wonder if it's possible to untangle this and make the irqchip
> option stop affecting guest ABI on 4.1+ machine-types? This way
> QEMU could choose smarter defaults in the future without the
> compatibility code hassle.

Hi, Eduardo,

Do you mean to enable IOMMU IR for kernel-irqchip=on? If so, I would
say it's not trivial... The major issue is that we would probably need
to explicitly kick QEMU on every kernel IOAPIC setup, since QEMU is the
only one who knows everything about interrupt remapping, while KVM
doesn't even have such a mechanism so far.

I'm wondering whether we should try to fix the functional problem
first, as proposed by Alex. The problem is that even if we switch the
default mode for Q35, the user could still specify it on the command
line, so I feel we'd better fix the functional issue before we worry
about the possible performance regression.
The worst case I thought of is that with the fix we offer a good QEMU
4.1/4.0.1 for users (I believe this mostly means individual users,
since the issue seems not to affect most enterprise users on modern
hardware), and then we also tell people to explicitly enable
kernel-irqchip=on to avoid the potential performance degradation.

(Sorry for the late chime-in, and of course sorry for breaking VFIO
from the very beginning...)
On 05/05/19 04:06, Peter Xu wrote:
>> I wonder if it's possible to untangle this and make the irqchip
>> option stop affecting guest ABI on 4.1+ machine-types? This way
>> QEMU could choose smarter defaults in the future without the
>> compatibility code hassle.
>
> Hi, Eduardo,
>
> Do you mean to enable IOMMU IR for kernel-irqchip=on? If so, I would
> say it's not trivial... The major issue is that we probably need to
> explicitly kick QEMU for every kernel IOAPIC setup, since QEMU is the
> only one who knows everything about interrupt remapping, while KVM
> doesn't even have such a mechanism so far.

Right, it's not easy, and it would anyway be possible only on newer
kernels. There would have to be a mechanism to set up IOAPIC->MSI
routes, similar to irqchip=split, and, as Peter mentions, an MMIO exit
on writes to the routing table.

Paolo
On 03/05/19 15:00, Michael S. Tsirkin wrote:
> On Mon, Apr 29, 2019 at 11:55:56AM -0300, Eduardo Habkost wrote:
>> irqchip=split and irqchip=kernel aren't guest ABI compatible, are
>> they?
>
> Can you remind me why they aren't?

They are compatible if the userspace IOAPIC is configured with
"-global ioapic.version=0x11". The userspace IOAPIC additionally
supports version 0x20, which adds the EOI register (a requirement for
interrupt remapping, but not necessary outside that case), and makes
it the default.

Paolo
On Mon, May 06, 2019 at 11:13:28AM -0500, Paolo Bonzini wrote:
> On 05/05/19 04:06, Peter Xu wrote:
> >> I wonder if it's possible to untangle this and make the irqchip
> >> option stop affecting guest ABI on 4.1+ machine-types? This way
> >> QEMU could choose smarter defaults in the future without the
> >> compatibility code hassle.
> > Hi, Eduardo,
> >
> > Do you mean to enable IOMMU IR for kernel-irqchip=on? If so, I would
> > say it's not trivial... The major issue is that we probably need to
> > explicitly kick QEMU for every kernel IOAPIC setups since QEMU is the
> > only one who knows everything about interrupt remapping, while KVM
> > don't even have such a mechanism so far.
>
> Right, it's not easy and it would be anyway possible only on kernels.
> There would have to be a mechanism to setup IOAPIC->MSI routes, similar
> to irqchip=split, and as Peter mentions an MMIO exit on writes to the
> routing table.

I don't mean we should necessarily enable IR for kernel-irqchip=on,
too. I'd just prefer the default setting not to depend on the
kernel-irqchip option. x86-iommu could either have intremap=on as the
default (and refuse to run with kernel-irqchip=on unless intremap=off
is given explicitly), or simply default to intremap=off.

But as Paolo indicated elsewhere, this is not the only guest ABI
difference between "on" and "split". It's probably not worth the
hassle to try to untangle this.
diff --git a/hw/core/machine.c b/hw/core/machine.c
index c51423b647..4439ea663f 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -653,8 +653,10 @@ static void machine_class_base_init(ObjectClass *oc, void *data)
 static void machine_initfn(Object *obj)
 {
     MachineState *ms = MACHINE(obj);
+    MachineClass *mc = MACHINE_GET_CLASS(obj);
 
     ms->kernel_irqchip_allowed = true;
+    ms->kernel_irqchip_split = mc->default_kernel_irqchip_split;
     ms->kvm_shadow_mem = -1;
     ms->dump_guest_core = true;
     ms->mem_merge = true;
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 58459bdab5..d2fb0fa49f 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -304,6 +304,7 @@ static void pc_q35_machine_options(MachineClass *m)
     m->units_per_default_bus = 1;
     m->default_machine_opts = "firmware=bios-256k.bin";
     m->default_display = "std";
+    m->default_kernel_irqchip_split = true;
     m->no_floppy = 1;
     machine_class_allow_dynamic_sysbus_dev(m, TYPE_AMD_IOMMU_DEVICE);
     machine_class_allow_dynamic_sysbus_dev(m, TYPE_INTEL_IOMMU_DEVICE);
@@ -323,6 +324,7 @@ DEFINE_Q35_MACHINE(v4_0, "pc-q35-4.0", NULL,
 static void pc_q35_3_1_machine_options(MachineClass *m)
 {
     pc_q35_4_0_machine_options(m);
+    m->default_kernel_irqchip_split = false;
     m->alias = NULL;
     SET_MACHINE_COMPAT(m, PC_COMPAT_3_1);
 }
diff --git a/include/hw/boards.h b/include/hw/boards.h
index f82f28468b..362384815e 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -195,6 +195,7 @@ struct MachineClass {
     const char *hw_version;
     ram_addr_t default_ram_size;
     const char *default_cpu_type;
+    bool default_kernel_irqchip_split;
     bool option_rom_has_mr;
     bool rom_file_has_mr;
     int minimum_page_bits;
Starting from QEMU 4.0, let's specify "split" as the default value for
kernel-irqchip.

So for QEMU>=4.0 we'll have: allowed=Y,required=N,split=Y
   for QEMU<=3.1 we'll have: allowed=Y,required=N,split=N
(omitting all the "kernel_irqchip_" prefix)

Note that this will make the default q35 machine type depend on Linux
version 4.4 or newer, because that's where the split irqchip was
introduced in the kernel. But it's fine, since we're bumping the
supported Linux version for QEMU 4.0 to around Linux 4.5. For more
information please refer to the discussion on AMD's RDTSCP:

https://lore.kernel.org/lkml/20181210181328.GA762@zn.tnic/

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/core/machine.c   | 2 ++
 hw/i386/pc_q35.c    | 2 ++
 include/hw/boards.h | 1 +
 3 files changed, 5 insertions(+)