Message ID | 20200226225048.216508-1-peterx@redhat.com (mailing list archive) |
---|---|
Headers | show |
Series | vfio/pci: Fix up breakage against split irqchip and INTx | expand |
Hi Peter, On 2/26/20 11:50 PM, Peter Xu wrote: > VFIO INTx is not working with split irqchip. On new kernels KVM_IRQFD > will directly fail with resamplefd attached so QEMU will automatically > fallback to the INTx slow path. However on old kernels it's still > broken. I think you should quote the commit that changes the behavior: 654f1f13ea56 kvm: Check irqchip mode before assign irqfd? so that one can understand what it fixed. So on kernels that do not feature that commit, the ioctl succeeds and we pretend the resamplefd was properly attached whereas in practice it was never triggered. Thanks Eric > > Only until recently I noticed that this could also break PXE boot for > assigned NICs [1]. My wild guess is that the PXE ROM will be mostly > using INTx as well, which means we can't bypass that even if we > enables MSI for the guest kernel. > > This series tries to first fix this issue function-wise, then speed up > for the INTx again with resamplefd (mostly following the ideas > proposed by Paolo one year ago [2]). My TCP_RR test shows that: > > - Before this series: this is broken, no number to show > > - After patch 1 (enable slow path): get 63% perf comparing to full > kernel irqchip > > - After whole series (enable fast path partly, irq injection will be > the same as fast path, however userspace needs to intercept for > EOI broadcast to resamplefd, though should still be faster than > the MMIO trick for intx eoi): get 93% perf comparing to full > kernel irqchip, which is a 46% performance boost > > I think we can consider to apply patch 1 even sooner than the rest of > the series to unbreak intx+split first. > > The whole test matrix for reference: > > |----------+---------+-------------------+--------------+--------------------| > | IRQ type | irqchip | TCP_STREAM (Gbps) | TCP_RR (pps) | note | > |----------+---------+-------------------+--------------+--------------------| > | msi | on | 9.39 | 17567 | | > | nomsi | on | 9.29 | 14056 | | > | msi | split | 9.36 | 17330 | | > | nomsi | split | / | / | currently broken | > | nomsi | split | 8.98 | 8977 | after patch 1 | > | nomsi | split | 9.21 | 13142 | after whole series | > |----------+---------+-------------------+--------------+--------------------| > > Any review comment is welcomed. Thanks, > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1786404 > [2] https://patchwork.kernel.org/patch/10738541/#22609933 > > Peter Xu (5): > vfio/pci: Disable INTx fast path if using split irqchip > vfio/pci: Use kvm_irqchip_add_irqfd_notifier_gsi() for irqfds > KVM: Pass EventNotifier into kvm_irqchip_assign_irqfd > KVM: Kick resamplefd for split kernel irqchip > Revert "vfio/pci: Disable INTx fast path if using split irqchip" > > accel/kvm/kvm-all.c | 83 +++++++++++++++++++++++++++++++++++++++--- > accel/kvm/trace-events | 1 + > hw/intc/ioapic.c | 11 +++++- > hw/vfio/pci.c | 37 ++++++++----------- > include/sysemu/kvm.h | 4 ++ > 5 files changed, 106 insertions(+), 30 deletions(-) >
On Thu, Feb 27, 2020 at 04:32:24PM +0100, Auger Eric wrote: > Hi Peter, Hi, Eric, > > On 2/26/20 11:50 PM, Peter Xu wrote: > > VFIO INTx is not working with split irqchip. On new kernels KVM_IRQFD > > will directly fail with resamplefd attached so QEMU will automatically > > fallback to the INTx slow path. However on old kernels it's still > > broken. > I think you should quote the commit that changes the behavior: > 654f1f13ea56 kvm: Check irqchip mode before assign irqfd? > so that one can understand what it fixed. Right I should mention that. Actually I mentioned it in an old cover letter but unluckily something wrong happened with git-publish and merely my whole cover letter is gone... So what you saw is actually a simplified version and I think I must have lost something like this... :( Thanks for bringing that up. > > So on kernels that do not feature that commit, the ioctl succeeds and we > pretend the resamplefd was properly attached whereas in practice it was > never triggered. Right. Another thing I just noticed is that we must _not_ attach the resamplefd to the KVM_IRQFD ioctl when we registered this in the userspace. One thing is it's for sure not needed at all. More importantly, on new kernels (after commit 654f1f13ea56) the KVM_IRQFD could fail (because it sees both split irqchip and resamplefd) and QEMU will fallback again to the slow path even if the fast path can work. Thanks,
On Thu, Feb 27, 2020 at 10:51:46AM -0500, Peter Xu wrote: > Right. Another thing I just noticed is that we must _not_ attach the > resamplefd to the KVM_IRQFD ioctl when we registered this in the > userspace. One thing is it's for sure not needed at all. More > importantly, on new kernels (after commit 654f1f13ea56) the KVM_IRQFD > could fail (because it sees both split irqchip and resamplefd) and > QEMU will fallback again to the slow path even if the fast path can > work. I've just posted a v1.1 version of patch 4 only, which should fix this problem on new kernels, so fast path will always be preferred. The only changed function is kvm_irqchip_assign_irqfd().
On 26/02/20 23:50, Peter Xu wrote: > VFIO INTx is not working with split irqchip. On new kernels KVM_IRQFD > will directly fail with resamplefd attached so QEMU will automatically > fallback to the INTx slow path. However on old kernels it's still > broken. > > Only until recently I noticed that this could also break PXE boot for > assigned NICs [1]. My wild guess is that the PXE ROM will be mostly > using INTx as well, which means we can't bypass that even if we > enables MSI for the guest kernel. > > This series tries to first fix this issue function-wise, then speed up > for the INTx again with resamplefd (mostly following the ideas > proposed by Paolo one year ago [2]). My TCP_RR test shows that: > > - Before this series: this is broken, no number to show > > - After patch 1 (enable slow path): get 63% perf comparing to full > kernel irqchip Oh, I thought something like patch 1 had already been applied. One comment: because you're bypassing IOAPIC when raising the irq, the IOAPIC's remote_irr for example will not be set. Most OSes probably don't care, but it's at least worth a comment. Paolo > - After whole series (enable fast path partly, irq injection will be > the same as fast path, however userspace needs to intercept for > EOI broadcast to resamplefd, though should still be faster than > the MMIO trick for intx eoi): get 93% perf comparing to full > kernel irqchip, which is a 46% performance boost
On Fri, Feb 28, 2020 at 11:36:55AM +0100, Paolo Bonzini wrote: > On 26/02/20 23:50, Peter Xu wrote: > > VFIO INTx is not working with split irqchip. On new kernels KVM_IRQFD > > will directly fail with resamplefd attached so QEMU will automatically > > fallback to the INTx slow path. However on old kernels it's still > > broken. > > > > Only until recently I noticed that this could also break PXE boot for > > assigned NICs [1]. My wild guess is that the PXE ROM will be mostly > > using INTx as well, which means we can't bypass that even if we > > enables MSI for the guest kernel. > > > > This series tries to first fix this issue function-wise, then speed up > > for the INTx again with resamplefd (mostly following the ideas > > proposed by Paolo one year ago [2]). My TCP_RR test shows that: > > > > - Before this series: this is broken, no number to show > > > > - After patch 1 (enable slow path): get 63% perf comparing to full > > kernel irqchip > > Oh, I thought something like patch 1 had already been applied. > > One comment: because you're bypassing IOAPIC when raising the irq, the > IOAPIC's remote_irr for example will not be set. Most OSes probably > don't care, but it's at least worth a comment. Ouch I should definitely do that... How about something like this (in ioapic_eoi_broadcast(), I even changed kvm_resample_fd_notify to return a boolean to show whether some GSI is kicked so for this case we don't need to proceed on checking irr and remote irr): /* * When IOAPIC is in the userspace while APIC is still in * the kernel (i.e., split irqchip), we have a trick to * kick the resamplefd logic for registered irqfds from * userspace to deactivate the IRQ. When that happens, it * means the irq bypassed userspace IOAPIC (so the irr and * remote-irr of the table entry should be bypassed too * even if interrupt come), then we don't need to clear * the remote-IRR and check irr again because they'll * always be zeros. */ if (kvm_resample_fd_notify(n)) { continue; } I confess this is still tricky, and actually after some careful read I noticed you've proposed a similar kernel fix for the problem too which I overlooked (https://patchwork.kernel.org/patch/10738541/#22609933). My current thought is that we keep this hackery in userspace only so we keep split+resamplefd forbidden in the kernel and be clean there. What's your opinion? (I should have marked this series as RFC when post) Thanks,
On 28/02/20 16:25, Peter Xu wrote: > My current thought is that we keep this hackery in userspace only so > we keep split+resamplefd forbidden in the kernel and be clean there. It is better, yes. The kernel solution makes some sense because split irqchip _does_ have a concept of ioapic GSIs and routes. But the kernel patch would not fix the fact that remote-irr remains zero, which would need a userspace change anyway. Paolo