mbox series

[0/5] vfio/pci: Fix up breakage against split irqchip and INTx

Message ID 20200226225048.216508-1-peterx@redhat.com (mailing list archive)
Headers show
Series vfio/pci: Fix up breakage against split irqchip and INTx | expand

Message

Peter Xu Feb. 26, 2020, 10:50 p.m. UTC
VFIO INTx is not working with split irqchip.  On new kernels KVM_IRQFD
will directly fail with resamplefd attached so QEMU will automatically
fallback to the INTx slow path.  However on old kernels it's still
broken.

Only until recently I noticed that this could also break PXE boot for
assigned NICs [1].  My wild guess is that the PXE ROM will be mostly
using INTx as well, which means we can't bypass that even if we
enables MSI for the guest kernel.

This series tries to first fix this issue function-wise, then speed up
for the INTx again with resamplefd (mostly following the ideas
proposed by Paolo one year ago [2]).  My TCP_RR test shows that:

  - Before this series: this is broken, no number to show

  - After patch 1 (enable slow path): get 63% perf comparing to full
    kernel irqchip

  - After whole series (enable fast path partly, irq injection will be
    the same as fast path, however userspace needs to intercept for
    EOI broadcast to resamplefd, though should still be faster than
    the MMIO trick for intx eoi): get 93% perf comparing to full
    kernel irqchip, which is a 46% performance boost

I think we can consider to apply patch 1 even sooner than the rest of
the series to unbreak intx+split first.

The whole test matrix for reference:

  |----------+---------+-------------------+--------------+--------------------|
  | IRQ type | irqchip | TCP_STREAM (Gbps) | TCP_RR (pps) | note               |
  |----------+---------+-------------------+--------------+--------------------|
  | msi      | on      |              9.39 |        17567 |                    |
  | nomsi    | on      |              9.29 |        14056 |                    |
  | msi      | split   |              9.36 |        17330 |                    |
  | nomsi    | split   |                 / |            / | currently broken   |
  | nomsi    | split   |              8.98 |         8977 | after patch 1      |
  | nomsi    | split   |              9.21 |        13142 | after whole series |
  |----------+---------+-------------------+--------------+--------------------|

Any review comment is welcomed.  Thanks,

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1786404
[2] https://patchwork.kernel.org/patch/10738541/#22609933

Peter Xu (5):
  vfio/pci: Disable INTx fast path if using split irqchip
  vfio/pci: Use kvm_irqchip_add_irqfd_notifier_gsi() for irqfds
  KVM: Pass EventNotifier into kvm_irqchip_assign_irqfd
  KVM: Kick resamplefd for split kernel irqchip
  Revert "vfio/pci: Disable INTx fast path if using split irqchip"

 accel/kvm/kvm-all.c    | 83 +++++++++++++++++++++++++++++++++++++++---
 accel/kvm/trace-events |  1 +
 hw/intc/ioapic.c       | 11 +++++-
 hw/vfio/pci.c          | 37 ++++++++-----------
 include/sysemu/kvm.h   |  4 ++
 5 files changed, 106 insertions(+), 30 deletions(-)

Comments

Eric Auger Feb. 27, 2020, 3:32 p.m. UTC | #1
Hi Peter,

On 2/26/20 11:50 PM, Peter Xu wrote:
> VFIO INTx is not working with split irqchip.  On new kernels KVM_IRQFD
> will directly fail with resamplefd attached so QEMU will automatically
> fallback to the INTx slow path.  However on old kernels it's still
> broken.
I think you should quote the commit that changes the behavior:
654f1f13ea56  kvm: Check irqchip mode before assign irqfd?
so that one can understand what it fixed.

So on kernels that do not feature that commit, the ioctl succeeds and we
pretend the resamplefd was properly attached whereas in practice it was
never triggered.

Thanks

Eric

> 
> Only until recently I noticed that this could also break PXE boot for
> assigned NICs [1].  My wild guess is that the PXE ROM will be mostly
> using INTx as well, which means we can't bypass that even if we
> enables MSI for the guest kernel.
> 
> This series tries to first fix this issue function-wise, then speed up
> for the INTx again with resamplefd (mostly following the ideas
> proposed by Paolo one year ago [2]).  My TCP_RR test shows that:

> 
>   - Before this series: this is broken, no number to show
> 
>   - After patch 1 (enable slow path): get 63% perf comparing to full
>     kernel irqchip
> 
>   - After whole series (enable fast path partly, irq injection will be
>     the same as fast path, however userspace needs to intercept for
>     EOI broadcast to resamplefd, though should still be faster than
>     the MMIO trick for intx eoi): get 93% perf comparing to full
>     kernel irqchip, which is a 46% performance boost
> 
> I think we can consider to apply patch 1 even sooner than the rest of
> the series to unbreak intx+split first.
> 
> The whole test matrix for reference:
> 
>   |----------+---------+-------------------+--------------+--------------------|
>   | IRQ type | irqchip | TCP_STREAM (Gbps) | TCP_RR (pps) | note               |
>   |----------+---------+-------------------+--------------+--------------------|
>   | msi      | on      |              9.39 |        17567 |                    |
>   | nomsi    | on      |              9.29 |        14056 |                    |
>   | msi      | split   |              9.36 |        17330 |                    |
>   | nomsi    | split   |                 / |            / | currently broken   |
>   | nomsi    | split   |              8.98 |         8977 | after patch 1      |
>   | nomsi    | split   |              9.21 |        13142 | after whole series |
>   |----------+---------+-------------------+--------------+--------------------|
> 
> Any review comment is welcomed.  Thanks,
> 
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1786404
> [2] https://patchwork.kernel.org/patch/10738541/#22609933
> 
> Peter Xu (5):
>   vfio/pci: Disable INTx fast path if using split irqchip
>   vfio/pci: Use kvm_irqchip_add_irqfd_notifier_gsi() for irqfds
>   KVM: Pass EventNotifier into kvm_irqchip_assign_irqfd
>   KVM: Kick resamplefd for split kernel irqchip
>   Revert "vfio/pci: Disable INTx fast path if using split irqchip"
> 
>  accel/kvm/kvm-all.c    | 83 +++++++++++++++++++++++++++++++++++++++---
>  accel/kvm/trace-events |  1 +
>  hw/intc/ioapic.c       | 11 +++++-
>  hw/vfio/pci.c          | 37 ++++++++-----------
>  include/sysemu/kvm.h   |  4 ++
>  5 files changed, 106 insertions(+), 30 deletions(-)
>
Peter Xu Feb. 27, 2020, 3:51 p.m. UTC | #2
On Thu, Feb 27, 2020 at 04:32:24PM +0100, Auger Eric wrote:
> Hi Peter,

Hi, Eric,

> 
> On 2/26/20 11:50 PM, Peter Xu wrote:
> > VFIO INTx is not working with split irqchip.  On new kernels KVM_IRQFD
> > will directly fail with resamplefd attached so QEMU will automatically
> > fallback to the INTx slow path.  However on old kernels it's still
> > broken.
> I think you should quote the commit that changes the behavior:
> 654f1f13ea56  kvm: Check irqchip mode before assign irqfd?
> so that one can understand what it fixed.

Right I should mention that.

Actually I mentioned it in an old cover letter but unluckily something
wrong happened with git-publish and merely my whole cover letter is
gone...  So what you saw is actually a simplified version and I think
I must have lost something like this... :(

Thanks for bringing that up.

> 
> So on kernels that do not feature that commit, the ioctl succeeds and we
> pretend the resamplefd was properly attached whereas in practice it was
> never triggered.

Right.  Another thing I just noticed is that we must _not_ attach the
resamplefd to the KVM_IRQFD ioctl when we registered this in the
userspace.  One thing is it's for sure not needed at all.  More
importantly, on new kernels (after commit 654f1f13ea56) the KVM_IRQFD
could fail (because it sees both split irqchip and resamplefd) and
QEMU will fallback again to the slow path even if the fast path can
work.

Thanks,
Peter Xu Feb. 27, 2020, 5:02 p.m. UTC | #3
On Thu, Feb 27, 2020 at 10:51:46AM -0500, Peter Xu wrote:
> Right.  Another thing I just noticed is that we must _not_ attach the
> resamplefd to the KVM_IRQFD ioctl when we registered this in the
> userspace.  One thing is it's for sure not needed at all.  More
> importantly, on new kernels (after commit 654f1f13ea56) the KVM_IRQFD
> could fail (because it sees both split irqchip and resamplefd) and
> QEMU will fallback again to the slow path even if the fast path can
> work.

I've just posted a v1.1 version of patch 4 only, which should fix this
problem on new kernels, so fast path will always be preferred.  The
only changed function is kvm_irqchip_assign_irqfd().
Paolo Bonzini Feb. 28, 2020, 10:36 a.m. UTC | #4
On 26/02/20 23:50, Peter Xu wrote:
> VFIO INTx is not working with split irqchip.  On new kernels KVM_IRQFD
> will directly fail with resamplefd attached so QEMU will automatically
> fallback to the INTx slow path.  However on old kernels it's still
> broken.
> 
> Only until recently I noticed that this could also break PXE boot for
> assigned NICs [1].  My wild guess is that the PXE ROM will be mostly
> using INTx as well, which means we can't bypass that even if we
> enables MSI for the guest kernel.
> 
> This series tries to first fix this issue function-wise, then speed up
> for the INTx again with resamplefd (mostly following the ideas
> proposed by Paolo one year ago [2]).  My TCP_RR test shows that:
> 
>   - Before this series: this is broken, no number to show
> 
>   - After patch 1 (enable slow path): get 63% perf comparing to full
>     kernel irqchip

Oh, I thought something like patch 1 had already been applied.

One comment: because you're bypassing IOAPIC when raising the irq, the
IOAPIC's remote_irr for example will not be set.  Most OSes probably
don't care, but it's at least worth a comment.

Paolo

>   - After whole series (enable fast path partly, irq injection will be
>     the same as fast path, however userspace needs to intercept for
>     EOI broadcast to resamplefd, though should still be faster than
>     the MMIO trick for intx eoi): get 93% perf comparing to full
>     kernel irqchip, which is a 46% performance boost
Peter Xu Feb. 28, 2020, 3:25 p.m. UTC | #5
On Fri, Feb 28, 2020 at 11:36:55AM +0100, Paolo Bonzini wrote:
> On 26/02/20 23:50, Peter Xu wrote:
> > VFIO INTx is not working with split irqchip.  On new kernels KVM_IRQFD
> > will directly fail with resamplefd attached so QEMU will automatically
> > fallback to the INTx slow path.  However on old kernels it's still
> > broken.
> > 
> > Only until recently I noticed that this could also break PXE boot for
> > assigned NICs [1].  My wild guess is that the PXE ROM will be mostly
> > using INTx as well, which means we can't bypass that even if we
> > enables MSI for the guest kernel.
> > 
> > This series tries to first fix this issue function-wise, then speed up
> > for the INTx again with resamplefd (mostly following the ideas
> > proposed by Paolo one year ago [2]).  My TCP_RR test shows that:
> > 
> >   - Before this series: this is broken, no number to show
> > 
> >   - After patch 1 (enable slow path): get 63% perf comparing to full
> >     kernel irqchip
> 
> Oh, I thought something like patch 1 had already been applied.
> 
> One comment: because you're bypassing IOAPIC when raising the irq, the
> IOAPIC's remote_irr for example will not be set.  Most OSes probably
> don't care, but it's at least worth a comment.

Ouch I should definitely do that...  How about something like this
(in ioapic_eoi_broadcast(), I even changed kvm_resample_fd_notify to
return a boolean to show whether some GSI is kicked so for this case
we don't need to proceed on checking irr and remote irr):

            /*
             * When IOAPIC is in the userspace while APIC is still in
             * the kernel (i.e., split irqchip), we have a trick to
             * kick the resamplefd logic for registered irqfds from
             * userspace to deactivate the IRQ.  When that happens, it
             * means the irq bypassed userspace IOAPIC (so the irr and
             * remote-irr of the table entry should be bypassed too
             * even if interrupt come), then we don't need to clear
             * the remote-IRR and check irr again because they'll
             * always be zeros.
             */
            if (kvm_resample_fd_notify(n)) {
                continue;
            }

I confess this is still tricky, and actually after some careful read I
noticed you've proposed a similar kernel fix for the problem too which
I overlooked (https://patchwork.kernel.org/patch/10738541/#22609933).
My current thought is that we keep this hackery in userspace only so
we keep split+resamplefd forbidden in the kernel and be clean there.

What's your opinion?

(I should have marked this series as RFC when post)

Thanks,
Paolo Bonzini Feb. 28, 2020, 3:32 p.m. UTC | #6
On 28/02/20 16:25, Peter Xu wrote:
> My current thought is that we keep this hackery in userspace only so
> we keep split+resamplefd forbidden in the kernel and be clean there.

It is better, yes.  The kernel solution makes some sense because split
irqchip _does_ have a concept of ioapic GSIs and routes.  But the kernel
patch would not fix the fact that remote-irr remains zero, which would
need a userspace change anyway.

Paolo