Message ID: 20240112091128.3868059-1-foxywang@tencent.com (mailing list archive)
State:      New, archived
Series:     KVM: irqchip: synchronize srcu only if needed
+other KVM maintainers

On Fri, Jan 12, 2024, Yi Wang wrote:
> From: Yi Wang <foxywang@tencent.com>
>
> We found that it may cost more than 20 milliseconds very accidentally
> to enable cap of KVM_CAP_SPLIT_IRQCHIP on a host which has many vms
> already.
>
> The reason is that when vmm(qemu/CloudHypervisor) invokes
> KVM_CAP_SPLIT_IRQCHIP kvm will call synchronize_srcu_expedited() and
> might_sleep and kworker of srcu may cost some delay during this period.

might_sleep() yielding is not justification for changing KVM.  That's more or
less saying "my task got preempted and took longer to run".  Well, yeah.

> Since this happens during creating vm, it's no need to synchronize srcu
> now 'cause everything is not ready(vcpu/irqfd) and none uses irq_srcu now.
>
> Signed-off-by: Yi Wang <foxywang@tencent.com>
> ---
>  arch/x86/kvm/irq_comm.c  | 2 +-
>  include/linux/kvm_host.h | 2 ++
>  virt/kvm/irqchip.c       | 3 ++-
>  3 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
> index 16d076a1b91a..37c92b7486c7 100644
> --- a/arch/x86/kvm/irq_comm.c
> +++ b/arch/x86/kvm/irq_comm.c
> @@ -394,7 +394,7 @@ static const struct kvm_irq_routing_entry empty_routing[] = {};
>
>  int kvm_setup_empty_irq_routing(struct kvm *kvm)
>  {
> -	return kvm_set_irq_routing(kvm, empty_routing, 0, 0);
> +	return kvm_set_irq_routing(kvm, empty_routing, 0, NONEED_SYNC_SRCU);
>  }
>
>  void kvm_arch_post_irq_routing_update(struct kvm *kvm)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4944136efaa2..a46370cca355 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1995,6 +1995,8 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>
>  #define KVM_MAX_IRQ_ROUTES 4096 /* might need extension/rework in the future */
>
> +#define NONEED_SYNC_SRCU (1U << 0)
> +
>  bool kvm_arch_can_set_irq_routing(struct kvm *kvm);
>  int kvm_set_irq_routing(struct kvm *kvm,
>  			const struct kvm_irq_routing_entry *entries,
> diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
> index 1e567d1f6d3d..cea5c43c1a49 100644
> --- a/virt/kvm/irqchip.c
> +++ b/virt/kvm/irqchip.c
> @@ -224,7 +224,8 @@ int kvm_set_irq_routing(struct kvm *kvm,
>
>  	kvm_arch_post_irq_routing_update(kvm);
>
> -	synchronize_srcu_expedited(&kvm->irq_srcu);
> +	if (!(flags & NONEED_SYNC_SRCU))
> +		synchronize_srcu_expedited(&kvm->irq_srcu);

I'm not a fan of x86 passing in a magic flag.  It's not immediately clear why
skipping synchronization is safe.  Piecing things together, _on x86_, I believe
the answer is that vCPUs can't have been created yet, kvm->lock is held, _and_
kvm_arch_irqfd_allowed() will subtly reject attempts to assign irqfds if the
local APIC isn't in-kernel.

But AFAICT, s390's implementation of KVM_CREATE_IRQCHIP, which sets up identical
dummy/empty routing, doesn't provide the same guarantees.

	case KVM_CREATE_IRQCHIP: {
		struct kvm_irq_routing_entry routing;

		r = -EINVAL;
		if (kvm->arch.use_irqchip) {
			/* Set up dummy routing. */
			memset(&routing, 0, sizeof(routing));
			r = kvm_set_irq_routing(kvm, &routing, 0, 0);
		}
		break;
	}

It's entirely possible that someday, kvm_setup_empty_irq_routing() is moved to
common KVM and used for s390, at which point we have a mess on our hands because
it's not at all obvious whether or not it's safe for s390 to also skip
synchronization.

Rather than hack in a workaround for x86, I would rather we try and clean up
this mess.  Except for kvm_irq_map_gsi(), it looks like all flows assume
irq_routing is non-NULL.  But I'm not remotely confident that that holds true
on all architectures, e.g. the only reason kvm_irq_map_gsi() checks for a NULL
irq_routing is because syzkaller generated a splat (commit c622a3c21ede ("KVM:
irqfd: fix NULL pointer dereference in kvm_irq_map_gsi")).

And on x86, I'm pretty sure as of commit 654f1f13ea56 ("kvm: Check irqchip mode
before assign irqfd"), which added kvm_arch_irqfd_allowed(), it's impossible for
kvm_irq_map_gsi() to encounter a NULL irq_routing _on x86_.

But I strongly suspect other architectures can reach kvm_irq_map_gsi() with a
NULL irq_routing, e.g. RISC-V dynamically configures its interrupt controller,
yet doesn't implement kvm_arch_intc_initialized().

So instead of special casing x86, what if we instead have KVM setup an empty
IRQ routing table during kvm_create_vm(), and then avoid this mess entirely?
That way x86 and s390 no longer need to set empty/dummy routing when creating
an IRQCHIP, and the worst case scenario of userspace misusing an ioctl() is no
longer a NULL pointer deref.
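[Editor's note] Sean's proposal — installing an empty routing table from kvm_create_vm(), before any vCPU or irqfd can exist — might look roughly like the following kernel-style pseudocode sketch. The helper name kvm_init_irq_routing() and the exact fields of struct kvm_irq_routing_table are assumptions for illustration, not code from this thread.

```c
/*
 * Pseudocode sketch (assumed names/fields): install a minimal routing table
 * while the VM is still being created.  No vCPU or irqfd exists yet, so
 * nothing can be inside an irq_srcu read-side critical section, and the
 * table can be published with RCU_INIT_POINTER() without any
 * synchronize_srcu() call.
 */
int kvm_init_irq_routing(struct kvm *kvm)
{
	struct kvm_irq_routing_table *new;

	new = kzalloc(struct_size(new, map, 1), GFP_KERNEL_ACCOUNT);
	if (!new)
		return -ENOMEM;

	new->nr_rt_entries = 1;
	/* No irqchip pin is mapped to any GSI yet. */
	memset(new->chip, -1, sizeof(new->chip));

	RCU_INIT_POINTER(kvm->irq_routing, new);
	return 0;
}
```

With something along these lines called from kvm_create_vm(), kvm_setup_empty_irq_routing() on x86 and the dummy-routing setup in s390's KVM_CREATE_IRQCHIP would become unnecessary, and kvm->irq_routing would never be observed as NULL.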
Many thanks for your kind and detailed reply, Sean!

On Sat, Jan 13, 2024 at 12:28 AM Sean Christopherson <seanjc@google.com> wrote:
>
> +other KVM maintainers
>
> On Fri, Jan 12, 2024, Yi Wang wrote:
> > From: Yi Wang <foxywang@tencent.com>
> >
> > We found that it may cost more than 20 milliseconds very accidentally
> > to enable cap of KVM_CAP_SPLIT_IRQCHIP on a host which has many vms
> > already.
> >
> > The reason is that when vmm(qemu/CloudHypervisor) invokes
> > KVM_CAP_SPLIT_IRQCHIP kvm will call synchronize_srcu_expedited() and
> > might_sleep and kworker of srcu may cost some delay during this period.
>
> might_sleep() yielding is not justification for changing KVM.  That's more or
> less saying "my task got preempted and took longer to run".  Well, yeah.

Agree. But I suppose it may be one of the reasons that makes the time of
KVM_CAP_SPLIT_IRQCHIP delayed, of course, the kworker has the biggest
suspicion :)

>
> > Since this happens during creating vm, it's no need to synchronize srcu
> > now 'cause everything is not ready(vcpu/irqfd) and none uses irq_srcu now.

....

> And on x86, I'm pretty sure as of commit 654f1f13ea56 ("kvm: Check irqchip mode
> before assign irqfd"), which added kvm_arch_irqfd_allowed(), it's impossible for
> kvm_irq_map_gsi() to encounter a NULL irq_routing _on x86_.
>
> But I strongly suspect other architectures can reach kvm_irq_map_gsi() with a
> NULL irq_routing, e.g. RISC-V dynamically configures its interrupt controller,
> yet doesn't implement kvm_arch_intc_initialized().
>
> So instead of special casing x86, what if we instead have KVM setup an empty
> IRQ routing table during kvm_create_vm(), and then avoid this mess entirely?
> That way x86 and s390 no longer need to set empty/dummy routing when creating
> an IRQCHIP, and the worst case scenario of userspace misusing an ioctl() is no
> longer a NULL pointer deref.

To set up an empty IRQ routing table during kvm_create_vm() sounds like a good
idea; at this time no vCPU has been created and kvm->lock is held, so skipping
synchronization is safe here.

However, there is one drawback: if the vmm wants to emulate the irqchip itself,
e.g. qemu with command line '-machine kernel-irqchip=off' may not need an
irqchip in kernel. How do we handle this issue?

---
Best wishes
Yi Wang
On 15.01.24 at 17:01, Yi Wang wrote:
> Many thanks for your kind and detailed reply, Sean!
>
> On Sat, Jan 13, 2024 at 12:28 AM Sean Christopherson <seanjc@google.com> wrote:
>>
>> +other KVM maintainers
>>
>> On Fri, Jan 12, 2024, Yi Wang wrote:
>>> From: Yi Wang <foxywang@tencent.com>
>>>
>>> We found that it may cost more than 20 milliseconds very accidentally
>>> to enable cap of KVM_CAP_SPLIT_IRQCHIP on a host which has many vms
>>> already.
>>>
>>> The reason is that when vmm(qemu/CloudHypervisor) invokes
>>> KVM_CAP_SPLIT_IRQCHIP kvm will call synchronize_srcu_expedited() and
>>> might_sleep and kworker of srcu may cost some delay during this period.
>>
>> might_sleep() yielding is not justification for changing KVM.  That's more or
>> less saying "my task got preempted and took longer to run".  Well, yeah.
>
> Agree. But I suppose it may be one of the reasons that makes the time of
> KVM_CAP_SPLIT_IRQCHIP delayed, of course, the kworker has the biggest
> suspicion :)
>
>>
>>> Since this happens during creating vm, it's no need to synchronize srcu
>>> now 'cause everything is not ready(vcpu/irqfd) and none uses irq_srcu now.
>
> ....
>
>> And on x86, I'm pretty sure as of commit 654f1f13ea56 ("kvm: Check irqchip mode
>> before assign irqfd"), which added kvm_arch_irqfd_allowed(), it's impossible for
>> kvm_irq_map_gsi() to encounter a NULL irq_routing _on x86_.
>>
>> But I strongly suspect other architectures can reach kvm_irq_map_gsi() with a
>> NULL irq_routing, e.g. RISC-V dynamically configures its interrupt controller,
>> yet doesn't implement kvm_arch_intc_initialized().
>>
>> So instead of special casing x86, what if we instead have KVM setup an empty
>> IRQ routing table during kvm_create_vm(), and then avoid this mess entirely?
>> That way x86 and s390 no longer need to set empty/dummy routing when creating
>> an IRQCHIP, and the worst case scenario of userspace misusing an ioctl() is no
>> longer a NULL pointer deref.

Sounds like a good idea. This should also speed up guest creation on s390 since
it would avoid one synchronize_srcu.

> To set up an empty IRQ routing table during kvm_create_vm() sounds like a good
> idea; at this time no vCPU has been created and kvm->lock is held, so skipping
> synchronization is safe here.
>
> However, there is one drawback: if the vmm wants to emulate the irqchip itself,
> e.g. qemu with command line '-machine kernel-irqchip=off' may not need an
> irqchip in kernel. How do we handle this issue?

I would be fine with wasted memory. The only question is does it have a
functional impact or can we simply ignore the dummy routing.
On Tue, Jan 16, 2024, Christian Borntraeger wrote:
>
>
> On 15.01.24 at 17:01, Yi Wang wrote:
> > Many thanks for your kind and detailed reply, Sean!
> >
> > On Sat, Jan 13, 2024 at 12:28 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > +other KVM maintainers
> > >
> > > On Fri, Jan 12, 2024, Yi Wang wrote:
> > > > From: Yi Wang <foxywang@tencent.com>
> > > >
> > > > We found that it may cost more than 20 milliseconds very accidentally
> > > > to enable cap of KVM_CAP_SPLIT_IRQCHIP on a host which has many vms
> > > > already.
> > > >
> > > > The reason is that when vmm(qemu/CloudHypervisor) invokes
> > > > KVM_CAP_SPLIT_IRQCHIP kvm will call synchronize_srcu_expedited() and
> > > > might_sleep and kworker of srcu may cost some delay during this period.
> > >
> > > might_sleep() yielding is not justification for changing KVM.  That's more or
> > > less saying "my task got preempted and took longer to run".  Well, yeah.
> >
> > Agree. But I suppose it may be one of the reasons that makes the time of
> > KVM_CAP_SPLIT_IRQCHIP delayed, of course, the kworker has the biggest
> > suspicion :)
> >
> > >
> > > > Since this happens during creating vm, it's no need to synchronize srcu
> > > > now 'cause everything is not ready(vcpu/irqfd) and none uses irq_srcu now.
> >
> > ....
> >
> > > And on x86, I'm pretty sure as of commit 654f1f13ea56 ("kvm: Check irqchip mode
> > > before assign irqfd"), which added kvm_arch_irqfd_allowed(), it's impossible for
> > > kvm_irq_map_gsi() to encounter a NULL irq_routing _on x86_.
> > >
> > > But I strongly suspect other architectures can reach kvm_irq_map_gsi() with a
> > > NULL irq_routing, e.g. RISC-V dynamically configures its interrupt controller,
> > > yet doesn't implement kvm_arch_intc_initialized().
> > >
> > > So instead of special casing x86, what if we instead have KVM setup an empty
> > > IRQ routing table during kvm_create_vm(), and then avoid this mess entirely?
> > > That way x86 and s390 no longer need to set empty/dummy routing when creating
> > > an IRQCHIP, and the worst case scenario of userspace misusing an ioctl() is no
> > > longer a NULL pointer deref.
>
> Sounds like a good idea. This should also speed up guest creation on s390 since
> it would avoid one synchronize_srcu.
>
> > To set up an empty IRQ routing table during kvm_create_vm() sounds like a good
> > idea; at this time no vCPU has been created and kvm->lock is held, so skipping
> > synchronization is safe here.
> >
> > However, there is one drawback: if the vmm wants to emulate the irqchip itself,
> > e.g. qemu with command line '-machine kernel-irqchip=off' may not need an
> > irqchip in kernel. How do we handle this issue?
>
> I would be fine with wasted memory.

+1.  If we really, really want to avoid the negligible memory overhead, we could
pre-configure a static global table and directly use that as the dummy table (and
exempt it from being freed by free_irq_routing_table()).

> The only question is does it have a functional impact or can we simply ignore
> the dummy routing.

Given the lack of sanity checks on kvm->irq_routing, I'm pretty sure the only way
for there to be functional impact is if there's a latent NULL pointer deref hiding
somewhere.
On Wed, Jan 17, 2024 at 12:50 AM Christian Borntraeger
<borntraeger@linux.ibm.com> wrote:
>
>
> On 15.01.24 at 17:01, Yi Wang wrote:
> > Many thanks for your kind and detailed reply, Sean!
>
> ....
> >>
> >> So instead of special casing x86, what if we instead have KVM setup an empty
> >> IRQ routing table during kvm_create_vm(), and then avoid this mess entirely?
> >> That way x86 and s390 no longer need to set empty/dummy routing when creating
> >> an IRQCHIP, and the worst case scenario of userspace misusing an ioctl() is no
> >> longer a NULL pointer deref.
>
> Sounds like a good idea. This should also speed up guest creation on s390 since
> it would avoid one synchronize_srcu.
>
> > To set up an empty IRQ routing table during kvm_create_vm() sounds like a good
> > idea; at this time no vCPU has been created and kvm->lock is held, so skipping
> > synchronization is safe here.
> >
> > However, there is one drawback: if the vmm wants to emulate the irqchip itself,
> > e.g. qemu with command line '-machine kernel-irqchip=off' may not need an
> > irqchip in kernel. How do we handle this issue?
>
> I would be fine with wasted memory. The only question is does it have a
> functional impact or can we simply ignore the dummy routing.
>

Thanks for your reply, I will update the patch.
On Wed, Jan 17, 2024 at 12:58 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Jan 16, 2024, Christian Borntraeger wrote:
> >
> > ....
> >
> > I would be fine with wasted memory.
>
> +1.  If we really, really want to avoid the negligible memory overhead, we could
> pre-configure a static global table and directly use that as the dummy table (and
> exempt it from being freed by free_irq_routing_table()).

Thanks for the suggestion! Well, in my opinion it may be better to fix the
current issue first, and I'm glad to send another patch to optimize this.

> > The only question is does it have a functional impact or can we simply ignore
> > the dummy routing.
>
> Given the lack of sanity checks on kvm->irq_routing, I'm pretty sure the only way
> for there to be functional impact is if there's a latent NULL pointer deref hiding
> somewhere.
diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
index 16d076a1b91a..37c92b7486c7 100644
--- a/arch/x86/kvm/irq_comm.c
+++ b/arch/x86/kvm/irq_comm.c
@@ -394,7 +394,7 @@ static const struct kvm_irq_routing_entry empty_routing[] = {};
 
 int kvm_setup_empty_irq_routing(struct kvm *kvm)
 {
-	return kvm_set_irq_routing(kvm, empty_routing, 0, 0);
+	return kvm_set_irq_routing(kvm, empty_routing, 0, NONEED_SYNC_SRCU);
 }
 
 void kvm_arch_post_irq_routing_update(struct kvm *kvm)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4944136efaa2..a46370cca355 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1995,6 +1995,8 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
 
 #define KVM_MAX_IRQ_ROUTES 4096 /* might need extension/rework in the future */
 
+#define NONEED_SYNC_SRCU (1U << 0)
+
 bool kvm_arch_can_set_irq_routing(struct kvm *kvm);
 int kvm_set_irq_routing(struct kvm *kvm,
 			const struct kvm_irq_routing_entry *entries,
diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
index 1e567d1f6d3d..cea5c43c1a49 100644
--- a/virt/kvm/irqchip.c
+++ b/virt/kvm/irqchip.c
@@ -224,7 +224,8 @@ int kvm_set_irq_routing(struct kvm *kvm,
 
 	kvm_arch_post_irq_routing_update(kvm);
 
-	synchronize_srcu_expedited(&kvm->irq_srcu);
+	if (!(flags & NONEED_SYNC_SRCU))
+		synchronize_srcu_expedited(&kvm->irq_srcu);
 
 	new = old;
 	r = 0;