Message ID | 20250201011400.669483-2-seanjc@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM: x86/xen: Restrict hypercall MSR index | expand |
On 01/02/2025 01:13, Sean Christopherson wrote: > Reject userspace attempts to set the Xen hypercall page MSR to an index > outside of the "standard" virtualization range [0x40000000, 0x4fffffff], > as KVM is not equipped to handle collisions with real MSRs, e.g. KVM > doesn't update MSR interception, conflicts with VMCS/VMCB fields, special > case writes in KVM, etc. > > Allowing userspace to redirect any MSR write can also be used to attack > the kernel, as kvm_xen_write_hypercall_page() takes multiple locks and > writes to guest memory. E.g. if userspace sets the MSR to MSR_IA32_XSS, > KVM's write to MSR_IA32_XSS during vCPU creation will trigger an SRCU > violation due to writing guest memory: > > ============================= > WARNING: suspicious RCU usage > 6.13.0-rc3 > ----------------------------- > include/linux/kvm_host.h:1046 suspicious rcu_dereference_check() usage! > > stack backtrace: > CPU: 6 UID: 1000 PID: 1101 Comm: repro Not tainted 6.13.0-rc3 > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 > Call Trace: > <TASK> > dump_stack_lvl+0x7f/0x90 > lockdep_rcu_suspicious+0x176/0x1c0 > kvm_vcpu_gfn_to_memslot+0x259/0x280 > kvm_vcpu_write_guest+0x3a/0xa0 > kvm_xen_write_hypercall_page+0x268/0x300 > kvm_set_msr_common+0xc44/0x1940 > vmx_set_msr+0x9db/0x1fc0 > kvm_vcpu_reset+0x857/0xb50 > kvm_arch_vcpu_create+0x37e/0x4d0 > kvm_vm_ioctl+0x669/0x2100 > __x64_sys_ioctl+0xc1/0xf0 > do_syscall_64+0xc5/0x210 > entry_SYSCALL_64_after_hwframe+0x4b/0x53 > RIP: 0033:0x7feda371b539 > > While the MSR index isn't strictly ABI, i.e. can theoretically float to > any value, in practice no known VMM sets the MSR index to anything other > than 0x40000000 or 0x40000200. > > Reported-by: syzbot+cdeaeec70992eca2d920@syzkaller.appspotmail.com > Closes: https://lore.kernel.org/all/679258d4.050a0220.2eae65.000a.GAE@google.com > Cc: Joao Martins <joao.m.martins@oracle.com> > Cc: Paul Durrant <paul@xen.org> > Cc: David Woodhouse <dwmw@amazon.co.uk> > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > arch/x86/kvm/xen.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > Reviewed-by: Paul Durrant <paul@xen.org>
On Fri, 2025-01-31 at 17:13 -0800, Sean Christopherson wrote: > --- a/arch/x86/kvm/xen.c > +++ b/arch/x86/kvm/xen.c > @@ -1324,6 +1324,14 @@ int kvm_xen_hvm_config(struct kvm *kvm, struct kvm_xen_hvm_config *xhc) > xhc->blob_size_32 || xhc->blob_size_64)) > return -EINVAL; > > + /* > + * Restrict the MSR to the range that is unofficially reserved for > + * synthetic, virtualization-defined MSRs, e.g. to prevent confusing > + * KVM by colliding with a real MSR that requires special handling. > + */ > + if (xhc->msr && (xhc->msr < 0x40000000 || xhc->msr > 0x4fffffff)) > + return -EINVAL; > + > mutex_lock(&kvm->arch.xen.xen_lock); > > if (xhc->msr && !kvm->arch.xen_hvm_config.msr) I'd prefer to see #defines for those magic values. Especially as there is a corresponding requirement that they never be set from host context (which is where the potential locking issues come in). Which train of thought leads me to ponder this as an alternative (or additional) solution: --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3733,7 +3733,13 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) u32 msr = msr_info->index; u64 data = msr_info->data; - if (msr && msr == vcpu->kvm->arch.xen_hvm_config.msr) + /* + * Do not allow host-initiated writes to trigger the Xen hypercall + * page setup; it could incur locking paths which are not expected + * if userspace sets the MSR in an unusual location. + */ + if (msr && msr == vcpu->kvm->arch.xen_hvm_config.msr && + !msr_info->host_initiated) return kvm_xen_write_hypercall_page(vcpu, data); switch (msr) {
On Wed, Feb 05, 2025, David Woodhouse wrote: > On Fri, 2025-01-31 at 17:13 -0800, Sean Christopherson wrote: > > --- a/arch/x86/kvm/xen.c > > +++ b/arch/x86/kvm/xen.c > > @@ -1324,6 +1324,14 @@ int kvm_xen_hvm_config(struct kvm *kvm, struct kvm_xen_hvm_config *xhc) > > xhc->blob_size_32 || xhc->blob_size_64)) > > return -EINVAL; > > > > + /* > > + * Restrict the MSR to the range that is unofficially reserved for > > + * synthetic, virtualization-defined MSRs, e.g. to prevent confusing > > + * KVM by colliding with a real MSR that requires special handling. > > + */ > > + if (xhc->msr && (xhc->msr < 0x40000000 || xhc->msr > 0x4fffffff)) > > + return -EINVAL; > > + > > mutex_lock(&kvm->arch.xen.xen_lock); > > > > if (xhc->msr && !kvm->arch.xen_hvm_config.msr) > > I'd prefer to see #defines for those magic values. Can do. Hmm, and since this would be visible to userspace, arguably the #defines should go in arch/x86/include/uapi/asm/kvm.h > Especially as there is a corresponding requirement that they never be set > from host context (which is where the potential locking issues come in). > Which train of thought leads me to ponder this as an alternative (or > additional) solution: > > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -3733,7 +3733,13 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > u32 msr = msr_info->index; > u64 data = msr_info->data; > > - if (msr && msr == vcpu->kvm->arch.xen_hvm_config.msr) > + /* > + * Do not allow host-initiated writes to trigger the Xen hypercall > + * page setup; it could incur locking paths which are not expected > + * if userspace sets the MSR in an unusual location. That's just as likely to break userspace. Doing a save/restore on the MSR doesn't make a whole lot of sense since it's effectively a "command" MSR, but IMO it's not any less likely than userspace putting the MSR index outside of the synthetic range. Side topic, upstream QEMU doesn't even appear to put the MSR at the Hyper-V address. It tells the guest that's where the MSR is located, but the config passed to KVM still uses the default. /* Hypercall MSR base address */ if (hyperv_enabled(cpu)) { c->ebx = XEN_HYPERCALL_MSR_HYPERV; kvm_xen_init(cs->kvm_state, c->ebx); } else { c->ebx = XEN_HYPERCALL_MSR; } ... /* hyperv_enabled() doesn't work yet. */ uint32_t msr = XEN_HYPERCALL_MSR; ret = kvm_xen_init(s, msr); if (ret < 0) { return ret; } Userspace breakage aside, disallowng host writes would fix the immediate issue, and I think would mitigate all concerns with putting the host at risk. But it's not enough to actually make an overlapping MSR index work. E.g. if the MSR is passed through to the guest, the write will go through to the hardware MSR, unless the WRMSR happens to be emulated. I really don't want to broadly support redirecting any MSR, because to truly go down that path we'd need to deal with x2APIC, EFER, and other MSRs that have special treatment and meaning. While KVM's stance is usually that a misconfigured vCPU model is userspace's problem, in this case I don't see any value in letting userspace be stupid. It can't work generally, it creates unique ABI for KVM_SET_MSRS, and unless there's a crazy use case I'm overlooking, there's no sane reason for userspace to put the index in outside of the synthetic range (whereas defining seemingly nonsensical CPUID feature bits is useful for testing purposes, implementing support in userspace, etc).
On Wed, 2025-02-05 at 07:06 -0800, Sean Christopherson wrote: > On Wed, Feb 05, 2025, David Woodhouse wrote: > > On Fri, 2025-01-31 at 17:13 -0800, Sean Christopherson wrote: > > > --- a/arch/x86/kvm/xen.c > > > +++ b/arch/x86/kvm/xen.c > > > @@ -1324,6 +1324,14 @@ int kvm_xen_hvm_config(struct kvm *kvm, struct kvm_xen_hvm_config *xhc) > > > xhc->blob_size_32 || xhc->blob_size_64)) > > > return -EINVAL; > > > > > > + /* > > > + * Restrict the MSR to the range that is unofficially reserved for > > > + * synthetic, virtualization-defined MSRs, e.g. to prevent confusing > > > + * KVM by colliding with a real MSR that requires special handling. > > > + */ > > > + if (xhc->msr && (xhc->msr < 0x40000000 || xhc->msr > 0x4fffffff)) > > > + return -EINVAL; > > > + > > > mutex_lock(&kvm->arch.xen.xen_lock); > > > > > > if (xhc->msr && !kvm->arch.xen_hvm_config.msr) > > > > I'd prefer to see #defines for those magic values. > > Can do. Hmm, and since this would be visible to userspace, arguably the #defines > should go in arch/x86/include/uapi/asm/kvm.h Thanks. > > Especially as there is a corresponding requirement that they never be set > > from host context (which is where the potential locking issues come in). > > Which train of thought leads me to ponder this as an alternative (or > > additional) solution: > > > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -3733,7 +3733,13 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > > u32 msr = msr_info->index; > > u64 data = msr_info->data; > > > > - if (msr && msr == vcpu->kvm->arch.xen_hvm_config.msr) > > + /* > > + * Do not allow host-initiated writes to trigger the Xen hypercall > > + * page setup; it could incur locking paths which are not expected > > + * if userspace sets the MSR in an unusual location. > > That's just as likely to break userspace. Doing a save/restore on the MSR doesn't > make a whole lot of sense since it's effectively a "command" MSR, but IMO it's not > any less likely than userspace putting the MSR index outside of the synthetic range. Save/restore on the MSR makes no sense. It's a write-only MSR; writing to it has no effect *other* than populating the target page. In KVM we don't implement reading from it at all; I don't think Xen does either? And even if it was readable and would rather pointlessly return the last value written to it, save/restore arguably shouldn't actually trigger the guest memory to be overwritten again. The hypercall page should only be populated when the *guest* writes the MSR. With the recent elimination of the hypercall page from Linux Xen guests, we've suggested that Linux should still set up the hypercall page early (as it *does* have the side-effect of letting Xen know that the guest is 64-bit). And then just free the page without ever using it. We absolutely would not want a save/restore to scribble on that page again. I'm absolutely not worried about breaking userspace with such a change to make the hypercall page MSR only work when !host_initiated. In fact I think it's probably the right thing to do *anyway*. If userspace wants to write to guest memory, it can do that anyway; it doesn't need to ask the *kernel* to do it. > Side topic, upstream QEMU doesn't even appear to put the MSR at the Hyper-V > address. It tells the guest that's where the MSR is located, but the config > passed to KVM still uses the default. > > /* Hypercall MSR base address */ > if (hyperv_enabled(cpu)) { > c->ebx = XEN_HYPERCALL_MSR_HYPERV; > kvm_xen_init(cs->kvm_state, c->ebx); > } else { > c->ebx = XEN_HYPERCALL_MSR; > } > > ... > > /* hyperv_enabled() doesn't work yet. */ > uint32_t msr = XEN_HYPERCALL_MSR; > ret = kvm_xen_init(s, msr); > if (ret < 0) { > return ret; > } > Those two happen in reverse chronological order, don't they? And in the lower one the comment tells you that hyperv_enabled() doesn't work yet. When the higher one is called later, it calls kvm_xen_init() *again* to put the MSR in the right place. It could be prettier, but I don't think it's broken, is it? > Userspace breakage aside, disallowng host writes would fix the immediate issue, > and I think would mitigate all concerns with putting the host at risk. But it's > not enough to actually make an overlapping MSR index work. E.g. if the MSR is > passed through to the guest, the write will go through to the hardware MSR, unless > the WRMSR happens to be emulated. > > I really don't want to broadly support redirecting any MSR, because to truly go > down that path we'd need to deal with x2APIC, EFER, and other MSRs that have > special treatment and meaning. > > While KVM's stance is usually that a misconfigured vCPU model is userspace's > problem, in this case I don't see any value in letting userspace be stupid. It > can't work generally, it creates unique ABI for KVM_SET_MSRS, and unless there's > a crazy use case I'm overlooking, there's no sane reason for userspace to put the > index in outside of the synthetic range (whereas defining seemingly nonsensical > CPUID feature bits is useful for testing purposes, implementing support in > userspace, etc). Right, I think we should do *both*. Blocking host writes solves the issue of locking problems with the hypercall page setup. All it would take for that issue to recur is for us (or Microsoft) to invent a new MSR in the synthetic range which is also written on vCPU init/reset. And then the sanity check on where the VMM puts the Xen MSR doesn't save us. But yes, we should *also* do that sanity check.
On Wed, Feb 05, 2025, David Woodhouse wrote: > On Wed, 2025-02-05 at 07:06 -0800, Sean Christopherson wrote: > > On Wed, Feb 05, 2025, David Woodhouse wrote: > > > Especially as there is a corresponding requirement that they never be set > > > from host context (which is where the potential locking issues come in). > > > Which train of thought leads me to ponder this as an alternative (or > > > additional) solution: > > > > > > --- a/arch/x86/kvm/x86.c > > > +++ b/arch/x86/kvm/x86.c > > > @@ -3733,7 +3733,13 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > > > u32 msr = msr_info->index; > > > u64 data = msr_info->data; > > > > > > - if (msr && msr == vcpu->kvm->arch.xen_hvm_config.msr) > > > + /* > > > + * Do not allow host-initiated writes to trigger the Xen hypercall > > > + * page setup; it could incur locking paths which are not expected > > > + * if userspace sets the MSR in an unusual location. > > > > That's just as likely to break userspace. Doing a save/restore on the MSR doesn't > > make a whole lot of sense since it's effectively a "command" MSR, but IMO it's not > > any less likely than userspace putting the MSR index outside of the synthetic range. > > Save/restore on the MSR makes no sense. It's a write-only MSR; writing > to it has no effect *other* than populating the target page. In KVM we > don't implement reading from it at all; I don't think Xen does either? Hah, that's another KVM bug, technically. KVM relies on the MSR not being handled in order to generate the write-only semantics, but if the MSR index collides with an MSR that KVM emulates, then the MSR would be readable. KVM supports Hyper-V's HV_X64_MSR_TSC_INVARIANT_CONTROL (0x40000118), so just a few hundred more MSRs until fireworks :-) If we want to close that hole, it'd be easy enough to add a check in kvm_get_msr_common(). > Those two happen in reverse chronological order, don't they? And in the > lower one the comment tells you that hyperv_enabled() doesn't work yet. > When the higher one is called later, it calls kvm_xen_init() *again* to > put the MSR in the right place. > > It could be prettier, but I don't think it's broken, is it? Gah, -ENOCOFFEE. > > Userspace breakage aside, disallowng host writes would fix the immediate issue, > > and I think would mitigate all concerns with putting the host at risk. But it's > > not enough to actually make an overlapping MSR index work. E.g. if the MSR is > > passed through to the guest, the write will go through to the hardware MSR, unless > > the WRMSR happens to be emulated. > > > > I really don't want to broadly support redirecting any MSR, because to truly go > > down that path we'd need to deal with x2APIC, EFER, and other MSRs that have > > special treatment and meaning. > > > > While KVM's stance is usually that a misconfigured vCPU model is userspace's > > problem, in this case I don't see any value in letting userspace be stupid. It > > can't work generally, it creates unique ABI for KVM_SET_MSRS, and unless there's > > a crazy use case I'm overlooking, there's no sane reason for userspace to put the > > index in outside of the synthetic range (whereas defining seemingly nonsensical > > CPUID feature bits is useful for testing purposes, implementing support in > > userspace, etc). > > Right, I think we should do *both*. Blocking host writes solves the > issue of locking problems with the hypercall page setup. All it would > take for that issue to recur is for us (or Microsoft) to invent a new > MSR in the synthetic range which is also written on vCPU init/reset. > And then the sanity check on where the VMM puts the Xen MSR doesn't > save us. Ugh, indeed. MSRs are quite the conundrum. Userspace MSR filters have a similar problem, where it's impossible to know the semantics of future hardware MSRs, and so it's impossible to document which MSRs userspace is allowed to intercept :-/ Oh! It doesn't help KVM avoid breaking userspace, but a way for QEMU to avoid a future collision would be to have QEMU start at 0x40000200 when Hyper-V is enabled, but then use KVM_GET_MSR_INDEX_LIST to detect a collision with KVM Hyper-V, e.g. increment the index until an available index is found (with sanity checks and whatnot). > But yes, we should *also* do that sanity check. Ah, I'm a-ok with that.
On 5 February 2025 15:51:01 GMT, Sean Christopherson <seanjc@google.com> wrote: >On Wed, Feb 05, 2025, David Woodhouse wrote: >> >> Save/restore on the MSR makes no sense. It's a write-only MSR; writing >> to it has no effect *other* than populating the target page. In KVM we >> don't implement reading from it at all; I don't think Xen does either? > >Hah, that's another KVM bug, technically. KVM relies on the MSR not being handled >in order to generate the write-only semantics, but if the MSR index collides with >an MSR that KVM emulates, then the MSR would be readable. KVM supports Hyper-V's >HV_X64_MSR_TSC_INVARIANT_CONTROL (0x40000118), so just a few hundred more MSRs >until fireworks :-) Nah, I don't see that as a bug. If there's a conflict then the Xen hypercall MSR "steals" writes from the one it conflicts with, sure. But since it's a write-only MSR the conflicting one still works fine for reads. Which means that Xen can "conflict" with a read-only MSR and nobody cares; arguably there's no bug at all in that case. >> Those two happen in reverse chronological order, don't they? And in the >> lower one the comment tells you that hyperv_enabled() doesn't work yet. >> When the higher one is called later, it calls kvm_xen_init() *again* to >> put the MSR in the right place. >> >> It could be prettier, but I don't think it's broken, is it? > >Gah, -ENOCOFFEE. I'll take the criticism though; that code is distinctly non-obvious, even with that hint in the comment about hyperv_enabled() not being usable yet. ISTR we needed to do the Xen init early on, even before we knew precisely which MSR to use. And that's why we do it that way and then just call the function again later if we need to change the MSR. I'll see if that can be simplified, and at the very least I'll update the existing comment to explicitly state that the function will get called again later if needed. You shouldn't *need* coffee to understand the code. >> > Userspace breakage aside, disallowng host writes would fix the immediate issue, >> > and I think would mitigate all concerns with putting the host at risk. But it's >> > not enough to actually make an overlapping MSR index work. E.g. if the MSR is >> > passed through to the guest, the write will go through to the hardware MSR, unless >> > the WRMSR happens to be emulated. >> > >> > I really don't want to broadly support redirecting any MSR, because to truly go >> > down that path we'd need to deal with x2APIC, EFER, and other MSRs that have >> > special treatment and meaning. >> > >> > While KVM's stance is usually that a misconfigured vCPU model is userspace's >> > problem, in this case I don't see any value in letting userspace be stupid. It >> > can't work generally, it creates unique ABI for KVM_SET_MSRS, and unless there's >> > a crazy use case I'm overlooking, there's no sane reason for userspace to put the >> > index in outside of the synthetic range (whereas defining seemingly nonsensical >> > CPUID feature bits is useful for testing purposes, implementing support in >> > userspace, etc). >> >> Right, I think we should do *both*. Blocking host writes solves the >> issue of locking problems with the hypercall page setup. All it would >> take for that issue to recur is for us (or Microsoft) to invent a new >> MSR in the synthetic range which is also written on vCPU init/reset. >> And then the sanity check on where the VMM puts the Xen MSR doesn't >> save us. > >Ugh, indeed. MSRs are quite the conundrum. Userspace MSR filters have a similar >problem, where it's impossible to know the semantics of future hardware MSRs, and >so it's impossible to document which MSRs userspace is allowed to intercept :-/ > >Oh! It doesn't help KVM avoid breaking userspace, but a way for QEMU to avoid a >future collision would be to have QEMU start at 0x40000200 when Hyper-V is enabled, >but then use KVM_GET_MSR_INDEX_LIST to detect a collision with KVM Hyper-V, e.g. >increment the index until an available index is found (with sanity checks and whatnot). Makes sense. I think that's a third separate patch, yes? >> But yes, we should *also* do that sanity check. > >Ah, I'm a-ok with that.
On Wed, 2025-02-05 at 16:18 +0000, David Woodhouse wrote: > > > Oh! It doesn't help KVM avoid breaking userspace, but a way for QEMU to avoid a > > future collision would be to have QEMU start at 0x40000200 when Hyper-V is enabled, > > but then use KVM_GET_MSR_INDEX_LIST to detect a collision with KVM Hyper-V, e.g. > > increment the index until an available index is found (with sanity checks and whatnot). > > Makes sense. I think that's a third separate patch, yes? To be clear, I think I mean a third patch which further restricts kvm_xen_hvm_config() to disallow indices for which kvm_is_advertised_msr() returns true? We could roll that into your original patch instead, if you prefer. Q: Should kvm_is_advertised_msr() include the Xen hypercall MSR, if one is already configured? Life is easier if we answer 'no'...
On Wed, Feb 05, 2025, David Woodhouse wrote: > On Wed, 2025-02-05 at 16:18 +0000, David Woodhouse wrote: > > > > > Oh! It doesn't help KVM avoid breaking userspace, but a way for QEMU to avoid a > > > future collision would be to have QEMU start at 0x40000200 when Hyper-V is enabled, > > > but then use KVM_GET_MSR_INDEX_LIST to detect a collision with KVM Hyper-V, e.g. > > > increment the index until an available index is found (with sanity checks and whatnot). > > > > Makes sense. I think that's a third separate patch, yes? > > To be clear, I think I mean a third patch which further restricts > kvm_xen_hvm_config() to disallow indices for which > kvm_is_advertised_msr() returns true? > > We could roll that into your original patch instead, if you prefer. Nah, I like the idea of separate patch. > Q: Should kvm_is_advertised_msr() include the Xen hypercall MSR, if one > is already configured? Life is easier if we answer 'no'... No :-) The idea with kvm_is_advertised_msr() is to ignore accesses to MSRs that don't exist according the to vCPU model, but that KVM advertised to userspace (via KVM_GET_MSR_INDEX_LIST) and so may be saved/restored by a naive/unoptimized userspace. For the Xen MSR, KVM never advertises the MSR, and IIUC, KVM will never treat the MSR as non-existent because defining the MSR brings it into existence.
On Wed, 2025-02-05 at 07:51 -0800, Sean Christopherson wrote: > > > Those two happen in reverse chronological order, don't they? And in the > > lower one the comment tells you that hyperv_enabled() doesn't work yet. > > When the higher one is called later, it calls kvm_xen_init() *again* to > > put the MSR in the right place. > > > > It could be prettier, but I don't think it's broken, is it? > > Gah, -ENOCOFFEE. I trust this version would require less coffee to parse... From a90c2df0bd9589609085dd42f94b61de1bf48eb7 Mon Sep 17 00:00:00 2001 From: David Woodhouse <dwmw@amazon.co.uk> Date: Thu, 6 Feb 2025 08:52:52 +0000 Subject: [PATCH] i386/xen: Move KVM_XEN_HVM_CONFIG ioctl to kvm_xen_init_vcpu() At the time kvm_xen_init() is called, hyperv_enabled() doesn't yet work, so the correct MSR index to use for the hypercall page isn't known. Rather than setting it to the default and then shifting it later for the Hyper-V case with a confusing second call to kvm_init_xen(), just do it once in kvm_xen_init_vcpu(). Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> --- target/i386/kvm/kvm.c | 16 +++++--------- target/i386/kvm/xen-emu.c | 44 ++++++++++++++++++++------------------- target/i386/kvm/xen-emu.h | 4 ++-- 3 files changed, 30 insertions(+), 34 deletions(-) diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index 6c749d4ee8..b4deec6255 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -2129,6 +2129,8 @@ int kvm_arch_init_vcpu(CPUState *cs) if (cs->kvm_state->xen_version) { #ifdef CONFIG_XEN_EMU struct kvm_cpuid_entry2 *xen_max_leaf; + uint32_t hypercall_msr = + hyperv_enabled(cpu) ? XEN_HYPERCALL_MSR_HYPERV : XEN_HYPERCALL_MSR; memcpy(signature, "XenVMMXenVMM", 12); @@ -2150,13 +2152,7 @@ int kvm_arch_init_vcpu(CPUState *cs) c->function = kvm_base + XEN_CPUID_HVM_MSR; /* Number of hypercall-transfer pages */ c->eax = 1; - /* Hypercall MSR base address */ - if (hyperv_enabled(cpu)) { - c->ebx = XEN_HYPERCALL_MSR_HYPERV; - kvm_xen_init(cs->kvm_state, c->ebx); - } else { - c->ebx = XEN_HYPERCALL_MSR; - } + c->ebx = hypercall_msr; c->ecx = 0; c->edx = 0; @@ -2194,7 +2190,7 @@ int kvm_arch_init_vcpu(CPUState *cs) } } - r = kvm_xen_init_vcpu(cs); + r = kvm_xen_init_vcpu(cs, hypercall_msr); if (r) { return r; } @@ -3245,9 +3241,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s) error_report("kvm: Xen support only available in PC machine"); return -ENOTSUP; } - /* hyperv_enabled() doesn't work yet. */ - uint32_t msr = XEN_HYPERCALL_MSR; - ret = kvm_xen_init(s, msr); + ret = kvm_xen_init(s); if (ret < 0) { return ret; } diff --git a/target/i386/kvm/xen-emu.c b/target/i386/kvm/xen-emu.c index e81a245881..1144a6efcd 100644 --- a/target/i386/kvm/xen-emu.c +++ b/target/i386/kvm/xen-emu.c @@ -108,15 +108,11 @@ static inline int kvm_copy_to_gva(CPUState *cs, uint64_t gva, void *buf, return kvm_gva_rw(cs, gva, buf, sz, true); } -int kvm_xen_init(KVMState *s, uint32_t hypercall_msr) +int kvm_xen_init(KVMState *s) { const int required_caps = KVM_XEN_HVM_CONFIG_HYPERCALL_MSR | KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL | KVM_XEN_HVM_CONFIG_SHARED_INFO; - struct kvm_xen_hvm_config cfg = { - .msr = hypercall_msr, - .flags = KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL, - }; - int xen_caps, ret; + int xen_caps; xen_caps = kvm_check_extension(s, KVM_CAP_XEN_HVM); if (required_caps & ~xen_caps) { @@ -130,20 +126,6 @@ int kvm_xen_init(KVMState *s, uint32_t hypercall_msr) .u.xen_version = s->xen_version, }; (void)kvm_vm_ioctl(s, KVM_XEN_HVM_SET_ATTR, &ha); - - cfg.flags |= KVM_XEN_HVM_CONFIG_EVTCHN_SEND; - } - - ret = kvm_vm_ioctl(s, KVM_XEN_HVM_CONFIG, &cfg); - if (ret < 0) { - error_report("kvm: Failed to enable Xen HVM support: %s", - strerror(-ret)); - return ret; - } - - /* If called a second time, don't repeat the rest of the setup. */ - if (s->xen_caps) { - return 0; } /* @@ -185,10 +167,14 @@ int kvm_xen_init(KVMState *s, uint32_t hypercall_msr) return 0; } -int kvm_xen_init_vcpu(CPUState *cs) +int kvm_xen_init_vcpu(CPUState *cs, uint32_t hypercall_msr) { X86CPU *cpu = X86_CPU(cs); CPUX86State *env = &cpu->env; + struct kvm_xen_hvm_config cfg = { + .msr = hypercall_msr, + .flags = KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL, + }; int err; /* @@ -210,6 +196,22 @@ int kvm_xen_init_vcpu(CPUState *cs) strerror(-err)); return err; } + + cfg.flags |= KVM_XEN_HVM_CONFIG_EVTCHN_SEND; + } + + /* + * This is a per-KVM setting, but hyperv_enabled() can't be used + * when kvm_xen_init() is called from kvm_arch_init(), so do it + * when the BSP is initialized. + */ + if (cs->cpu_index == 0) { + err = kvm_vm_ioctl(cs->kvm_state, KVM_XEN_HVM_CONFIG, &cfg); + if (err) { + error_report("kvm: Failed to enable Xen HVM support: %s", + strerror(-err)); + return err; + } } env->xen_vcpu_info_gpa = INVALID_GPA; diff --git a/target/i386/kvm/xen-emu.h b/target/i386/kvm/xen-emu.h index fe85e0b195..7a7c72eee5 100644 --- a/target/i386/kvm/xen-emu.h +++ b/target/i386/kvm/xen-emu.h @@ -23,8 +23,8 @@ #define XEN_VERSION(maj, min) ((maj) << 16 | (min)) -int kvm_xen_init(KVMState *s, uint32_t hypercall_msr); -int kvm_xen_init_vcpu(CPUState *cs); +int kvm_xen_init(KVMState *s); +int kvm_xen_init_vcpu(CPUState *cs, uint32_t hypercall_msr); int kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit); int kvm_put_xen_state(CPUState *cs); int kvm_get_xen_state(CPUState *cs);
On Fri, 2025-01-31 at 17:13 -0800, Sean Christopherson wrote: > Reject userspace attempts to set the Xen hypercall page MSR to an index > outside of the "standard" virtualization range [0x40000000, 0x4fffffff], > as KVM is not equipped to handle collisions with real MSRs, e.g. KVM > doesn't update MSR interception, conflicts with VMCS/VMCB fields, special > case writes in KVM, etc. > > Allowing userspace to redirect any MSR write can also be used to attack > the kernel, as kvm_xen_write_hypercall_page() takes multiple locks and > writes to guest memory. E.g. if userspace sets the MSR to MSR_IA32_XSS, > KVM's write to MSR_IA32_XSS during vCPU creation will trigger an SRCU > violation due to writing guest memory: > > ============================= > WARNING: suspicious RCU usage > 6.13.0-rc3 > ----------------------------- > include/linux/kvm_host.h:1046 suspicious rcu_dereference_check() usage! > > stack backtrace: > CPU: 6 UID: 1000 PID: 1101 Comm: repro Not tainted 6.13.0-rc3 > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 > Call Trace: > <TASK> > dump_stack_lvl+0x7f/0x90 > lockdep_rcu_suspicious+0x176/0x1c0 > kvm_vcpu_gfn_to_memslot+0x259/0x280 > kvm_vcpu_write_guest+0x3a/0xa0 > kvm_xen_write_hypercall_page+0x268/0x300 > kvm_set_msr_common+0xc44/0x1940 > vmx_set_msr+0x9db/0x1fc0 > kvm_vcpu_reset+0x857/0xb50 > kvm_arch_vcpu_create+0x37e/0x4d0 > kvm_vm_ioctl+0x669/0x2100 > __x64_sys_ioctl+0xc1/0xf0 > do_syscall_64+0xc5/0x210 > entry_SYSCALL_64_after_hwframe+0x4b/0x53 > RIP: 0033:0x7feda371b539 > > While the MSR index isn't strictly ABI, i.e. can theoretically float to > any value, in practice no known VMM sets the MSR index to anything other > than 0x40000000 or 0x40000200. > > Reported-by: syzbot+cdeaeec70992eca2d920@syzkaller.appspotmail.com > Closes: https://lore.kernel.org/all/679258d4.050a0220.2eae65.000a.GAE@google.com > Cc: Joao Martins <joao.m.martins@oracle.com> > Cc: Paul Durrant <paul@xen.org> > Cc: David Woodhouse <dwmw@amazon.co.uk> > Signed-off-by: Sean Christopherson <seanjc@google.com> With macros for the magic numbers as discussed (and a corresponding update to the documentation), and with the Reported-by: and Closes: tags dropped because they should move to the commit which makes the hypercall page only trigger for !host_initiated writes and resolves it in a more future-proof way for the general case, Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
On Wed, 2025-02-05 at 11:20 -0800, Sean Christopherson wrote: > On Wed, Feb 05, 2025, David Woodhouse wrote: > > On Wed, 2025-02-05 at 16:18 +0000, David Woodhouse wrote: > > > > > > > Oh! It doesn't help KVM avoid breaking userspace, but a way for QEMU to avoid a > > > > future collision would be to have QEMU start at 0x40000200 when Hyper-V is enabled, > > > > but then use KVM_GET_MSR_INDEX_LIST to detect a collision with KVM Hyper-V, e.g. > > > > increment the index until an available index is found (with sanity checks and whatnot). > > > > > > Makes sense. I think that's a third separate patch, yes? > > > > To be clear, I think I mean a third patch which further restricts > > kvm_xen_hvm_config() to disallow indices for which > > kvm_is_advertised_msr() returns true? > > > > We could roll that into your original patch instead, if you prefer. > > Nah, I like the idea of separate patch. Helpfully, kvm_is_advertised_msr() doesn't actually return true for MSR_IA32_XSS. Is that a bug? And kvm_vcpu_reset() attempts to set MSR_IA32_XSS even if the guest doesn't have X86_FEATURE_XSAVES. Is that a bug?
On Thu, Feb 06, 2025, David Woodhouse wrote: > On Wed, 2025-02-05 at 11:20 -0800, Sean Christopherson wrote: > > On Wed, Feb 05, 2025, David Woodhouse wrote: > > > On Wed, 2025-02-05 at 16:18 +0000, David Woodhouse wrote: > > > > > > > > > Oh! It doesn't help KVM avoid breaking userspace, but a way for QEMU to avoid a > > > > > future collision would be to have QEMU start at 0x40000200 when Hyper-V is enabled, > > > > > but then use KVM_GET_MSR_INDEX_LIST to detect a collision with KVM Hyper-V, e.g. > > > > > increment the index until an available index is found (with sanity checks and whatnot). > > > > > > > > Makes sense. I think that's a third separate patch, yes? > > > > > > To be clear, I think I mean a third patch which further restricts > > > kvm_xen_hvm_config() to disallow indices for which > > > kvm_is_advertised_msr() returns true? > > > > > > We could roll that into your original patch instead, if you prefer. > > > > Nah, I like the idea of separate patch. > > Helpfully, kvm_is_advertised_msr() doesn't actually return true for > MSR_IA32_XSS. Is that a bug? Technically, no. KVM doesn't support a non-zero XSS, yet, and so there's nothing for userspace to save/restore. But the word "yet"... > And kvm_vcpu_reset() attempts to set MSR_IA32_XSS even if the guest > doesn't have X86_FEATURE_XSAVES. Is that a bug? Ugh, sort of. Functionally, it's fine. Though it's quite the mess. The write can be straight up deleted, as the vCPU structure is zero-allocated and the CPUID side effects that using __kvm_set_msr() is intended to deal with are irrelevant because the CPUID array can't yet exist. The code came about due to an SDM bug and racing patches, and we missed that the __kvm_set_msr() would be pointless. The SDM had a bug that said XSS was cleared to '0' on INIT, and then KVM had a bug in its emulation of the buggy INIT logic where KVM open coded clearing ia32_xss, which led to stale CPUID information (XSTATE sizes weren't updated). While the patch[1] that became commit 05a9e065059e ("KVM: x86: Sync the states size with the XCR0/IA32_XSS at, any time") was in flight, Xiayoao reported the SDM bug and sent a fix[2]. I merged the two changes, but overlooked that at RESET, the CPUID array is guaranteed to be null/empty, and so calling into kvm_update_cpuid_runtime() is technically pointless. And because guest CPUID is empty, the vCPU can't possibly have X86_FEATURE_XSAVES, so gating the write on XSAVES would be even weirder and more confusing. I'm not sure what the best answer is. I'm leaning towards simply deleting the write. I'd love to have a better MSR framework in KVM, e.g. to document which MSRs are modified by INIT, but at this point I think writing '0' to an MSR during RESET (a.k.a. vCPU creation) does more harm than good. [1] https://lore.kernel.org/all/20220117082631.86143-1-likexu@tencent.com [2] https://lore.kernel.org/all/20220126034750.2495371-1-xiaoyao.li@intel.com
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index a909b817b9c0..35ecafc410f0 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -1324,6 +1324,14 @@ int kvm_xen_hvm_config(struct kvm *kvm, struct kvm_xen_hvm_config *xhc) xhc->blob_size_32 || xhc->blob_size_64)) return -EINVAL; + /* + * Restrict the MSR to the range that is unofficially reserved for + * synthetic, virtualization-defined MSRs, e.g. to prevent confusing + * KVM by colliding with a real MSR that requires special handling. + */ + if (xhc->msr && (xhc->msr < 0x40000000 || xhc->msr > 0x4fffffff)) + return -EINVAL; + mutex_lock(&kvm->arch.xen.xen_lock); if (xhc->msr && !kvm->arch.xen_hvm_config.msr)
Reject userspace attempts to set the Xen hypercall page MSR to an index outside of the "standard" virtualization range [0x40000000, 0x4fffffff], as KVM is not equipped to handle collisions with real MSRs, e.g. KVM doesn't update MSR interception, conflicts with VMCS/VMCB fields, special case writes in KVM, etc. Allowing userspace to redirect any MSR write can also be used to attack the kernel, as kvm_xen_write_hypercall_page() takes multiple locks and writes to guest memory. E.g. if userspace sets the MSR to MSR_IA32_XSS, KVM's write to MSR_IA32_XSS during vCPU creation will trigger an SRCU violation due to writing guest memory: ============================= WARNING: suspicious RCU usage 6.13.0-rc3 ----------------------------- include/linux/kvm_host.h:1046 suspicious rcu_dereference_check() usage! stack backtrace: CPU: 6 UID: 1000 PID: 1101 Comm: repro Not tainted 6.13.0-rc3 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl+0x7f/0x90 lockdep_rcu_suspicious+0x176/0x1c0 kvm_vcpu_gfn_to_memslot+0x259/0x280 kvm_vcpu_write_guest+0x3a/0xa0 kvm_xen_write_hypercall_page+0x268/0x300 kvm_set_msr_common+0xc44/0x1940 vmx_set_msr+0x9db/0x1fc0 kvm_vcpu_reset+0x857/0xb50 kvm_arch_vcpu_create+0x37e/0x4d0 kvm_vm_ioctl+0x669/0x2100 __x64_sys_ioctl+0xc1/0xf0 do_syscall_64+0xc5/0x210 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7feda371b539 While the MSR index isn't strictly ABI, i.e. can theoretically float to any value, in practice no known VMM sets the MSR index to anything other than 0x40000000 or 0x40000200. Reported-by: syzbot+cdeaeec70992eca2d920@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/679258d4.050a0220.2eae65.000a.GAE@google.com Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Paul Durrant <paul@xen.org> Cc: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Sean Christopherson <seanjc@google.com> --- arch/x86/kvm/xen.c | 8 ++++++++ 1 file changed, 8 insertions(+)