diff mbox series

[1/5] KVM: x86/xen: Restrict hypercall MSR to unofficial synthetic range

Message ID 20250201011400.669483-2-seanjc@google.com (mailing list archive)
State New, archived
Headers show
Series KVM: x86/xen: Restrict hypercall MSR index | expand

Commit Message

Sean Christopherson Feb. 1, 2025, 1:13 a.m. UTC
Reject userspace attempts to set the Xen hypercall page MSR to an index
outside of the "standard" virtualization range [0x40000000, 0x4fffffff],
as KVM is not equipped to handle collisions with real MSRs, e.g. KVM
doesn't update MSR interception, conflicts with VMCS/VMCB fields, special
case writes in KVM, etc.

Allowing userspace to redirect any MSR write can also be used to attack
the kernel, as kvm_xen_write_hypercall_page() takes multiple locks and
writes to guest memory.  E.g. if userspace sets the MSR to MSR_IA32_XSS,
KVM's write to MSR_IA32_XSS during vCPU creation will trigger an SRCU
violation due to writing guest memory:

  =============================
  WARNING: suspicious RCU usage
  6.13.0-rc3
  -----------------------------
  include/linux/kvm_host.h:1046 suspicious rcu_dereference_check() usage!

  stack backtrace:
  CPU: 6 UID: 1000 PID: 1101 Comm: repro Not tainted 6.13.0-rc3
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Call Trace:
   <TASK>
   dump_stack_lvl+0x7f/0x90
   lockdep_rcu_suspicious+0x176/0x1c0
   kvm_vcpu_gfn_to_memslot+0x259/0x280
   kvm_vcpu_write_guest+0x3a/0xa0
   kvm_xen_write_hypercall_page+0x268/0x300
   kvm_set_msr_common+0xc44/0x1940
   vmx_set_msr+0x9db/0x1fc0
   kvm_vcpu_reset+0x857/0xb50
   kvm_arch_vcpu_create+0x37e/0x4d0
   kvm_vm_ioctl+0x669/0x2100
   __x64_sys_ioctl+0xc1/0xf0
   do_syscall_64+0xc5/0x210
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
  RIP: 0033:0x7feda371b539

While the MSR index isn't strictly ABI, i.e. can theoretically float to
any value, in practice no known VMM sets the MSR index to anything other
than 0x40000000 or 0x40000200.

Reported-by: syzbot+cdeaeec70992eca2d920@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/679258d4.050a0220.2eae65.000a.GAE@google.com
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Paul Durrant <paul@xen.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/xen.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Paul Durrant Feb. 3, 2025, 9:09 a.m. UTC | #1
On 01/02/2025 01:13, Sean Christopherson wrote:
> Reject userspace attempts to set the Xen hypercall page MSR to an index
> outside of the "standard" virtualization range [0x40000000, 0x4fffffff],
> as KVM is not equipped to handle collisions with real MSRs, e.g. KVM
> doesn't update MSR interception, conflicts with VMCS/VMCB fields, special
> case writes in KVM, etc.
> 
> Allowing userspace to redirect any MSR write can also be used to attack
> the kernel, as kvm_xen_write_hypercall_page() takes multiple locks and
> writes to guest memory.  E.g. if userspace sets the MSR to MSR_IA32_XSS,
> KVM's write to MSR_IA32_XSS during vCPU creation will trigger an SRCU
> violation due to writing guest memory:
> 
>    =============================
>    WARNING: suspicious RCU usage
>    6.13.0-rc3
>    -----------------------------
>    include/linux/kvm_host.h:1046 suspicious rcu_dereference_check() usage!
> 
>    stack backtrace:
>    CPU: 6 UID: 1000 PID: 1101 Comm: repro Not tainted 6.13.0-rc3
>    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>    Call Trace:
>     <TASK>
>     dump_stack_lvl+0x7f/0x90
>     lockdep_rcu_suspicious+0x176/0x1c0
>     kvm_vcpu_gfn_to_memslot+0x259/0x280
>     kvm_vcpu_write_guest+0x3a/0xa0
>     kvm_xen_write_hypercall_page+0x268/0x300
>     kvm_set_msr_common+0xc44/0x1940
>     vmx_set_msr+0x9db/0x1fc0
>     kvm_vcpu_reset+0x857/0xb50
>     kvm_arch_vcpu_create+0x37e/0x4d0
>     kvm_vm_ioctl+0x669/0x2100
>     __x64_sys_ioctl+0xc1/0xf0
>     do_syscall_64+0xc5/0x210
>     entry_SYSCALL_64_after_hwframe+0x4b/0x53
>    RIP: 0033:0x7feda371b539
> 
> While the MSR index isn't strictly ABI, i.e. can theoretically float to
> any value, in practice no known VMM sets the MSR index to anything other
> than 0x40000000 or 0x40000200.
> 
> Reported-by: syzbot+cdeaeec70992eca2d920@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/679258d4.050a0220.2eae65.000a.GAE@google.com
> Cc: Joao Martins <joao.m.martins@oracle.com>
> Cc: Paul Durrant <paul@xen.org>
> Cc: David Woodhouse <dwmw@amazon.co.uk>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/xen.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 

Reviewed-by: Paul Durrant <paul@xen.org>
David Woodhouse Feb. 5, 2025, 9:27 a.m. UTC | #2
On Fri, 2025-01-31 at 17:13 -0800, Sean Christopherson wrote:
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -1324,6 +1324,14 @@ int kvm_xen_hvm_config(struct kvm *kvm, struct kvm_xen_hvm_config *xhc)
>  	     xhc->blob_size_32 || xhc->blob_size_64))
>  		return -EINVAL;
>  
> +	/*
> +	 * Restrict the MSR to the range that is unofficially reserved for
> +	 * synthetic, virtualization-defined MSRs, e.g. to prevent confusing
> +	 * KVM by colliding with a real MSR that requires special handling.
> +	 */
> +	if (xhc->msr && (xhc->msr < 0x40000000 || xhc->msr > 0x4fffffff))
> +		return -EINVAL;
> +
>  	mutex_lock(&kvm->arch.xen.xen_lock);
>  
>  	if (xhc->msr && !kvm->arch.xen_hvm_config.msr)

I'd prefer to see #defines for those magic values. Especially as there
is a corresponding requirement that they never be set from host context
(which is where the potential locking issues come in). Which train of
thought leads me to ponder this as an alternative (or additional)
solution:

--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3733,7 +3733,13 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
        u32 msr = msr_info->index;
        u64 data = msr_info->data;
 
-       if (msr && msr == vcpu->kvm->arch.xen_hvm_config.msr)
+       /*
+        * Do not allow host-initiated writes to trigger the Xen hypercall
+        * page setup; it could incur locking paths which are not expected
+        * if userspace sets the MSR in an unusual location.
+        */
+       if (msr && msr == vcpu->kvm->arch.xen_hvm_config.msr &&
+           !msr_info->host_initiated)
                return kvm_xen_write_hypercall_page(vcpu, data);
 
        switch (msr) {
Sean Christopherson Feb. 5, 2025, 3:06 p.m. UTC | #3
On Wed, Feb 05, 2025, David Woodhouse wrote:
> On Fri, 2025-01-31 at 17:13 -0800, Sean Christopherson wrote:
> > --- a/arch/x86/kvm/xen.c
> > +++ b/arch/x86/kvm/xen.c
> > @@ -1324,6 +1324,14 @@ int kvm_xen_hvm_config(struct kvm *kvm, struct kvm_xen_hvm_config *xhc)
> >  	     xhc->blob_size_32 || xhc->blob_size_64))
> >  		return -EINVAL;
> >  
> > +	/*
> > +	 * Restrict the MSR to the range that is unofficially reserved for
> > +	 * synthetic, virtualization-defined MSRs, e.g. to prevent confusing
> > +	 * KVM by colliding with a real MSR that requires special handling.
> > +	 */
> > +	if (xhc->msr && (xhc->msr < 0x40000000 || xhc->msr > 0x4fffffff))
> > +		return -EINVAL;
> > +
> >  	mutex_lock(&kvm->arch.xen.xen_lock);
> >  
> >  	if (xhc->msr && !kvm->arch.xen_hvm_config.msr)
> 
> I'd prefer to see #defines for those magic values.

Can do.  Hmm, and since this would be visible to userspace, arguably the #defines
should go in arch/x86/include/uapi/asm/kvm.h

> Especially as there is a corresponding requirement that they never be set
> from host context (which is where the potential locking issues come in).
> Which train of thought leads me to ponder this as an alternative (or
> additional) solution:
> 
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3733,7 +3733,13 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>         u32 msr = msr_info->index;
>         u64 data = msr_info->data;
>  
> -       if (msr && msr == vcpu->kvm->arch.xen_hvm_config.msr)
> +       /*
> +        * Do not allow host-initiated writes to trigger the Xen hypercall
> +        * page setup; it could incur locking paths which are not expected
> +        * if userspace sets the MSR in an unusual location.

That's just as likely to break userspace.  Doing a save/restore on the MSR doesn't
make a whole lot of sense since it's effectively a "command" MSR, but IMO it's not
any less likely than userspace putting the MSR index outside of the synthetic range.

Side topic, upstream QEMU doesn't even appear to put the MSR at the Hyper-V
address.  It tells the guest that's where the MSR is located, but the config
passed to KVM still uses the default.

        /* Hypercall MSR base address */
        if (hyperv_enabled(cpu)) {
            c->ebx = XEN_HYPERCALL_MSR_HYPERV;
            kvm_xen_init(cs->kvm_state, c->ebx);
        } else {
            c->ebx = XEN_HYPERCALL_MSR;
        }

...

        /* hyperv_enabled() doesn't work yet. */
        uint32_t msr = XEN_HYPERCALL_MSR;
        ret = kvm_xen_init(s, msr);
        if (ret < 0) {
            return ret;
        }

Userspace breakage aside, disallowng host writes would fix the immediate issue,
and I think would mitigate all concerns with putting the host at risk.  But it's
not enough to actually make an overlapping MSR index work.  E.g. if the MSR is
passed through to the guest, the write will go through to the hardware MSR, unless
the WRMSR happens to be emulated.

I really don't want to broadly support redirecting any MSR, because to truly go
down that path we'd need to deal with x2APIC, EFER, and other MSRs that have
special treatment and meaning.

While KVM's stance is usually that a misconfigured vCPU model is userspace's
problem, in this case I don't see any value in letting userspace be stupid.  It
can't work generally, it creates unique ABI for KVM_SET_MSRS, and unless there's
a crazy use case I'm overlooking, there's no sane reason for userspace to put the
index in outside of the synthetic range (whereas defining seemingly nonsensical
CPUID feature bits is useful for testing purposes, implementing support in
userspace, etc).
David Woodhouse Feb. 5, 2025, 3:26 p.m. UTC | #4
On Wed, 2025-02-05 at 07:06 -0800, Sean Christopherson wrote:
> On Wed, Feb 05, 2025, David Woodhouse wrote:
> > On Fri, 2025-01-31 at 17:13 -0800, Sean Christopherson wrote:
> > > --- a/arch/x86/kvm/xen.c
> > > +++ b/arch/x86/kvm/xen.c
> > > @@ -1324,6 +1324,14 @@ int kvm_xen_hvm_config(struct kvm *kvm, struct kvm_xen_hvm_config *xhc)
> > >  	     xhc->blob_size_32 || xhc->blob_size_64))
> > >  		return -EINVAL;
> > >  
> > > +	/*
> > > +	 * Restrict the MSR to the range that is unofficially reserved for
> > > +	 * synthetic, virtualization-defined MSRs, e.g. to prevent confusing
> > > +	 * KVM by colliding with a real MSR that requires special handling.
> > > +	 */
> > > +	if (xhc->msr && (xhc->msr < 0x40000000 || xhc->msr > 0x4fffffff))
> > > +		return -EINVAL;
> > > +
> > >  	mutex_lock(&kvm->arch.xen.xen_lock);
> > >  
> > >  	if (xhc->msr && !kvm->arch.xen_hvm_config.msr)
> > 
> > I'd prefer to see #defines for those magic values.
> 
> Can do.  Hmm, and since this would be visible to userspace, arguably the #defines
> should go in arch/x86/include/uapi/asm/kvm.h

Thanks.

> > Especially as there is a corresponding requirement that they never be set
> > from host context (which is where the potential locking issues come in).
> > Which train of thought leads me to ponder this as an alternative (or
> > additional) solution:
> > 
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -3733,7 +3733,13 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >         u32 msr = msr_info->index;
> >         u64 data = msr_info->data;
> >  
> > -       if (msr && msr == vcpu->kvm->arch.xen_hvm_config.msr)
> > +       /*
> > +        * Do not allow host-initiated writes to trigger the Xen hypercall
> > +        * page setup; it could incur locking paths which are not expected
> > +        * if userspace sets the MSR in an unusual location.
> 
> That's just as likely to break userspace.  Doing a save/restore on the MSR doesn't
> make a whole lot of sense since it's effectively a "command" MSR, but IMO it's not
> any less likely than userspace putting the MSR index outside of the synthetic range.

Save/restore on the MSR makes no sense. It's a write-only MSR; writing
to it has no effect *other* than populating the target page. In KVM we
don't implement reading from it at all; I don't think Xen does either?

And even if it was readable and would rather pointlessly return the
last value written to it, save/restore arguably shouldn't actually
trigger the guest memory to be overwritten again. The hypercall page
should only be populated when the *guest* writes the MSR.

With the recent elimination of the hypercall page from Linux Xen
guests, we've suggested that Linux should still set up the hypercall
page early (as it *does* have the side-effect of letting Xen know that
the guest is 64-bit). And then just free the page without ever using
it. We absolutely would not want a save/restore to scribble on that
page again.

I'm absolutely not worried about breaking userspace with such a change
to make the hypercall page MSR only work when !host_initiated. In fact
I think it's probably the right thing to do *anyway*.

If userspace wants to write to guest memory, it can do that anyway; it
doesn't need to ask the *kernel* to do it.

> Side topic, upstream QEMU doesn't even appear to put the MSR at the Hyper-V
> address.  It tells the guest that's where the MSR is located, but the config
> passed to KVM still uses the default.
> 
>         /* Hypercall MSR base address */
>         if (hyperv_enabled(cpu)) {
>             c->ebx = XEN_HYPERCALL_MSR_HYPERV;
>             kvm_xen_init(cs->kvm_state, c->ebx);
>         } else {
>             c->ebx = XEN_HYPERCALL_MSR;
>         }
> 
> ...
> 
>         /* hyperv_enabled() doesn't work yet. */
>         uint32_t msr = XEN_HYPERCALL_MSR;
>         ret = kvm_xen_init(s, msr);
>         if (ret < 0) {
>             return ret;
>         }
> 

Those two happen in reverse chronological order, don't they? And in the
lower one the comment tells you that hyperv_enabled() doesn't work yet.
When the higher one is called later, it calls kvm_xen_init() *again* to
put the MSR in the right place.

It could be prettier, but I don't think it's broken, is it?

> Userspace breakage aside, disallowng host writes would fix the immediate issue,
> and I think would mitigate all concerns with putting the host at risk.  But it's
> not enough to actually make an overlapping MSR index work.  E.g. if the MSR is
> passed through to the guest, the write will go through to the hardware MSR, unless
> the WRMSR happens to be emulated.
> 
> I really don't want to broadly support redirecting any MSR, because to truly go
> down that path we'd need to deal with x2APIC, EFER, and other MSRs that have
> special treatment and meaning.
> 
> While KVM's stance is usually that a misconfigured vCPU model is userspace's
> problem, in this case I don't see any value in letting userspace be stupid.  It
> can't work generally, it creates unique ABI for KVM_SET_MSRS, and unless there's
> a crazy use case I'm overlooking, there's no sane reason for userspace to put the
> index in outside of the synthetic range (whereas defining seemingly nonsensical
> CPUID feature bits is useful for testing purposes, implementing support in
> userspace, etc).

Right, I think we should do *both*. Blocking host writes solves the
issue of locking problems with the hypercall page setup. All it would
take for that issue to recur is for us (or Microsoft) to invent a new
MSR in the synthetic range which is also written on vCPU init/reset.
And then the sanity check on where the VMM puts the Xen MSR doesn't
save us.

But yes, we should *also* do that sanity check.
Sean Christopherson Feb. 5, 2025, 3:51 p.m. UTC | #5
On Wed, Feb 05, 2025, David Woodhouse wrote:
> On Wed, 2025-02-05 at 07:06 -0800, Sean Christopherson wrote:
> > On Wed, Feb 05, 2025, David Woodhouse wrote:
> > > Especially as there is a corresponding requirement that they never be set
> > > from host context (which is where the potential locking issues come in).
> > > Which train of thought leads me to ponder this as an alternative (or
> > > additional) solution:
> > > 
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -3733,7 +3733,13 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > >         u32 msr = msr_info->index;
> > >         u64 data = msr_info->data;
> > >  
> > > -       if (msr && msr == vcpu->kvm->arch.xen_hvm_config.msr)
> > > +       /*
> > > +        * Do not allow host-initiated writes to trigger the Xen hypercall
> > > +        * page setup; it could incur locking paths which are not expected
> > > +        * if userspace sets the MSR in an unusual location.
> > 
> > That's just as likely to break userspace.  Doing a save/restore on the MSR doesn't
> > make a whole lot of sense since it's effectively a "command" MSR, but IMO it's not
> > any less likely than userspace putting the MSR index outside of the synthetic range.
> 
> Save/restore on the MSR makes no sense. It's a write-only MSR; writing
> to it has no effect *other* than populating the target page. In KVM we
> don't implement reading from it at all; I don't think Xen does either?

Hah, that's another KVM bug, technically.  KVM relies on the MSR not being handled
in order to generate the write-only semantics, but if the MSR index collides with
an MSR that KVM emulates, then the MSR would be readable.  KVM supports Hyper-V's
HV_X64_MSR_TSC_INVARIANT_CONTROL (0x40000118), so just a few hundred more MSRs
until fireworks :-)

If we want to close that hole, it'd be easy enough to add a check in
kvm_get_msr_common().

> Those two happen in reverse chronological order, don't they? And in the
> lower one the comment tells you that hyperv_enabled() doesn't work yet.
> When the higher one is called later, it calls kvm_xen_init() *again* to
> put the MSR in the right place.
> 
> It could be prettier, but I don't think it's broken, is it?

Gah, -ENOCOFFEE.

> > Userspace breakage aside, disallowng host writes would fix the immediate issue,
> > and I think would mitigate all concerns with putting the host at risk.  But it's
> > not enough to actually make an overlapping MSR index work.  E.g. if the MSR is
> > passed through to the guest, the write will go through to the hardware MSR, unless
> > the WRMSR happens to be emulated.
> > 
> > I really don't want to broadly support redirecting any MSR, because to truly go
> > down that path we'd need to deal with x2APIC, EFER, and other MSRs that have
> > special treatment and meaning.
> > 
> > While KVM's stance is usually that a misconfigured vCPU model is userspace's
> > problem, in this case I don't see any value in letting userspace be stupid.  It
> > can't work generally, it creates unique ABI for KVM_SET_MSRS, and unless there's
> > a crazy use case I'm overlooking, there's no sane reason for userspace to put the
> > index in outside of the synthetic range (whereas defining seemingly nonsensical
> > CPUID feature bits is useful for testing purposes, implementing support in
> > userspace, etc).
> 
> Right, I think we should do *both*. Blocking host writes solves the
> issue of locking problems with the hypercall page setup. All it would
> take for that issue to recur is for us (or Microsoft) to invent a new
> MSR in the synthetic range which is also written on vCPU init/reset.
> And then the sanity check on where the VMM puts the Xen MSR doesn't
> save us.

Ugh, indeed.  MSRs are quite the conundrum.  Userspace MSR filters have a similar
problem, where it's impossible to know the semantics of future hardware MSRs, and
so it's impossible to document which MSRs userspace is allowed to intercept :-/

Oh!  It doesn't help KVM avoid breaking userspace, but a way for QEMU to avoid a
future collision would be to have QEMU start at 0x40000200 when Hyper-V is enabled,
but then use KVM_GET_MSR_INDEX_LIST to detect a collision with KVM Hyper-V, e.g.
increment the index until an available index is found (with sanity checks and whatnot).

> But yes, we should *also* do that sanity check.

Ah, I'm a-ok with that.
David Woodhouse Feb. 5, 2025, 4:18 p.m. UTC | #6
On 5 February 2025 15:51:01 GMT, Sean Christopherson <seanjc@google.com> wrote:
>On Wed, Feb 05, 2025, David Woodhouse wrote:
>> 
>> Save/restore on the MSR makes no sense. It's a write-only MSR; writing
>> to it has no effect *other* than populating the target page. In KVM we
>> don't implement reading from it at all; I don't think Xen does either?
>
>Hah, that's another KVM bug, technically.  KVM relies on the MSR not being handled
>in order to generate the write-only semantics, but if the MSR index collides with
>an MSR that KVM emulates, then the MSR would be readable.  KVM supports Hyper-V's
>HV_X64_MSR_TSC_INVARIANT_CONTROL (0x40000118), so just a few hundred more MSRs
>until fireworks :-)

Nah, I don't see that as a bug. If there's a conflict then the Xen hypercall MSR "steals" writes from the one it conflicts with, sure. But since it's a write-only MSR the conflicting one still works fine for reads. Which means that Xen can "conflict" with a read-only MSR and nobody cares; arguably there's no bug at all in that case.

>> Those two happen in reverse chronological order, don't they? And in the
>> lower one the comment tells you that hyperv_enabled() doesn't work yet.
>> When the higher one is called later, it calls kvm_xen_init() *again* to
>> put the MSR in the right place.
>> 
>> It could be prettier, but I don't think it's broken, is it?
>
>Gah, -ENOCOFFEE.

I'll take the criticism though; that code is distinctly non-obvious, even with that hint in the comment about hyperv_enabled() not being usable yet.

ISTR we needed to do the Xen init early on, even before we knew precisely which MSR to use. And that's why we do it that way and then just call the function again later if we need to change the MSR. I'll see if that can be simplified, and at the very least I'll update the existing comment to explicitly state that the function will get called again later if needed.

You shouldn't *need* coffee to understand the code.

>> > Userspace breakage aside, disallowng host writes would fix the immediate issue,
>> > and I think would mitigate all concerns with putting the host at risk.  But it's
>> > not enough to actually make an overlapping MSR index work.  E.g. if the MSR is
>> > passed through to the guest, the write will go through to the hardware MSR, unless
>> > the WRMSR happens to be emulated.
>> > 
>> > I really don't want to broadly support redirecting any MSR, because to truly go
>> > down that path we'd need to deal with x2APIC, EFER, and other MSRs that have
>> > special treatment and meaning.
>> > 
>> > While KVM's stance is usually that a misconfigured vCPU model is userspace's
>> > problem, in this case I don't see any value in letting userspace be stupid.  It
>> > can't work generally, it creates unique ABI for KVM_SET_MSRS, and unless there's
>> > a crazy use case I'm overlooking, there's no sane reason for userspace to put the
>> > index in outside of the synthetic range (whereas defining seemingly nonsensical
>> > CPUID feature bits is useful for testing purposes, implementing support in
>> > userspace, etc).
>> 
>> Right, I think we should do *both*. Blocking host writes solves the
>> issue of locking problems with the hypercall page setup. All it would
>> take for that issue to recur is for us (or Microsoft) to invent a new
>> MSR in the synthetic range which is also written on vCPU init/reset.
>> And then the sanity check on where the VMM puts the Xen MSR doesn't
>> save us.
>
>Ugh, indeed.  MSRs are quite the conundrum.  Userspace MSR filters have a similar
>problem, where it's impossible to know the semantics of future hardware MSRs, and
>so it's impossible to document which MSRs userspace is allowed to intercept :-/
>
>Oh!  It doesn't help KVM avoid breaking userspace, but a way for QEMU to avoid a
>future collision would be to have QEMU start at 0x40000200 when Hyper-V is enabled,
>but then use KVM_GET_MSR_INDEX_LIST to detect a collision with KVM Hyper-V, e.g.
>increment the index until an available index is found (with sanity checks and whatnot).

Makes sense. I think that's a third separate patch, yes?

>> But yes, we should *also* do that sanity check.
>
>Ah, I'm a-ok with that.
David Woodhouse Feb. 5, 2025, 5:15 p.m. UTC | #7
On Wed, 2025-02-05 at 16:18 +0000, David Woodhouse wrote:
> 
> > Oh!  It doesn't help KVM avoid breaking userspace, but a way for QEMU to avoid a
> > future collision would be to have QEMU start at 0x40000200 when Hyper-V is enabled,
> > but then use KVM_GET_MSR_INDEX_LIST to detect a collision with KVM Hyper-V, e.g.
> > increment the index until an available index is found (with sanity checks and whatnot).
> 
> Makes sense. I think that's a third separate patch, yes?

To be clear, I think I mean a third patch which further restricts
kvm_xen_hvm_config() to disallow indices for which
kvm_is_advertised_msr() returns true?

We could roll that into your original patch instead, if you prefer.

Q: Should kvm_is_advertised_msr() include the Xen hypercall MSR, if one
is already configured? Life is easier if we answer 'no'...
Sean Christopherson Feb. 5, 2025, 7:20 p.m. UTC | #8
On Wed, Feb 05, 2025, David Woodhouse wrote:
> On Wed, 2025-02-05 at 16:18 +0000, David Woodhouse wrote:
> > 
> > > Oh!  It doesn't help KVM avoid breaking userspace, but a way for QEMU to avoid a
> > > future collision would be to have QEMU start at 0x40000200 when Hyper-V is enabled,
> > > but then use KVM_GET_MSR_INDEX_LIST to detect a collision with KVM Hyper-V, e.g.
> > > increment the index until an available index is found (with sanity checks and whatnot).
> > 
> > Makes sense. I think that's a third separate patch, yes?
> 
> To be clear, I think I mean a third patch which further restricts
> kvm_xen_hvm_config() to disallow indices for which
> kvm_is_advertised_msr() returns true?
> 
> We could roll that into your original patch instead, if you prefer.

Nah, I like the idea of separate patch.

> Q: Should kvm_is_advertised_msr() include the Xen hypercall MSR, if one
> is already configured? Life is easier if we answer 'no'...

No :-)

The idea with kvm_is_advertised_msr() is to ignore accesses to MSRs that don't
exist according the to vCPU model, but that KVM advertised to userspace (via
KVM_GET_MSR_INDEX_LIST) and so may be saved/restored by a naive/unoptimized
userspace.

For the Xen MSR, KVM never advertises the MSR, and IIUC, KVM will never treat
the MSR as non-existent because defining the MSR brings it into existence.
David Woodhouse Feb. 6, 2025, 9:18 a.m. UTC | #9
On Wed, 2025-02-05 at 07:51 -0800, Sean Christopherson wrote:
> 
> > Those two happen in reverse chronological order, don't they? And in the
> > lower one the comment tells you that hyperv_enabled() doesn't work yet.
> > When the higher one is called later, it calls kvm_xen_init() *again* to
> > put the MSR in the right place.
> > 
> > It could be prettier, but I don't think it's broken, is it?
> 
> Gah, -ENOCOFFEE.

I trust this version would require less coffee to parse...

From a90c2df0bd9589609085dd42f94b61de1bf48eb7 Mon Sep 17 00:00:00 2001
From: David Woodhouse <dwmw@amazon.co.uk>
Date: Thu, 6 Feb 2025 08:52:52 +0000
Subject: [PATCH] i386/xen: Move KVM_XEN_HVM_CONFIG ioctl to
 kvm_xen_init_vcpu()

At the time kvm_xen_init() is called, hyperv_enabled() doesn't yet work, so
the correct MSR index to use for the hypercall page isn't known.

Rather than setting it to the default and then shifting it later for the
Hyper-V case with a confusing second call to kvm_init_xen(), just do it
once in kvm_xen_init_vcpu().

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 target/i386/kvm/kvm.c     | 16 +++++---------
 target/i386/kvm/xen-emu.c | 44 ++++++++++++++++++++-------------------
 target/i386/kvm/xen-emu.h |  4 ++--
 3 files changed, 30 insertions(+), 34 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 6c749d4ee8..b4deec6255 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -2129,6 +2129,8 @@ int kvm_arch_init_vcpu(CPUState *cs)
     if (cs->kvm_state->xen_version) {
 #ifdef CONFIG_XEN_EMU
         struct kvm_cpuid_entry2 *xen_max_leaf;
+        uint32_t hypercall_msr =
+            hyperv_enabled(cpu) ? XEN_HYPERCALL_MSR_HYPERV : XEN_HYPERCALL_MSR;
 
         memcpy(signature, "XenVMMXenVMM", 12);
 
@@ -2150,13 +2152,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
         c->function = kvm_base + XEN_CPUID_HVM_MSR;
         /* Number of hypercall-transfer pages */
         c->eax = 1;
-        /* Hypercall MSR base address */
-        if (hyperv_enabled(cpu)) {
-            c->ebx = XEN_HYPERCALL_MSR_HYPERV;
-            kvm_xen_init(cs->kvm_state, c->ebx);
-        } else {
-            c->ebx = XEN_HYPERCALL_MSR;
-        }
+        c->ebx = hypercall_msr;
         c->ecx = 0;
         c->edx = 0;
 
@@ -2194,7 +2190,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
             }
         }
 
-        r = kvm_xen_init_vcpu(cs);
+        r = kvm_xen_init_vcpu(cs, hypercall_msr);
         if (r) {
             return r;
         }
@@ -3245,9 +3241,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
             error_report("kvm: Xen support only available in PC machine");
             return -ENOTSUP;
         }
-        /* hyperv_enabled() doesn't work yet. */
-        uint32_t msr = XEN_HYPERCALL_MSR;
-        ret = kvm_xen_init(s, msr);
+        ret = kvm_xen_init(s);
         if (ret < 0) {
             return ret;
         }
diff --git a/target/i386/kvm/xen-emu.c b/target/i386/kvm/xen-emu.c
index e81a245881..1144a6efcd 100644
--- a/target/i386/kvm/xen-emu.c
+++ b/target/i386/kvm/xen-emu.c
@@ -108,15 +108,11 @@ static inline int kvm_copy_to_gva(CPUState *cs, uint64_t gva, void *buf,
     return kvm_gva_rw(cs, gva, buf, sz, true);
 }
 
-int kvm_xen_init(KVMState *s, uint32_t hypercall_msr)
+int kvm_xen_init(KVMState *s)
 {
     const int required_caps = KVM_XEN_HVM_CONFIG_HYPERCALL_MSR |
         KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL | KVM_XEN_HVM_CONFIG_SHARED_INFO;
-    struct kvm_xen_hvm_config cfg = {
-        .msr = hypercall_msr,
-        .flags = KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL,
-    };
-    int xen_caps, ret;
+    int xen_caps;
 
     xen_caps = kvm_check_extension(s, KVM_CAP_XEN_HVM);
     if (required_caps & ~xen_caps) {
@@ -130,20 +126,6 @@ int kvm_xen_init(KVMState *s, uint32_t hypercall_msr)
             .u.xen_version = s->xen_version,
         };
         (void)kvm_vm_ioctl(s, KVM_XEN_HVM_SET_ATTR, &ha);
-
-        cfg.flags |= KVM_XEN_HVM_CONFIG_EVTCHN_SEND;
-    }
-
-    ret = kvm_vm_ioctl(s, KVM_XEN_HVM_CONFIG, &cfg);
-    if (ret < 0) {
-        error_report("kvm: Failed to enable Xen HVM support: %s",
-                     strerror(-ret));
-        return ret;
-    }
-
-    /* If called a second time, don't repeat the rest of the setup. */
-    if (s->xen_caps) {
-        return 0;
     }
 
     /*
@@ -185,10 +167,14 @@ int kvm_xen_init(KVMState *s, uint32_t hypercall_msr)
     return 0;
 }
 
-int kvm_xen_init_vcpu(CPUState *cs)
+int kvm_xen_init_vcpu(CPUState *cs, uint32_t hypercall_msr)
 {
     X86CPU *cpu = X86_CPU(cs);
     CPUX86State *env = &cpu->env;
+    struct kvm_xen_hvm_config cfg = {
+        .msr = hypercall_msr,
+        .flags = KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL,
+    };
     int err;
 
     /*
@@ -210,6 +196,22 @@ int kvm_xen_init_vcpu(CPUState *cs)
                          strerror(-err));
             return err;
         }
+
+        cfg.flags |= KVM_XEN_HVM_CONFIG_EVTCHN_SEND;
+    }
+
+    /*
+     * This is a per-KVM setting, but hyperv_enabled() can't be used
+     * when kvm_xen_init() is called from kvm_arch_init(), so do it
+     * when the BSP is initialized.
+     */
+    if (cs->cpu_index == 0) {
+        err = kvm_vm_ioctl(cs->kvm_state, KVM_XEN_HVM_CONFIG, &cfg);
+        if (err) {
+            error_report("kvm: Failed to enable Xen HVM support: %s",
+                         strerror(-err));
+            return err;
+        }
     }
 
     env->xen_vcpu_info_gpa = INVALID_GPA;
diff --git a/target/i386/kvm/xen-emu.h b/target/i386/kvm/xen-emu.h
index fe85e0b195..7a7c72eee5 100644
--- a/target/i386/kvm/xen-emu.h
+++ b/target/i386/kvm/xen-emu.h
@@ -23,8 +23,8 @@
 
 #define XEN_VERSION(maj, min) ((maj) << 16 | (min))
 
-int kvm_xen_init(KVMState *s, uint32_t hypercall_msr);
-int kvm_xen_init_vcpu(CPUState *cs);
+int kvm_xen_init(KVMState *s);
+int kvm_xen_init_vcpu(CPUState *cs, uint32_t hypercall_msr);
 int kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit);
 int kvm_put_xen_state(CPUState *cs);
 int kvm_get_xen_state(CPUState *cs);
David Woodhouse Feb. 6, 2025, 4:51 p.m. UTC | #10
On Fri, 2025-01-31 at 17:13 -0800, Sean Christopherson wrote:
> Reject userspace attempts to set the Xen hypercall page MSR to an index
> outside of the "standard" virtualization range [0x40000000, 0x4fffffff],
> as KVM is not equipped to handle collisions with real MSRs, e.g. KVM
> doesn't update MSR interception, conflicts with VMCS/VMCB fields, special
> case writes in KVM, etc.
> 
> Allowing userspace to redirect any MSR write can also be used to attack
> the kernel, as kvm_xen_write_hypercall_page() takes multiple locks and
> writes to guest memory.  E.g. if userspace sets the MSR to MSR_IA32_XSS,
> KVM's write to MSR_IA32_XSS during vCPU creation will trigger an SRCU
> violation due to writing guest memory:
> 
>   =============================
>   WARNING: suspicious RCU usage
>   6.13.0-rc3
>   -----------------------------
>   include/linux/kvm_host.h:1046 suspicious rcu_dereference_check() usage!
> 
>   stack backtrace:
>   CPU: 6 UID: 1000 PID: 1101 Comm: repro Not tainted 6.13.0-rc3
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   Call Trace:
>    <TASK>
>    dump_stack_lvl+0x7f/0x90
>    lockdep_rcu_suspicious+0x176/0x1c0
>    kvm_vcpu_gfn_to_memslot+0x259/0x280
>    kvm_vcpu_write_guest+0x3a/0xa0
>    kvm_xen_write_hypercall_page+0x268/0x300
>    kvm_set_msr_common+0xc44/0x1940
>    vmx_set_msr+0x9db/0x1fc0
>    kvm_vcpu_reset+0x857/0xb50
>    kvm_arch_vcpu_create+0x37e/0x4d0
>    kvm_vm_ioctl+0x669/0x2100
>    __x64_sys_ioctl+0xc1/0xf0
>    do_syscall_64+0xc5/0x210
>    entry_SYSCALL_64_after_hwframe+0x4b/0x53
>   RIP: 0033:0x7feda371b539
> 
> While the MSR index isn't strictly ABI, i.e. can theoretically float to
> any value, in practice no known VMM sets the MSR index to anything other
> than 0x40000000 or 0x40000200.
> 
> Reported-by: syzbot+cdeaeec70992eca2d920@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/679258d4.050a0220.2eae65.000a.GAE@google.com
> Cc: Joao Martins <joao.m.martins@oracle.com>
> Cc: Paul Durrant <paul@xen.org>
> Cc: David Woodhouse <dwmw@amazon.co.uk>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

With macros for the magic numbers as discussed (and a corresponding
update to the documentation), and with the Reported-by: and Closes:
tags dropped because they should move to the commit which makes the
hypercall page only trigger for !host_initiated writes and resolves it
in a more future-proof way for the general case,

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
David Woodhouse Feb. 6, 2025, 6:58 p.m. UTC | #11
On Wed, 2025-02-05 at 11:20 -0800, Sean Christopherson wrote:
> On Wed, Feb 05, 2025, David Woodhouse wrote:
> > On Wed, 2025-02-05 at 16:18 +0000, David Woodhouse wrote:
> > > 
> > > > Oh!  It doesn't help KVM avoid breaking userspace, but a way for QEMU to avoid a
> > > > future collision would be to have QEMU start at 0x40000200 when Hyper-V is enabled,
> > > > but then use KVM_GET_MSR_INDEX_LIST to detect a collision with KVM Hyper-V, e.g.
> > > > increment the index until an available index is found (with sanity checks and whatnot).
> > > 
> > > Makes sense. I think that's a third separate patch, yes?
> > 
> > To be clear, I think I mean a third patch which further restricts
> > kvm_xen_hvm_config() to disallow indices for which
> > kvm_is_advertised_msr() returns true?
> > 
> > We could roll that into your original patch instead, if you prefer.
> 
> Nah, I like the idea of separate patch.

Helpfully, kvm_is_advertised_msr() doesn't actually return true for
MSR_IA32_XSS. Is that a bug?

And kvm_vcpu_reset() attempts to set MSR_IA32_XSS even if the guest
doesn't have X86_FEATURE_XSAVES. Is that a bug?
Sean Christopherson Feb. 7, 2025, 5:18 p.m. UTC | #12
On Thu, Feb 06, 2025, David Woodhouse wrote:
> On Wed, 2025-02-05 at 11:20 -0800, Sean Christopherson wrote:
> > On Wed, Feb 05, 2025, David Woodhouse wrote:
> > > On Wed, 2025-02-05 at 16:18 +0000, David Woodhouse wrote:
> > > > 
> > > > > Oh!  It doesn't help KVM avoid breaking userspace, but a way for QEMU to avoid a
> > > > > future collision would be to have QEMU start at 0x40000200 when Hyper-V is enabled,
> > > > > but then use KVM_GET_MSR_INDEX_LIST to detect a collision with KVM Hyper-V, e.g.
> > > > > increment the index until an available index is found (with sanity checks and whatnot).
> > > > 
> > > > Makes sense. I think that's a third separate patch, yes?
> > > 
> > > To be clear, I think I mean a third patch which further restricts
> > > kvm_xen_hvm_config() to disallow indices for which
> > > kvm_is_advertised_msr() returns true?
> > > 
> > > We could roll that into your original patch instead, if you prefer.
> > 
> > Nah, I like the idea of separate patch.
> 
> Helpfully, kvm_is_advertised_msr() doesn't actually return true for
> MSR_IA32_XSS. Is that a bug?

Technically, no.  KVM doesn't support a non-zero XSS, yet, and so there's nothing
for userspace to save/restore.  But the word "yet"...

> And kvm_vcpu_reset() attempts to set MSR_IA32_XSS even if the guest
> doesn't have X86_FEATURE_XSAVES. Is that a bug?

Ugh, sort of.  Functionally, it's fine.  Though it's quite the mess.  The write
can be straight up deleted, as the vCPU structure is zero-allocated and the CPUID
side effects that using __kvm_set_msr() is intended to deal with are irrelevant
because the CPUID array can't yet exist.

The code came about due to an SDM bug and racing patches, and we missed that the
__kvm_set_msr() would be pointless.

The SDM had a bug that said XSS was cleared to '0' on INIT, and then KVM had a
bug in its emulation of the buggy INIT logic where KVM open coded clearing ia32_xss,
which led to stale CPUID information (XSTATE sizes weren't updated).

While the patch[1] that became commit 05a9e065059e ("KVM: x86: Sync the states
size with the XCR0/IA32_XSS at, any time") was in flight, Xiayoao reported the
SDM bug and sent a fix[2].

I merged the two changes, but overlooked that at RESET, the CPUID array is
guaranteed to be null/empty, and so calling into kvm_update_cpuid_runtime() is
technically pointless.  And because guest CPUID is empty, the vCPU can't
possibly have X86_FEATURE_XSAVES, so gating the write on XSAVES would be even
weirder and more confusing.

I'm not sure what the best answer is.  I'm leaning towards simply deleting the
write.  I'd love to have a better MSR framework in KVM, e.g. to document which
MSRs are modified by INIT, but at this point I think writing '0' to an MSR during
RESET (a.k.a. vCPU creation) does more harm than good.

[1] https://lore.kernel.org/all/20220117082631.86143-1-likexu@tencent.com
[2] https://lore.kernel.org/all/20220126034750.2495371-1-xiaoyao.li@intel.com
diff mbox series

Patch

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index a909b817b9c0..35ecafc410f0 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -1324,6 +1324,14 @@  int kvm_xen_hvm_config(struct kvm *kvm, struct kvm_xen_hvm_config *xhc)
 	     xhc->blob_size_32 || xhc->blob_size_64))
 		return -EINVAL;
 
+	/*
+	 * Restrict the MSR to the range that is unofficially reserved for
+	 * synthetic, virtualization-defined MSRs, e.g. to prevent confusing
+	 * KVM by colliding with a real MSR that requires special handling.
+	 */
+	if (xhc->msr && (xhc->msr < 0x40000000 || xhc->msr > 0x4fffffff))
+		return -EINVAL;
+
 	mutex_lock(&kvm->arch.xen.xen_lock);
 
 	if (xhc->msr && !kvm->arch.xen_hvm_config.msr)