[2/3] KVM: VMX: Move RESET emulation to vmx_vcpu_reset()

Message ID	20210914230840.3030620-3-seanjc@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@kernel.org> Reply-To: Sean Christopherson <seanjc@google.com> Date: Tue, 14 Sep 2021 16:08:39 -0700 In-Reply-To: <20210914230840.3030620-1-seanjc@google.com> Message-Id: <20210914230840.3030620-3-seanjc@google.com> Mime-Version: 1.0 References: <20210914230840.3030620-1-seanjc@google.com> Subject: [PATCH 2/3] KVM: VMX: Move RESET emulation to vmx_vcpu_reset() From: Sean Christopherson <seanjc@google.com> To: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com>, Vitaly Kuznetsov <vkuznets@redhat.com>, Wanpeng Li <wanpengli@tencent.com>, Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Reiji Watanabe <reijiw@google.com> Content-Type: text/plain; charset="UTF-8" Precedence: bulk
Series	KVM: x86: Clean up RESET "emulation" \| expand [0/3] KVM: x86: Clean up RESET "emulation" [1/3] KVM: VMX: Drop explicit zeroing of MSR guest values at vCPU creation [2/3] KVM: VMX: Move RESET emulation to vmx_vcpu_reset() [3/3] KVM: SVM: Move RESET emulation to svm_vcpu_reset()

Message ID

20210914230840.3030620-3-seanjc@google.com (mailing list archive)

State

New, archived

Headers

Reply-To: Sean Christopherson <seanjc@google.com>
Date: Tue, 14 Sep 2021 16:08:39 -0700
In-Reply-To: <20210914230840.3030620-1-seanjc@google.com>
Message-Id: <20210914230840.3030620-3-seanjc@google.com>
Mime-Version: 1.0
References: <20210914230840.3030620-1-seanjc@google.com>
Subject: [PATCH 2/3] KVM: VMX: Move RESET emulation to vmx_vcpu_reset()
From: Sean Christopherson <seanjc@google.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>,
        Vitaly Kuznetsov <vkuznets@redhat.com>,
        Wanpeng Li <wanpengli@tencent.com>,
        Jim Mattson <jmattson@google.com>,
        Joerg Roedel <joro@8bytes.org>, kvm@vger.kernel.org,
        linux-kernel@vger.kernel.org, Reiji Watanabe <reijiw@google.com>
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk

Series

KVM: x86: Clean up RESET "emulation" | expand

Commit Message

Sean Christopherson Sept. 14, 2021, 11:08 p.m. UTC

Move vCPU RESET emulation, including initializating of select VMCS state,
to vmx_vcpu_reset().  Drop the open coded "vCPU load" sequence, as
->vcpu_reset() is invoked while the vCPU is properly loaded (which is
kind of the point of ->vcpu_reset()...).  Hopefully KVM will someday
expose a dedicated RESET ioctl(), and in the meantime separating "create"
from "RESET" is a nice cleanup.

Deferring VMCS initialization is effectively a nop as it's impossible to
safely access the VMCS between the current call site and its new home, as
both the vCPU and the pCPU are put immediately after init_vmcs(), i.e.
the VMCS isn't guaranteed to be loaded.

Note, task preemption is not a problem as vmx_sched_in() _can't_ touch
the VMCS as ->sched_in() is invoked before the vCPU, and thus VMCS, is
reloaded.  I.e. the preemption path also can't consume VMCS state.

Cc: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/vmx.c | 61 +++++++++++++++++++++---------------------
 1 file changed, 31 insertions(+), 30 deletions(-)

Comments

Vitaly Kuznetsov Sept. 15, 2021, 10:30 a.m. UTC | #1

Sean Christopherson <seanjc@google.com> writes:

> Move vCPU RESET emulation, including initializating of select VMCS state,
> to vmx_vcpu_reset().  Drop the open coded "vCPU load" sequence, as
> ->vcpu_reset() is invoked while the vCPU is properly loaded (which is
> kind of the point of ->vcpu_reset()...).  Hopefully KVM will someday
> expose a dedicated RESET ioctl(), and in the meantime separating "create"
> from "RESET" is a nice cleanup.
>
> Deferring VMCS initialization is effectively a nop as it's impossible to
> safely access the VMCS between the current call site and its new home, as
> both the vCPU and the pCPU are put immediately after init_vmcs(), i.e.
> the VMCS isn't guaranteed to be loaded.
>
> Note, task preemption is not a problem as vmx_sched_in() _can't_ touch
> the VMCS as ->sched_in() is invoked before the vCPU, and thus VMCS, is
> reloaded.  I.e. the preemption path also can't consume VMCS state.
>
> Cc: Reiji Watanabe <reijiw@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 61 +++++++++++++++++++++---------------------
>  1 file changed, 31 insertions(+), 30 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index dc274b4c9912..629427cf8f4e 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4327,10 +4327,6 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
>  
>  #define VMX_XSS_EXIT_BITMAP 0
>  
> -/*
> - * Noting that the initialization of Guest-state Area of VMCS is in
> - * vmx_vcpu_reset().
> - */
>  static void init_vmcs(struct vcpu_vmx *vmx)
>  {
>  	if (nested)
> @@ -4435,10 +4431,39 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>  	vmx_setup_uret_msrs(vmx);
>  }
>  
> +static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	init_vmcs(vmx);
> +
> +	if (nested)
> +		memcpy(&vmx->nested.msrs, &vmcs_config.nested, sizeof(vmx->nested.msrs));
> +
> +	vcpu_setup_sgx_lepubkeyhash(vcpu);
> +
> +	vmx->nested.posted_intr_nv = -1;
> +	vmx->nested.current_vmptr = -1ull;
> +	vmx->nested.hv_evmcs_vmptr = EVMPTR_INVALID;

What would happen in (hypothetical) case when enlightened VMCS is
currently in use? If we zap 'hv_evmcs_vmptr' here, the consequent
nested_release_evmcs() (called from
nested_vmx_handle_enlightened_vmptrld(), for example) will not do 
kvm_vcpu_unmap() while it should.

This, however, got me thinking: should we free all-things-nested with
free_nested()/nested_vmx_free_vcpu() upon vcpu reset? I can't seem to
find us doing that... (I do remember that INIT is blocked in VMX-root
mode and nobody else besides kvm_arch_vcpu_create()/
kvm_apic_accept_events() seems to call kvm_vcpu_reset()) but maybe we
should at least add a WARN_ON() guardian here?

Side topic: we don't seem to reset Hyper-V specific MSRs either,
probably relying on userspace VMM doing the right thing but this is also
not ideal I believe.

> +
> +	vcpu->arch.microcode_version = 0x100000000ULL;
> +	vmx->msr_ia32_feature_control_valid_bits = FEAT_CTL_LOCKED;
> +
> +	/*
> +	 * Enforce invariant: pi_desc.nv is always either POSTED_INTR_VECTOR
> +	 * or POSTED_INTR_WAKEUP_VECTOR.
> +	 */
> +	vmx->pi_desc.nv = POSTED_INTR_VECTOR;
> +	vmx->pi_desc.sn = 1;
> +}
> +
>  static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  
> +	if (!init_event)
> +		__vmx_vcpu_reset(vcpu);
> +
>  	vmx->rmode.vm86_active = 0;
>  	vmx->spec_ctrl = 0;
>  
> @@ -6797,7 +6822,7 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
>  {
>  	struct vmx_uret_msr *tsx_ctrl;
>  	struct vcpu_vmx *vmx;
> -	int i, cpu, err;
> +	int i, err;
>  
>  	BUILD_BUG_ON(offsetof(struct vcpu_vmx, vcpu) != 0);
>  	vmx = to_vmx(vcpu);
> @@ -6856,12 +6881,7 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
>  	}
>  
>  	vmx->loaded_vmcs = &vmx->vmcs01;
> -	cpu = get_cpu();
> -	vmx_vcpu_load(vcpu, cpu);
> -	vcpu->cpu = cpu;
> -	init_vmcs(vmx);
> -	vmx_vcpu_put(vcpu);
> -	put_cpu();
> +
>  	if (cpu_need_virtualize_apic_accesses(vcpu)) {
>  		err = alloc_apic_access_page(vcpu->kvm);
>  		if (err)
> @@ -6874,25 +6894,6 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
>  			goto free_vmcs;
>  	}
>  
> -	if (nested)
> -		memcpy(&vmx->nested.msrs, &vmcs_config.nested, sizeof(vmx->nested.msrs));
> -
> -	vcpu_setup_sgx_lepubkeyhash(vcpu);
> -
> -	vmx->nested.posted_intr_nv = -1;
> -	vmx->nested.current_vmptr = -1ull;
> -	vmx->nested.hv_evmcs_vmptr = EVMPTR_INVALID;
> -
> -	vcpu->arch.microcode_version = 0x100000000ULL;
> -	vmx->msr_ia32_feature_control_valid_bits = FEAT_CTL_LOCKED;
> -
> -	/*
> -	 * Enforce invariant: pi_desc.nv is always either POSTED_INTR_VECTOR
> -	 * or POSTED_INTR_WAKEUP_VECTOR.
> -	 */
> -	vmx->pi_desc.nv = POSTED_INTR_VECTOR;
> -	vmx->pi_desc.sn = 1;
> -
>  	return 0;
>  
>  free_vmcs:

Sean Christopherson Sept. 15, 2021, 5:34 p.m. UTC | #2

On Wed, Sep 15, 2021, Vitaly Kuznetsov wrote:
> Sean Christopherson <seanjc@google.com> writes:
> > +static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
> > +{
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +
> > +	init_vmcs(vmx);
> > +
> > +	if (nested)
> > +		memcpy(&vmx->nested.msrs, &vmcs_config.nested, sizeof(vmx->nested.msrs));
> > +
> > +	vcpu_setup_sgx_lepubkeyhash(vcpu);
> > +
> > +	vmx->nested.posted_intr_nv = -1;
> > +	vmx->nested.current_vmptr = -1ull;
> > +	vmx->nested.hv_evmcs_vmptr = EVMPTR_INVALID;
> 
> What would happen in (hypothetical) case when enlightened VMCS is
> currently in use? If we zap 'hv_evmcs_vmptr' here, the consequent
> nested_release_evmcs() (called from
> nested_vmx_handle_enlightened_vmptrld(), for example) will not do 
> kvm_vcpu_unmap() while it should.

The short answer is that there's a lot of stuff that needs to be addressed before
KVM can expose a RESET ioctl().  My goal with these patches is to carve out the
stubs and move the few bits of RESET emulation into the "stubs".  This is the same
answer for the MSR question/comment at the end.

> This, however, got me thinking: should we free all-things-nested with
> free_nested()/nested_vmx_free_vcpu() upon vcpu reset? I can't seem to
> find us doing that... (I do remember that INIT is blocked in VMX-root
> mode and nobody else besides kvm_arch_vcpu_create()/
> kvm_apic_accept_events() seems to call kvm_vcpu_reset()) but maybe we
> should at least add a WARN_ON() guardian here?

I think that makes sense.  Maybe use CR0 as a sentinel since it has a non-zero
RESET value?  E.g. WARN if CR0 is non-zero at RESET.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 86539c1686fa..3ac074376821 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10813,6 +10813,11 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
        unsigned long new_cr0;
        u32 eax, dummy;

+       /*
+        * <comment about KVM not supporting arbitrary RESET>
+        */
+       WARN_ON_ONCE(!init_event && old_cr0);
+
        kvm_lapic_reset(vcpu, init_event);

        vcpu->arch.hflags = 0;

Huh, typing that out made me realize commit 0aa1837533e5 ("KVM: x86: Properly
reset MMU context at vCPU RESET/INIT") technically introduced a bug.  kvm_vcpu_reset()
does kvm_read_cr0() and thus reads vmcs.GUEST_CR0 because vcpu->arch.regs_avail is
(correctly) not stuffed to ALL_ONES until later in kvm_vcpu_reset().  init_vmcs()
doesn't explicitly zero vmcs.GUEST_CR0 (along with many other guest fields), and
so VMREAD(GUEST_CR0) is technically consuming garbage.  In practice, it's consuming
'0' because no known CPU or VMM inverts values in the VMCS, i.e. zero allocating
the VMCS is functionally equivalent to writing '0' to all fields via VMWRITE.

And staring more at kvm_vcpu_reset(), this code is terrifying for INIT

	memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs));
	vcpu->arch.regs_avail = ~0;
	vcpu->arch.regs_dirty = ~0;

because it means cr0 and cr4 are marked available+dirty without immediately writing
vcpu->arch.cr0/cr4.  And VMX subtly relies on that, as vmx_set_cr0() grabs CR0.PG
via kvm_read_cr0_bits(), i.e. zeroing vcpu->arch.cr0 would "break" the INIT flow.
Ignoring for the moment that CR0.PG is never guest-owned and thus never stale in
vcpu->arch.cr0, KVM is also technically relying on the earlier kvm_read_cr0() in
kvm_vcpu_reset() to ensure vcpu->arch.cr0 is fresh.

Stuffing regs_avail technically means vmx_set_rflags() -> vmx_get_rflags() is
consuming stale data.  It doesn't matter in practice because the old value is
only used to re-evaluate vmx->emulation_required, which is guaranteed to be up
refreshed by vmx_set_cr0() and friends.  PDPTRs and EXIT_INFO are in a similar
boat; KVM shouldn't be reading those fields (CR0.PG=0, not an exit path), but
marking them dirty without actually updating the cached values is wrong.

There's also one concrete bug: KVM doesn't set vcpu->arch.cr3=0 on RESET/INIT.
That bug has gone unnoticed because no real word BIOS/kernel is going to rely on
INIT to set CR3=0.  

I'm strongly leaning towards stuffing regs_avail/dirty in kvm_arch_vcpu_create(),
and relying on explicit kvm_register_mark_dirty() calls for the 3 or so cases where
x86 code writes a vcpu->arch register directly.  That would fix the CR0 read bug
and also prevent subtle bugs from sneaking in.  Adding new EXREGS would be slightly
more costly, but IMO that's a good thing as would force us to actually think about
how to handle each register.

E.g. implement this over 2-3 patches:

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 114847253e0a..743146ac8307 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4385,6 +4385,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
        kvm_set_cr8(vcpu, 0);

        vmx_segment_cache_clear(vmx);
+       kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);

        seg_setup(VCPU_SREG_CS);
        vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 86539c1686fa..ab907a0b9eeb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10656,6 +10656,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
        int r;

        vcpu->arch.last_vmentry_cpu = -1;
+       vcpu->arch.regs_avail = ~0;
+       vcpu->arch.regs_dirty = ~0;

        if (!irqchip_in_kernel(vcpu->kvm) || kvm_vcpu_is_reset_bsp(vcpu))
                vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
@@ -10874,9 +10876,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
                vcpu->arch.xcr0 = XFEATURE_MASK_FP;
        }

+       /* All GPRs except RDX (handled below) are zeroed on RESET/INIT. */
        memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs));
-       vcpu->arch.regs_avail = ~0;
-       vcpu->arch.regs_dirty = ~0;
+       kvm_register_mark_dirty(vcpu, VCPU_REGS_RSP);

        /*
         * Fall back to KVM's default Family/Model/Stepping of 0x600 (P6/Athlon)
@@ -10897,6 +10899,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
        kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
        kvm_rip_write(vcpu, 0xfff0);

+       vcpu->arch.cr3 = 0;
+       kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
+
        /*
         * CR0.CD/NW are set on RESET, preserved on INIT.  Note, some versions
         * of Intel's SDM list CD/NW as being set on INIT, but they contradict

> Side topic: we don't seem to reset Hyper-V specific MSRs either,
> probably relying on userspace VMM doing the right thing but this is also
> not ideal I believe.

Vitaly Kuznetsov Sept. 16, 2021, 7:19 a.m. UTC | #3

Sean Christopherson <seanjc@google.com> writes:

> On Wed, Sep 15, 2021, Vitaly Kuznetsov wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>> > +static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
>> > +{
>> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>> > +
>> > +	init_vmcs(vmx);
>> > +
>> > +	if (nested)
>> > +		memcpy(&vmx->nested.msrs, &vmcs_config.nested, sizeof(vmx->nested.msrs));
>> > +
>> > +	vcpu_setup_sgx_lepubkeyhash(vcpu);
>> > +
>> > +	vmx->nested.posted_intr_nv = -1;
>> > +	vmx->nested.current_vmptr = -1ull;
>> > +	vmx->nested.hv_evmcs_vmptr = EVMPTR_INVALID;
>> 
>> What would happen in (hypothetical) case when enlightened VMCS is
>> currently in use? If we zap 'hv_evmcs_vmptr' here, the consequent
>> nested_release_evmcs() (called from
>> nested_vmx_handle_enlightened_vmptrld(), for example) will not do 
>> kvm_vcpu_unmap() while it should.
>
> The short answer is that there's a lot of stuff that needs to be addressed before
> KVM can expose a RESET ioctl().  My goal with these patches is to carve out the
> stubs and move the few bits of RESET emulation into the "stubs".  This is the same
> answer for the MSR question/comment at the end.
>
>> This, however, got me thinking: should we free all-things-nested with
>> free_nested()/nested_vmx_free_vcpu() upon vcpu reset? I can't seem to
>> find us doing that... (I do remember that INIT is blocked in VMX-root
>> mode and nobody else besides kvm_arch_vcpu_create()/
>> kvm_apic_accept_events() seems to call kvm_vcpu_reset()) but maybe we
>> should at least add a WARN_ON() guardian here?
>
> I think that makes sense.  Maybe use CR0 as a sentinel since it has a non-zero
> RESET value?  E.g. WARN if CR0 is non-zero at RESET.
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 86539c1686fa..3ac074376821 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10813,6 +10813,11 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>         unsigned long new_cr0;
>         u32 eax, dummy;
>
> +       /*
> +        * <comment about KVM not supporting arbitrary RESET>
> +        */
> +       WARN_ON_ONCE(!init_event && old_cr0);
> +
>         kvm_lapic_reset(vcpu, init_event);
>
>         vcpu->arch.hflags = 0;
>

Should work (assuming nothing in VMX would want to call vmx_vcpu_reset()/
__vmx_vcpu_reset() directly).

> Huh, typing that out made me realize commit 0aa1837533e5 ("KVM: x86: Properly
> reset MMU context at vCPU RESET/INIT") technically introduced a bug.  kvm_vcpu_reset()
> does kvm_read_cr0() and thus reads vmcs.GUEST_CR0 because vcpu->arch.regs_avail is
> (correctly) not stuffed to ALL_ONES until later in kvm_vcpu_reset().  init_vmcs()
> doesn't explicitly zero vmcs.GUEST_CR0 (along with many other guest fields), and
> so VMREAD(GUEST_CR0) is technically consuming garbage.  In practice, it's consuming
> '0' because no known CPU or VMM inverts values in the VMCS, i.e. zero allocating
> the VMCS is functionally equivalent to writing '0' to all fields via VMWRITE.
>
> And staring more at kvm_vcpu_reset(), this code is terrifying for INIT
>
> 	memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs));
> 	vcpu->arch.regs_avail = ~0;
> 	vcpu->arch.regs_dirty = ~0;
>
> because it means cr0 and cr4 are marked available+dirty without immediately writing
> vcpu->arch.cr0/cr4.  And VMX subtly relies on that, as vmx_set_cr0() grabs CR0.PG
> via kvm_read_cr0_bits(), i.e. zeroing vcpu->arch.cr0 would "break" the INIT flow.
> Ignoring for the moment that CR0.PG is never guest-owned and thus never stale in
> vcpu->arch.cr0, KVM is also technically relying on the earlier kvm_read_cr0() in
> kvm_vcpu_reset() to ensure vcpu->arch.cr0 is fresh.
>
> Stuffing regs_avail technically means vmx_set_rflags() -> vmx_get_rflags() is
> consuming stale data.  It doesn't matter in practice because the old value is
> only used to re-evaluate vmx->emulation_required, which is guaranteed to be up
> refreshed by vmx_set_cr0() and friends.  PDPTRs and EXIT_INFO are in a similar
> boat; KVM shouldn't be reading those fields (CR0.PG=0, not an exit path), but
> marking them dirty without actually updating the cached values is wrong.
>
> There's also one concrete bug: KVM doesn't set vcpu->arch.cr3=0 on RESET/INIT.
> That bug has gone unnoticed because no real word BIOS/kernel is going to rely on
> INIT to set CR3=0.  
>
> I'm strongly leaning towards stuffing regs_avail/dirty in kvm_arch_vcpu_create(),
> and relying on explicit kvm_register_mark_dirty() calls for the 3 or so cases where
> x86 code writes a vcpu->arch register directly.  That would fix the CR0 read bug
> and also prevent subtle bugs from sneaking in.  Adding new EXREGS would be slightly
> more costly, but IMO that's a good thing as would force us to actually think about
> how to handle each register.
>
> E.g. implement this over 2-3 patches:
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 114847253e0a..743146ac8307 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4385,6 +4385,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>         kvm_set_cr8(vcpu, 0);
>
>         vmx_segment_cache_clear(vmx);
> +       kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
>
>         seg_setup(VCPU_SREG_CS);
>         vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 86539c1686fa..ab907a0b9eeb 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10656,6 +10656,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>         int r;
>
>         vcpu->arch.last_vmentry_cpu = -1;
> +       vcpu->arch.regs_avail = ~0;
> +       vcpu->arch.regs_dirty = ~0;
>
>         if (!irqchip_in_kernel(vcpu->kvm) || kvm_vcpu_is_reset_bsp(vcpu))
>                 vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> @@ -10874,9 +10876,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>                 vcpu->arch.xcr0 = XFEATURE_MASK_FP;
>         }
>
> +       /* All GPRs except RDX (handled below) are zeroed on RESET/INIT. */
>         memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs));
> -       vcpu->arch.regs_avail = ~0;
> -       vcpu->arch.regs_dirty = ~0;
> +       kvm_register_mark_dirty(vcpu, VCPU_REGS_RSP);
>
>         /*
>          * Fall back to KVM's default Family/Model/Stepping of 0x600 (P6/Athlon)
> @@ -10897,6 +10899,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>         kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
>         kvm_rip_write(vcpu, 0xfff0);
>
> +       vcpu->arch.cr3 = 0;
> +       kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
> +
>         /*
>          * CR0.CD/NW are set on RESET, preserved on INIT.  Note, some versions
>          * of Intel's SDM list CD/NW as being set on INIT, but they contradict
>

A selftest for vCPU create/reset would be really helpful. I can even
volunteer to [eventually] write one :-)

Sean Christopherson Sept. 16, 2021, 7:01 p.m. UTC | #4

On Thu, Sep 16, 2021, Vitaly Kuznetsov wrote:
> Sean Christopherson <seanjc@google.com> writes:
> > @@ -10897,6 +10899,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> >         kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
> >         kvm_rip_write(vcpu, 0xfff0);
> >
> > +       vcpu->arch.cr3 = 0;
> > +       kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
> > +
> >         /*
> >          * CR0.CD/NW are set on RESET, preserved on INIT.  Note, some versions
> >          * of Intel's SDM list CD/NW as being set on INIT, but they contradict
> >
> 
> A selftest for vCPU create/reset would be really helpful. I can even
> volunteer to [eventually] write one :-)

Hmm, I wonder if it would be possible to share code/infrastructure with Erdem's
in-progress TDX selftest framework[*].  TDX forces vCPUs to start at the legacy
reset vector with paging disabled, so it needs a lot of the same glue code as a
from-RESET test would need.  TDX forces 32-bit PM instead of RM, but it should
be easy enough to allow an optional opening sequence to get into 32-bit PM.

We could also test INIT without much trouble since INIT to the BSP will send it
back to the reset vector, e.g. set a flag somewhere to avoid an infinite loop and
INIT self.

Let me work with Erdem to see if we can concoct something that will work for
both TDX and tests that want to take control at RESET.

[*] https://lkml.kernel.org/r/20210726183816.1343022-3-erdemaktas@google.com

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index dc274b4c9912..629427cf8f4e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4327,10 +4327,6 @@  static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
 
 #define VMX_XSS_EXIT_BITMAP 0
 
-/*
- * Noting that the initialization of Guest-state Area of VMCS is in
- * vmx_vcpu_reset().
- */
 static void init_vmcs(struct vcpu_vmx *vmx)
 {
 	if (nested)
@@ -4435,10 +4431,39 @@  static void init_vmcs(struct vcpu_vmx *vmx)
 	vmx_setup_uret_msrs(vmx);
 }
 
+static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	init_vmcs(vmx);
+
+	if (nested)
+		memcpy(&vmx->nested.msrs, &vmcs_config.nested, sizeof(vmx->nested.msrs));
+
+	vcpu_setup_sgx_lepubkeyhash(vcpu);
+
+	vmx->nested.posted_intr_nv = -1;
+	vmx->nested.current_vmptr = -1ull;
+	vmx->nested.hv_evmcs_vmptr = EVMPTR_INVALID;
+
+	vcpu->arch.microcode_version = 0x100000000ULL;
+	vmx->msr_ia32_feature_control_valid_bits = FEAT_CTL_LOCKED;
+
+	/*
+	 * Enforce invariant: pi_desc.nv is always either POSTED_INTR_VECTOR
+	 * or POSTED_INTR_WAKEUP_VECTOR.
+	 */
+	vmx->pi_desc.nv = POSTED_INTR_VECTOR;
+	vmx->pi_desc.sn = 1;
+}
+
 static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+	if (!init_event)
+		__vmx_vcpu_reset(vcpu);
+
 	vmx->rmode.vm86_active = 0;
 	vmx->spec_ctrl = 0;
 
@@ -6797,7 +6822,7 @@  static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
 {
 	struct vmx_uret_msr *tsx_ctrl;
 	struct vcpu_vmx *vmx;
-	int i, cpu, err;
+	int i, err;
 
 	BUILD_BUG_ON(offsetof(struct vcpu_vmx, vcpu) != 0);
 	vmx = to_vmx(vcpu);
@@ -6856,12 +6881,7 @@  static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
 	}
 
 	vmx->loaded_vmcs = &vmx->vmcs01;
-	cpu = get_cpu();
-	vmx_vcpu_load(vcpu, cpu);
-	vcpu->cpu = cpu;
-	init_vmcs(vmx);
-	vmx_vcpu_put(vcpu);
-	put_cpu();
+
 	if (cpu_need_virtualize_apic_accesses(vcpu)) {
 		err = alloc_apic_access_page(vcpu->kvm);
 		if (err)
@@ -6874,25 +6894,6 @@  static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
 			goto free_vmcs;
 	}
 
-	if (nested)
-		memcpy(&vmx->nested.msrs, &vmcs_config.nested, sizeof(vmx->nested.msrs));
-
-	vcpu_setup_sgx_lepubkeyhash(vcpu);
-
-	vmx->nested.posted_intr_nv = -1;
-	vmx->nested.current_vmptr = -1ull;
-	vmx->nested.hv_evmcs_vmptr = EVMPTR_INVALID;
-
-	vcpu->arch.microcode_version = 0x100000000ULL;
-	vmx->msr_ia32_feature_control_valid_bits = FEAT_CTL_LOCKED;
-
-	/*
-	 * Enforce invariant: pi_desc.nv is always either POSTED_INTR_VECTOR
-	 * or POSTED_INTR_WAKEUP_VECTOR.
-	 */
-	vmx->pi_desc.nv = POSTED_INTR_VECTOR;
-	vmx->pi_desc.sn = 1;
-
 	return 0;
 
 free_vmcs:

[2/3] KVM: VMX: Move RESET emulation to vmx_vcpu_reset()

Commit Message

Comments

Patch