diff mbox series

[Part2,RFC,v4,40/40] KVM: SVM: Support SEV-SNP AP Creation NAE event

Message ID 20210707183616.5620-41-brijesh.singh@amd.com (mailing list archive)
State New
Headers show
Series Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support | expand

Commit Message

Brijesh Singh July 7, 2021, 6:36 p.m. UTC
From: Tom Lendacky <thomas.lendacky@amd.com>

Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP
guests to create and start APs on their own.

A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used
so as to avoid updating the VMSA pointer while the vCPU is running.

For CREATE
  The guest supplies the GPA of the VMSA to be used for the vCPU with the
  specified APIC ID. The GPA is saved in the svm struct of the target
  vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added to the
  vCPU and then the vCPU is kicked.

For CREATE_ON_INIT:
  The guest supplies the GPA of the VMSA to be used for the vCPU with the
  specified APIC ID the next time an INIT is performed. The GPA is saved
  in the svm struct of the target vCPU.

For DESTROY:
  The guest indicates it wishes to stop the vCPU. The GPA is cleared from
  the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added
  to vCPU and then the vCPU is kicked.


The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked as
a result of the event or as a result of an INIT. The handler sets the vCPU
to the KVM_MP_STATE_UNINITIALIZED state, so that any errors will leave the
vCPU as not runnable. Any previous VMSA pages that were installed as
part of an SEV-SNP AP Creation NAE event are un-pinned. If a new VMSA is
to be installed, the VMSA guest page is pinned and set as the VMSA in the
vCPU VMCB and the vCPU state is set to KVM_MP_STATE_RUNNABLE. If a new
VMSA is not to be installed, the VMSA is cleared in the vCPU VMCB and the
vCPU state is left as KVM_MP_STATE_UNINITIALIZED to prevent it from being
run.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/kvm_host.h |   3 +
 arch/x86/include/asm/svm.h      |   3 +
 arch/x86/kvm/svm/sev.c          | 133 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c          |   7 +-
 arch/x86/kvm/svm/svm.h          |  16 +++-
 arch/x86/kvm/x86.c              |  11 ++-
 6 files changed, 170 insertions(+), 3 deletions(-)

Comments

Sean Christopherson July 21, 2021, 12:01 a.m. UTC | #1
On Wed, Jul 07, 2021, Brijesh Singh wrote:
> From: Tom Lendacky <thomas.lendacky@amd.com>
> 
> Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP
> guests to create and start APs on their own.

The changelog really needs to clarify that this doesn't allow the guest to create
arbitrary vCPUs.  The GHCB uses CREATE/DESTROY terminology, but this patch and its
comments/documentation should very clearly call out that KVM's implementation is
more along the line of vCPU online/offline.

It should also be noted that KVM still onlines APs by default.  That also raises
the question of whether or not KVM should support creating an offlined vCPU.
E.g. several of the use cases I'm aware of want to do something along the lines
of creating a VM with the max number of theoretical vCPUs, but in most instances
only run a handful of vCPUs.  That's a fair amount of potential memory savings
if the max theoretical number of vCPUs is high.

> A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used
> so as to avoid updating the VMSA pointer while the vCPU is running.
> 
> For CREATE
>   The guest supplies the GPA of the VMSA to be used for the vCPU with the
>   specified APIC ID. The GPA is saved in the svm struct of the target
>   vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added to the
>   vCPU and then the vCPU is kicked.
> 
> For CREATE_ON_INIT:
>   The guest supplies the GPA of the VMSA to be used for the vCPU with the
>   specified APIC ID the next time an INIT is performed. The GPA is saved
>   in the svm struct of the target vCPU.
> 
> For DESTROY:
>   The guest indicates it wishes to stop the vCPU. The GPA is cleared from
>   the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added
>   to vCPU and then the vCPU is kicked.
> 
> 
> The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked as
> a result of the event or as a result of an INIT. The handler sets the vCPU
> to the KVM_MP_STATE_UNINITIALIZED state, so that any errors will leave the
> vCPU as not runnable. Any previous VMSA pages that were installed as
> part of an SEV-SNP AP Creation NAE event are un-pinned. If a new VMSA is
> to be installed, the VMSA guest page is pinned and set as the VMSA in the
> vCPU VMCB and the vCPU state is set to KVM_MP_STATE_RUNNABLE. If a new
> VMSA is not to be installed, the VMSA is cleared in the vCPU VMCB and the
> vCPU state is left as KVM_MP_STATE_UNINITIALIZED to prevent it from being
> run.
> 
> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> ---
>  arch/x86/include/asm/kvm_host.h |   3 +
>  arch/x86/include/asm/svm.h      |   3 +
>  arch/x86/kvm/svm/sev.c          | 133 ++++++++++++++++++++++++++++++++
>  arch/x86/kvm/svm/svm.c          |   7 +-
>  arch/x86/kvm/svm/svm.h          |  16 +++-
>  arch/x86/kvm/x86.c              |  11 ++-
>  6 files changed, 170 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 117e2e08d7ed..881e05b3f74e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -91,6 +91,7 @@
>  #define KVM_REQ_MSR_FILTER_CHANGED	KVM_ARCH_REQ(29)
>  #define KVM_REQ_UPDATE_CPU_DIRTY_LOGGING \
>  	KVM_ARCH_REQ_FLAGS(30, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> +#define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE	KVM_ARCH_REQ(31)
>  
>  #define CR0_RESERVED_BITS                                               \
>  	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> @@ -1402,6 +1403,8 @@ struct kvm_x86_ops {
>  
>  	int (*handle_rmp_page_fault)(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
>  			int level, u64 error_code);
> +
> +	void (*update_protected_guest_state)(struct kvm_vcpu *vcpu);
>  };
>  
>  struct kvm_x86_nested_ops {
> diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
> index 5e72faa00cf2..6634a952563e 100644
> --- a/arch/x86/include/asm/svm.h
> +++ b/arch/x86/include/asm/svm.h
> @@ -220,6 +220,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
>  #define SVM_SEV_FEATURES_DEBUG_SWAP		BIT(5)
>  #define SVM_SEV_FEATURES_PREVENT_HOST_IBS	BIT(6)
>  #define SVM_SEV_FEATURES_BTB_ISOLATION		BIT(7)
> +#define SVM_SEV_FEATURES_INT_INJ_MODES			\
> +	(SVM_SEV_FEATURES_RESTRICTED_INJECTION |	\
> +	 SVM_SEV_FEATURES_ALTERNATE_INJECTION)
>  
>  struct vmcb_seg {
>  	u16 selector;
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index d8ad6dd58c87..95f5d25b4f08 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -582,6 +582,7 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
>  
>  static int sev_es_sync_vmsa(struct vcpu_svm *svm)
>  {
> +	struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info;
>  	struct sev_es_save_area *save = svm->vmsa;
>  
>  	/* Check some debug related fields before encrypting the VMSA */
> @@ -625,6 +626,12 @@ static int sev_es_sync_vmsa(struct vcpu_svm *svm)
>  	if (sev_snp_guest(svm->vcpu.kvm))
>  		save->sev_features |= SVM_SEV_FEATURES_SNP_ACTIVE;
>  
> +	/*
> +	 * Save the VMSA synced SEV features. For now, they are the same for
> +	 * all vCPUs, so just save each time.
> +	 */
> +	sev->sev_features = save->sev_features;
> +
>  	return 0;
>  }
>  
> @@ -2682,6 +2689,10 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
>  		if (!ghcb_sw_scratch_is_valid(ghcb))
>  			goto vmgexit_err;
>  		break;
> +	case SVM_VMGEXIT_AP_CREATION:
> +		if (!ghcb_rax_is_valid(ghcb))
> +			goto vmgexit_err;
> +		break;
>  	case SVM_VMGEXIT_NMI_COMPLETE:
>  	case SVM_VMGEXIT_AP_HLT_LOOP:
>  	case SVM_VMGEXIT_AP_JUMP_TABLE:
> @@ -3395,6 +3406,121 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
>  	return ret;
>  }
>  
> +void sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +	kvm_pfn_t pfn;
> +
> +	mutex_lock(&svm->snp_vmsa_mutex);
> +
> +	vcpu->arch.mp_state = KVM_MP_STATE_UNINITIALIZED;
> +
> +	/* Clear use of the VMSA in the sev_es_init_vmcb() path */
> +	svm->vmsa_pa = 0;
> +
> +	/* Clear use of the VMSA from the VMCB */
> +	svm->vmcb->control.vmsa_pa = 0;

PA=0 is not an invalid address.  I don't care what value the GHCB uses for
"invalid GPA", KVM should always use INVALID_PAGE to track an invalid physical
address.

> +	/* Un-pin previous VMSA */
> +	if (svm->snp_vmsa_pfn) {
> +		kvm_release_pfn_dirty(svm->snp_vmsa_pfn);

Oof, I was wondering why KVM tracks three versions of VMSA.  Actually, I'm still
wondering why there are three versions.  Aren't snp_vmsa_pfn and vmsa_pa tracking
the same thing?  Ah, finally figured it out.  vmsa_pa points at svm->vmsa by
default.  Blech.

> +		svm->snp_vmsa_pfn = 0;
> +	}
> +
> +	if (svm->snp_vmsa_gpa) {

This is bogus, GPA=0 is perfectly valid.  As above, use INVALID_PAGE.  A comment
explaining that the vCPU is offline when VMSA is invalid would also be helpful.

> +		/* Validate that the GPA is page aligned */
> +		if (!PAGE_ALIGNED(svm->snp_vmsa_gpa))

This needs to be moved to the VMGEXIT, and it should use page_address_valid() so
that KVM also checks for a legal GPA.

> +			goto e_unlock;
> +
> +		/*
> +		 * The VMSA is referenced by thy hypervisor physical address,

s/thy/the, although converting to archaic English could be hilarious...

> +		 * so retrieve the PFN and pin it.
> +		 */
> +		pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(svm->snp_vmsa_gpa));
> +		if (is_error_pfn(pfn))
> +			goto e_unlock;

Silently ignoring the guest request is bad behavior, at worst KVM should exit to
userspace with an emulation error.

> +
> +		svm->snp_vmsa_pfn = pfn;
> +
> +		/* Use the new VMSA in the sev_es_init_vmcb() path */
> +		svm->vmsa_pa = pfn_to_hpa(pfn);
> +		svm->vmcb->control.vmsa_pa = svm->vmsa_pa;
> +
> +		vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> +	} else {
> +		vcpu->arch.pv.pv_unhalted = false;

Shouldn't the RUNNABLE path also clear pv_unhalted?

> +		vcpu->arch.mp_state = KVM_MP_STATE_UNINITIALIZED;

What happens if userspace calls kvm_arch_vcpu_ioctl_set_mpstate, or even worse
the guest sends INIT-SIPI?  Unless I'm mistaken, either case will cause KVM to
run the vCPU with vmcb->control.vmsa_pa==0.

My initial reaction is that the "offline" case needs a new mp_state, or maybe
just use KVM_MP_STATE_STOPPED.

> +	}
> +
> +e_unlock:
> +	mutex_unlock(&svm->snp_vmsa_mutex);
> +}
> +
> +static void sev_snp_ap_creation(struct vcpu_svm *svm)
> +{
> +	struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info;
> +	struct kvm_vcpu *vcpu = &svm->vcpu;
> +	struct kvm_vcpu *target_vcpu;
> +	struct vcpu_svm *target_svm;
> +	unsigned int request;
> +	unsigned int apic_id;
> +	bool kick;
> +
> +	request = lower_32_bits(svm->vmcb->control.exit_info_1);
> +	apic_id = upper_32_bits(svm->vmcb->control.exit_info_1);
> +
> +	/* Validate the APIC ID */
> +	target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, apic_id);
> +	if (!target_vcpu)
> +		return;

KVM should not silently ignore bad requests, this needs to return an error to the
guest.

> +
> +	target_svm = to_svm(target_vcpu);
> +
> +	kick = true;

This is wrong, e.g. KVM will kick the target vCPU even if the request fails.
I suspect the correct behavior would be to:

  1. do all sanity checks
  2. take the necessary lock(s)
  3. modify target vCPU state
  4. kick target vCPU unless request==SVM_VMGEXIT_AP_CREATE_ON_INIT

> +	mutex_lock(&target_svm->snp_vmsa_mutex);

This seems like it's missing a big pile of sanity checks.  E.g. KVM should reject
SVM_VMGEXIT_AP_CREATE if the target vCPU is already "created", including the case
where it was "created_on_init" but hasn't yet received INIT-SIPI.

> +
> +	target_svm->snp_vmsa_gpa = 0;
> +	target_svm->snp_vmsa_update_on_init = false;
> +
> +	/* Interrupt injection mode shouldn't change for AP creation */
> +	if (request < SVM_VMGEXIT_AP_DESTROY) {
> +		u64 sev_features;
> +
> +		sev_features = vcpu->arch.regs[VCPU_REGS_RAX];
> +		sev_features ^= sev->sev_features;
> +		if (sev_features & SVM_SEV_FEATURES_INT_INJ_MODES) {

Why is only INT_INJ_MODES checked?  The new comment in sev_es_sync_vmsa() explicitly
states that sev_features are the same for all vCPUs, but that's not enforced here.
At a bare minimum I would expect this to sanity check SVM_SEV_FEATURES_SNP_ACTIVE.

> +			vcpu_unimpl(vcpu, "vmgexit: invalid AP injection mode [%#lx] from guest\n",
> +				    vcpu->arch.regs[VCPU_REGS_RAX]);
> +			goto out;
> +		}
> +	}
> +
> +	switch (request) {
> +	case SVM_VMGEXIT_AP_CREATE_ON_INIT:

Out of curiosity, what's the use case for this variant?  I assume the guest has
to preconfigure the VMSA and ensure the target vCPU's RIP points at something
sane anyways, otherwise the hypervisor could attack the guest by immediately
attempting to run the deferred vCPU.  At that point, a guest could simply use an
existing mechanism to put the target vCPU into a holding pattern.

> +		kick = false;
> +		target_svm->snp_vmsa_update_on_init = true;
> +		fallthrough;
> +	case SVM_VMGEXIT_AP_CREATE:
> +		target_svm->snp_vmsa_gpa = svm->vmcb->control.exit_info_2;

The incoming GPA needs to be checked for validity, at least as much possible.
E.g. the PAGE_ALIGNED() check should be done here and be morphed to a synchronous
error for the guest, not a silent "oops, didn't run your vCPU".

> +		break;
> +	case SVM_VMGEXIT_AP_DESTROY:
> +		break;
> +	default:
> +		vcpu_unimpl(vcpu, "vmgexit: invalid AP creation request [%#x] from guest\n",
> +			    request);
> +		break;
> +	}
> +
> +out:
> +	mutex_unlock(&target_svm->snp_vmsa_mutex);
> +
> +	if (kick) {
> +		kvm_make_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, target_vcpu);
> +		kvm_vcpu_kick(target_vcpu);
> +	}
> +}
> +
>  int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
> @@ -3523,6 +3649,11 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
>  		ret = 1;
>  		break;
>  	}
> +	case SVM_VMGEXIT_AP_CREATION:
> +		sev_snp_ap_creation(svm);
> +
> +		ret = 1;
> +		break;
>  	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
>  		vcpu_unimpl(vcpu,
>  			    "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
> @@ -3597,6 +3728,8 @@ void sev_es_create_vcpu(struct vcpu_svm *svm)
>  	set_ghcb_msr(svm, GHCB_MSR_SEV_INFO(GHCB_VERSION_MAX,
>  					    GHCB_VERSION_MIN,
>  					    sev_enc_bit));
> +
> +	mutex_init(&svm->snp_vmsa_mutex);
>  }
>  
>  void sev_es_prepare_guest_switch(struct vcpu_svm *svm, unsigned int cpu)
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 74bc635c9608..078a569c85a8 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1304,7 +1304,10 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>  	svm->spec_ctrl = 0;
>  	svm->virt_spec_ctrl = 0;
>  
> -	if (!init_event) {
> +	if (init_event && svm->snp_vmsa_update_on_init) {

This can race with sev_snp_ap_creation() since the new snp_vmsa_mutex isn't held.
There needs to be smp_rmb() and smp_wmb() barriers to ensure correct ordering
between snp_vmsa_update_on_init and consuming the new VMSA gpa.  And of course
sev_snp_ap_creation() needs to have correct ordering, e.g. as is this code can
see snp_vmsa_update_on_init=true before the new snp_vmsa_gpa is set.

> +		svm->snp_vmsa_update_on_init = false;
> +		sev_snp_update_protected_guest_state(vcpu);
> +	} else {
>  		vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE |
>  				       MSR_IA32_APICBASE_ENABLE;
>  		if (kvm_vcpu_is_reset_bsp(vcpu))
Tom Lendacky July 21, 2021, 5:47 p.m. UTC | #2
On 7/20/21 7:01 PM, Sean Christopherson wrote:
> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>> From: Tom Lendacky <thomas.lendacky@amd.com>
>>
>> Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP
>> guests to create and start APs on their own.
> 
> The changelog really needs to clarify that this doesn't allow the guest to create
> arbitrary vCPUs.  The GHCB uses CREATE/DESTROY terminology, but this patch and its
> comments/documentation should very clearly call out that KVM's implementation is
> more along the line of vCPU online/offline.

Will update.

> 
> It should also be noted that KVM still onlines APs by default.  That also raises
> the question of whether or not KVM should support creating an offlined vCPU.
> E.g. several of the use cases I'm aware of want to do something along the lines
> of creating a VM with the max number of theoretical vCPUs, but in most instances
> only run a handful of vCPUs.  That's a fair amount of potential memory savings
> if the max theoretical number of vCPUs is high.
> 
>> A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used
>> so as to avoid updating the VMSA pointer while the vCPU is running.
>>
>> For CREATE
>>   The guest supplies the GPA of the VMSA to be used for the vCPU with the
>>   specified APIC ID. The GPA is saved in the svm struct of the target
>>   vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added to the
>>   vCPU and then the vCPU is kicked.
>>
>> For CREATE_ON_INIT:
>>   The guest supplies the GPA of the VMSA to be used for the vCPU with the
>>   specified APIC ID the next time an INIT is performed. The GPA is saved
>>   in the svm struct of the target vCPU.
>>
>> For DESTROY:
>>   The guest indicates it wishes to stop the vCPU. The GPA is cleared from
>>   the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added
>>   to vCPU and then the vCPU is kicked.
>>
>>
>> The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked as
>> a result of the event or as a result of an INIT. The handler sets the vCPU
>> to the KVM_MP_STATE_UNINITIALIZED state, so that any errors will leave the
>> vCPU as not runnable. Any previous VMSA pages that were installed as
>> part of an SEV-SNP AP Creation NAE event are un-pinned. If a new VMSA is
>> to be installed, the VMSA guest page is pinned and set as the VMSA in the
>> vCPU VMCB and the vCPU state is set to KVM_MP_STATE_RUNNABLE. If a new
>> VMSA is not to be installed, the VMSA is cleared in the vCPU VMCB and the
>> vCPU state is left as KVM_MP_STATE_UNINITIALIZED to prevent it from being
>> run.
>>
>> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
>> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
>> ---
>>  arch/x86/include/asm/kvm_host.h |   3 +
>>  arch/x86/include/asm/svm.h      |   3 +
>>  arch/x86/kvm/svm/sev.c          | 133 ++++++++++++++++++++++++++++++++
>>  arch/x86/kvm/svm/svm.c          |   7 +-
>>  arch/x86/kvm/svm/svm.h          |  16 +++-
>>  arch/x86/kvm/x86.c              |  11 ++-
>>  6 files changed, 170 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 117e2e08d7ed..881e05b3f74e 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -91,6 +91,7 @@
>>  #define KVM_REQ_MSR_FILTER_CHANGED	KVM_ARCH_REQ(29)
>>  #define KVM_REQ_UPDATE_CPU_DIRTY_LOGGING \
>>  	KVM_ARCH_REQ_FLAGS(30, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>> +#define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE	KVM_ARCH_REQ(31)
>>  
>>  #define CR0_RESERVED_BITS                                               \
>>  	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
>> @@ -1402,6 +1403,8 @@ struct kvm_x86_ops {
>>  
>>  	int (*handle_rmp_page_fault)(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
>>  			int level, u64 error_code);
>> +
>> +	void (*update_protected_guest_state)(struct kvm_vcpu *vcpu);
>>  };
>>  
>>  struct kvm_x86_nested_ops {
>> diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
>> index 5e72faa00cf2..6634a952563e 100644
>> --- a/arch/x86/include/asm/svm.h
>> +++ b/arch/x86/include/asm/svm.h
>> @@ -220,6 +220,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
>>  #define SVM_SEV_FEATURES_DEBUG_SWAP		BIT(5)
>>  #define SVM_SEV_FEATURES_PREVENT_HOST_IBS	BIT(6)
>>  #define SVM_SEV_FEATURES_BTB_ISOLATION		BIT(7)
>> +#define SVM_SEV_FEATURES_INT_INJ_MODES			\
>> +	(SVM_SEV_FEATURES_RESTRICTED_INJECTION |	\
>> +	 SVM_SEV_FEATURES_ALTERNATE_INJECTION)
>>  
>>  struct vmcb_seg {
>>  	u16 selector;
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index d8ad6dd58c87..95f5d25b4f08 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -582,6 +582,7 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
>>  
>>  static int sev_es_sync_vmsa(struct vcpu_svm *svm)
>>  {
>> +	struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info;
>>  	struct sev_es_save_area *save = svm->vmsa;
>>  
>>  	/* Check some debug related fields before encrypting the VMSA */
>> @@ -625,6 +626,12 @@ static int sev_es_sync_vmsa(struct vcpu_svm *svm)
>>  	if (sev_snp_guest(svm->vcpu.kvm))
>>  		save->sev_features |= SVM_SEV_FEATURES_SNP_ACTIVE;
>>  
>> +	/*
>> +	 * Save the VMSA synced SEV features. For now, they are the same for
>> +	 * all vCPUs, so just save each time.
>> +	 */
>> +	sev->sev_features = save->sev_features;
>> +
>>  	return 0;
>>  }
>>  
>> @@ -2682,6 +2689,10 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
>>  		if (!ghcb_sw_scratch_is_valid(ghcb))
>>  			goto vmgexit_err;
>>  		break;
>> +	case SVM_VMGEXIT_AP_CREATION:
>> +		if (!ghcb_rax_is_valid(ghcb))
>> +			goto vmgexit_err;
>> +		break;
>>  	case SVM_VMGEXIT_NMI_COMPLETE:
>>  	case SVM_VMGEXIT_AP_HLT_LOOP:
>>  	case SVM_VMGEXIT_AP_JUMP_TABLE:
>> @@ -3395,6 +3406,121 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
>>  	return ret;
>>  }
>>  
>> +void sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu)
>> +{
>> +	struct vcpu_svm *svm = to_svm(vcpu);
>> +	kvm_pfn_t pfn;
>> +
>> +	mutex_lock(&svm->snp_vmsa_mutex);
>> +
>> +	vcpu->arch.mp_state = KVM_MP_STATE_UNINITIALIZED;
>> +
>> +	/* Clear use of the VMSA in the sev_es_init_vmcb() path */
>> +	svm->vmsa_pa = 0;
>> +
>> +	/* Clear use of the VMSA from the VMCB */
>> +	svm->vmcb->control.vmsa_pa = 0;
> 
> PA=0 is not an invalid address.  I don't care what value the GHCB uses for
> "invalid GPA", KVM should always use INVALID_PAGE to track an invalid physical
> address.

Right, I'll switch everything over to INVALID_PAGE, which will also cause
an error on VMRUN, so that is good.

> 
>> +	/* Un-pin previous VMSA */
>> +	if (svm->snp_vmsa_pfn) {
>> +		kvm_release_pfn_dirty(svm->snp_vmsa_pfn);
> 
> Oof, I was wondering why KVM tracks three versions of VMSA.  Actually, I'm still
> wondering why there are three versions.  Aren't snp_vmsa_pfn and vmsa_pa tracking
> the same thing?  Ah, finally figured it out.  vmsa_pa points at svm->vmsa by
> default.  Blech.
> 
>> +		svm->snp_vmsa_pfn = 0;
>> +	}
>> +
>> +	if (svm->snp_vmsa_gpa) {
> 
> This is bogus, GPA=0 is perfectly valid.  As above, use INVALID_PAGE.  A comment
> explaining that the vCPU is offline when VMSA is invalid would also be helpful.

Will do.

> 
>> +		/* Validate that the GPA is page aligned */
>> +		if (!PAGE_ALIGNED(svm->snp_vmsa_gpa))
> 
> This needs to be moved to the VMGEXIT, and it should use page_address_valid() so
> that KVM also checks for a legal GPA.

Will do.

> 
>> +			goto e_unlock;
>> +
>> +		/*
>> +		 * The VMSA is referenced by thy hypervisor physical address,
> 
> s/thy/the, although converting to archaic English could be hilarious...

Heh, I'll fix that.

> 
>> +		 * so retrieve the PFN and pin it.
>> +		 */
>> +		pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(svm->snp_vmsa_gpa));
>> +		if (is_error_pfn(pfn))
>> +			goto e_unlock;
> 
> Silently ignoring the guest request is bad behavior, at worst KVM should exit to
> userspace with an emulation error.

You'll end up with a VMRUN failure, which is likely to be the same result.
But I can look at what it would take to exit up to userspace before that.

> 
>> +
>> +		svm->snp_vmsa_pfn = pfn;
>> +
>> +		/* Use the new VMSA in the sev_es_init_vmcb() path */
>> +		svm->vmsa_pa = pfn_to_hpa(pfn);
>> +		svm->vmcb->control.vmsa_pa = svm->vmsa_pa;
>> +
>> +		vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>> +	} else {
>> +		vcpu->arch.pv.pv_unhalted = false;
> 
> Shouldn't the RUNNABLE path also clear pv_unhalted?

If anything it should set it, since it will be "unhalted" now. But, I
looked through the code to try and understand if there was a need to set
it and didn't really see any reason. It is only ever set (at least
directly) in one place and is cleared everywhere else. It was odd to me.

> 
>> +		vcpu->arch.mp_state = KVM_MP_STATE_UNINITIALIZED;
> 
> What happens if userspace calls kvm_arch_vcpu_ioctl_set_mpstate, or even worse
> the guest sends INIT-SIPI?  Unless I'm mistaken, either case will cause KVM to
> run the vCPU with vmcb->control.vmsa_pa==0.

Using the INVALID_PAGE value now (and even when it was 0), you'll get a
VMRUN failure.

The AP CREATE_ON_INIT is meant to be used with INIT-SIPI, so if the guest
hasn't done the right thing, then it will fail on VMRUN.

> 
> My initial reaction is that the "offline" case needs a new mp_state, or maybe
> just use KVM_MP_STATE_STOPPED.

I'll look at KVM_MP_STATE_STOPPED. Qemu doesn't reference that state at
all in the i386 support, though, which is why I went with UNINITIALIZED.

> 
>> +	}
>> +
>> +e_unlock:
>> +	mutex_unlock(&svm->snp_vmsa_mutex);
>> +}
>> +
>> +static void sev_snp_ap_creation(struct vcpu_svm *svm)
>> +{
>> +	struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info;
>> +	struct kvm_vcpu *vcpu = &svm->vcpu;
>> +	struct kvm_vcpu *target_vcpu;
>> +	struct vcpu_svm *target_svm;
>> +	unsigned int request;
>> +	unsigned int apic_id;
>> +	bool kick;
>> +
>> +	request = lower_32_bits(svm->vmcb->control.exit_info_1);
>> +	apic_id = upper_32_bits(svm->vmcb->control.exit_info_1);
>> +
>> +	/* Validate the APIC ID */
>> +	target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, apic_id);
>> +	if (!target_vcpu)
>> +		return;
> 
> KVM should not silently ignore bad requests, this needs to return an error to the
> guest.

This can be made to return an int and then follow standard VMGEXIT processing.

Which reminds me to get a GHCB discussion going on returning some other
form of error besides exceptions.

> 
>> +
>> +	target_svm = to_svm(target_vcpu);
>> +
>> +	kick = true;
> 
> This is wrong, e.g. KVM will kick the target vCPU even if the request fails.
> I suspect the correct behavior would be to:
> 
>   1. do all sanity checks
>   2. take the necessary lock(s)
>   3. modify target vCPU state
>   4. kick target vCPU unless request==SVM_VMGEXIT_AP_CREATE_ON_INIT

I'm not sure... the guest is really trying to do an INIT-SIPI type of
action and if they don't do something correctly, we don't want to leave
the AP running in its previous state, hence the kick, so that it will end
up falling back to UNINITIALIZED state.

> 
>> +	mutex_lock(&target_svm->snp_vmsa_mutex);
> 
> This seems like it's missing a big pile of sanity checks.  E.g. KVM should reject
> SVM_VMGEXIT_AP_CREATE if the target vCPU is already "created", including the case
> where it was "created_on_init" but hasn't yet received INIT-SIPI.

Why? If the guest wants to call it multiple times I guess I don't see a
reason that it would need to call DESTROY first and then CREATE. I don't
know why a guest would want to do that, but I don't think we should
prevent it.

> 
>> +
>> +	target_svm->snp_vmsa_gpa = 0;
>> +	target_svm->snp_vmsa_update_on_init = false;
>> +
>> +	/* Interrupt injection mode shouldn't change for AP creation */
>> +	if (request < SVM_VMGEXIT_AP_DESTROY) {
>> +		u64 sev_features;
>> +
>> +		sev_features = vcpu->arch.regs[VCPU_REGS_RAX];
>> +		sev_features ^= sev->sev_features;
>> +		if (sev_features & SVM_SEV_FEATURES_INT_INJ_MODES) {
> 
> Why is only INT_INJ_MODES checked?  The new comment in sev_es_sync_vmsa() explicitly
> states that sev_features are the same for all vCPUs, but that's not enforced here.
> At a bare minimum I would expect this to sanity check SVM_SEV_FEATURES_SNP_ACTIVE.

That's because we can't really enforce it. The SEV_FEATURES value is part
of the VMSA, of which the hypervisor has no insight into (its encrypted).

The interrupt injection mechanism was specifically requested as a sanity
check type of thing during the GHCB review, and as there were no
objections, it was added (see the end of section 4.1.9).

I can definitely add the check for the SNP_ACTIVE bit, but it isn't required.

> 
>> +			vcpu_unimpl(vcpu, "vmgexit: invalid AP injection mode [%#lx] from guest\n",
>> +				    vcpu->arch.regs[VCPU_REGS_RAX]);
>> +			goto out;
>> +		}
>> +	}
>> +
>> +	switch (request) {
>> +	case SVM_VMGEXIT_AP_CREATE_ON_INIT:
> 
> Out of curiosity, what's the use case for this variant?  I assume the guest has
> to preconfigure the VMSA and ensure the target vCPU's RIP points at something
> sane anyways, otherwise the hypervisor could attack the guest by immediately
> attempting to run the deferred vCPU.  At that point, a guest could simply use an
> existing mechanism to put the target vCPU into a holding pattern.

It came up in early discussions and so was included in the initial draft
of the GHCB SNP support. There wasn't any discussion about it nor any
objections, so it stayed.

> 
>> +		kick = false;
>> +		target_svm->snp_vmsa_update_on_init = true;
>> +		fallthrough;
>> +	case SVM_VMGEXIT_AP_CREATE:
>> +		target_svm->snp_vmsa_gpa = svm->vmcb->control.exit_info_2;
> 
> The incoming GPA needs to be checked for validity, at least as much possible.
> E.g. the PAGE_ALIGNED() check should be done here and be morphed to a synchronous
> error for the guest, not a silent "oops, didn't run your vCPU".

Will do.

> 
>> +		break;
>> +	case SVM_VMGEXIT_AP_DESTROY:
>> +		break;
>> +	default:
>> +		vcpu_unimpl(vcpu, "vmgexit: invalid AP creation request [%#x] from guest\n",
>> +			    request);
>> +		break;
>> +	}
>> +
>> +out:
>> +	mutex_unlock(&target_svm->snp_vmsa_mutex);
>> +
>> +	if (kick) {
>> +		kvm_make_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, target_vcpu);
>> +		kvm_vcpu_kick(target_vcpu);
>> +	}
>> +}
>> +
>>  int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
>>  {
>>  	struct vcpu_svm *svm = to_svm(vcpu);
>> @@ -3523,6 +3649,11 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
>>  		ret = 1;
>>  		break;
>>  	}
>> +	case SVM_VMGEXIT_AP_CREATION:
>> +		sev_snp_ap_creation(svm);
>> +
>> +		ret = 1;
>> +		break;
>>  	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
>>  		vcpu_unimpl(vcpu,
>>  			    "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
>> @@ -3597,6 +3728,8 @@ void sev_es_create_vcpu(struct vcpu_svm *svm)
>>  	set_ghcb_msr(svm, GHCB_MSR_SEV_INFO(GHCB_VERSION_MAX,
>>  					    GHCB_VERSION_MIN,
>>  					    sev_enc_bit));
>> +
>> +	mutex_init(&svm->snp_vmsa_mutex);
>>  }
>>  
>>  void sev_es_prepare_guest_switch(struct vcpu_svm *svm, unsigned int cpu)
>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>> index 74bc635c9608..078a569c85a8 100644
>> --- a/arch/x86/kvm/svm/svm.c
>> +++ b/arch/x86/kvm/svm/svm.c
>> @@ -1304,7 +1304,10 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>>  	svm->spec_ctrl = 0;
>>  	svm->virt_spec_ctrl = 0;
>>  
>> -	if (!init_event) {
>> +	if (init_event && svm->snp_vmsa_update_on_init) {
> 
> This can race with sev_snp_ap_creation() since the new snp_vmsa_mutex isn't held.
> There needs to be smp_rmb() and smp_wmb() barriers to ensure correct ordering
> between snp_vmsa_update_on_init and consuming the new VMSA gpa.  And of course
> sev_snp_ap_creation() needs to have correct ordering, e.g. as is this code can
> see snp_vmsa_update_on_init=true before the new snp_vmsa_gpa is set.

I'll work on that.

Thanks,
Tom

> 
>> +		svm->snp_vmsa_update_on_init = false;
>> +		sev_snp_update_protected_guest_state(vcpu);
>> +	} else {
>>  		vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE |
>>  				       MSR_IA32_APICBASE_ENABLE;
>>  		if (kvm_vcpu_is_reset_bsp(vcpu))
>
Sean Christopherson July 21, 2021, 7:52 p.m. UTC | #3
On Wed, Jul 21, 2021, Tom Lendacky wrote:
> On 7/20/21 7:01 PM, Sean Christopherson wrote:
> > On Wed, Jul 07, 2021, Brijesh Singh wrote:
> >> From: Tom Lendacky <thomas.lendacky@amd.com>
> >> +
> >> +		svm->snp_vmsa_pfn = pfn;
> >> +
> >> +		/* Use the new VMSA in the sev_es_init_vmcb() path */
> >> +		svm->vmsa_pa = pfn_to_hpa(pfn);
> >> +		svm->vmcb->control.vmsa_pa = svm->vmsa_pa;
> >> +
> >> +		vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> >> +	} else {
> >> +		vcpu->arch.pv.pv_unhalted = false;
> > 
> > Shouldn't the RUNNABLE path also clear pv_unhalted?
> 
> If anything it should set it, since it will be "unhalted" now. But, I
> looked through the code to try and understand if there was a need to set
> it and didn't really see any reason. It is only ever set (at least
> directly) in one place and is cleared everywhere else. It was odd to me.

pv_unhalted is specifically used for a "magic" IPI (KVM hijacked a defunct
IPI type) in the context of PV spinlocks.  The idea is that a vCPU that's releasing
a spinlock can kick the next vCPU in the queue, and KVM will directly yield to the
vCPU being kicked so that the guest can efficiently make forward progress.

So it's not wrong to leave pv_unhalted as is, but it's odd to clear it in the
DESTROY case but not CREATE_INIT case.  It should be a moot point, as a sane
implementation should make it impossible to get to CREATE with pv_unhalted=1.

> >> +		vcpu->arch.mp_state = KVM_MP_STATE_UNINITIALIZED;
> > 
> > What happens if userspace calls kvm_arch_vcpu_ioctl_set_mpstate, or even worse
> > the guest sends INIT-SIPI?  Unless I'm mistaken, either case will cause KVM to
> > run the vCPU with vmcb->control.vmsa_pa==0.
> 
> Using the INVALID_PAGE value now (and even when it was 0), you'll get a
> VMRUN failure.
> 
> The AP CREATE_ON_INIT is meant to be used with INIT-SIPI, so if the guest
> hasn't done the right thing, then it will fail on VMRUN.
> 
> > 
> > My initial reaction is that the "offline" case needs a new mp_state, or maybe
> > just use KVM_MP_STATE_STOPPED.
> 
> I'll look at KVM_MP_STATE_STOPPED. Qemu doesn't reference that state at
> all in the i386 support, though, which is why I went with UNINITIALIZED.

Ya, it'd effectively be a new feature.  My concern with UNINITIALIZED is that it
be impossible for KVM to differentiate between "never run" and "destroyed and may
have an invalid VMSA" without looking at the VMSA.

> >> +	mutex_lock(&target_svm->snp_vmsa_mutex);
> > 
> > This seems like it's missing a big pile of sanity checks.  E.g. KVM should reject
> > SVM_VMGEXIT_AP_CREATE if the target vCPU is already "created", including the case
> > where it was "created_on_init" but hasn't yet received INIT-SIPI.
> 
> Why? If the guest wants to call it multiple times I guess I don't see a
> reason that it would need to call DESTROY first and then CREATE. I don't
> know why a guest would want to do that, but I don't think we should
> prevent it.

Because "creating" a vCPU that already exists is non-sensical.  Ditto for
onlining a vCPU that is already onlined.  E.g. from the guest's perspective, I
would fully expect a SVM_VMGEXIT_AP_CREATE to fail, not effectively send the vCPU
to an arbitrary state.

Any ambiguity as to the legality of CREATE/DESTROY absolutely needs to be clarified
in the GHCB.

> >> +
> >> +	target_svm->snp_vmsa_gpa = 0;
> >> +	target_svm->snp_vmsa_update_on_init = false;
> >> +
> >> +	/* Interrupt injection mode shouldn't change for AP creation */
> >> +	if (request < SVM_VMGEXIT_AP_DESTROY) {
> >> +		u64 sev_features;
> >> +
> >> +		sev_features = vcpu->arch.regs[VCPU_REGS_RAX];
> >> +		sev_features ^= sev->sev_features;
> >> +		if (sev_features & SVM_SEV_FEATURES_INT_INJ_MODES) {
> > 
> > Why is only INT_INJ_MODES checked?  The new comment in sev_es_sync_vmsa() explicitly
> > states that sev_features are the same for all vCPUs, but that's not enforced here.
> > At a bare minimum I would expect this to sanity check SVM_SEV_FEATURES_SNP_ACTIVE.
> 
> That's because we can't really enforce it. The SEV_FEATURES value is part
> of the VMSA, of which the hypervisor has no insight into (its encrypted).
> 
> The interrupt injection mechanism was specifically requested as a sanity
> check type of thing during the GHCB review, and as there were no
> objections, it was added (see the end of section 4.1.9).
> 
> I can definitely add the check for the SNP_ACTIVE bit, but it isn't required.

I'm confused.  If we've no insight into what the guest is actually using, what's
the point of the INT_INJ_MODES check?
Tom Lendacky Aug. 20, 2021, 2:44 p.m. UTC | #4
On 7/21/21 2:52 PM, Sean Christopherson wrote:
> On Wed, Jul 21, 2021, Tom Lendacky wrote:
>> On 7/20/21 7:01 PM, Sean Christopherson wrote:
>>> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>>>> From: Tom Lendacky <thomas.lendacky@amd.com>
>>>
...
>>> This seems like it's missing a big pile of sanity checks.  E.g. KVM should reject
>>> SVM_VMGEXIT_AP_CREATE if the target vCPU is already "created", including the case
>>> where it was "created_on_init" but hasn't yet received INIT-SIPI.
>>
>> Why? If the guest wants to call it multiple times I guess I don't see a
>> reason that it would need to call DESTROY first and then CREATE. I don't
>> know why a guest would want to do that, but I don't think we should
>> prevent it.
> 
> Because "creating" a vCPU that already exists is non-sensical.  Ditto for

Maybe the names of CREATE and DESTROY weren't the best choice. The intent 
is to replace INIT-SIPI.

> onlining a vCPU that is already onlined.  E.g. from the guest's perspective, I
> would fully expect a SVM_VMGEXIT_AP_CREATE to fail, not effectively send the vCPU
> to an arbitrary state.

The GHCB document does not require thos. To accomplish the same thing that 
a CREATE does today, the guest does DESTROY followed by CREATE. What's the 
difference, the DESTROY will kick the vCPU out and make it non-runnable 
and the guest will immediately follow that with a CREATE and set the new 
state. All of which can be accomplished by just calling CREATE to begin 
with - the net is the same, the vCPU gets taken from one state to a new state.

> 
> Any ambiguity as to the legality of CREATE/DESTROY absolutely needs to be clarified
> in the GHCB.

I don't see any ambiguity. The document states when the VMSA becomes 
effective and there's no requirement/need to issue a DESTROY before 
another CREATE.

> 
>>>> +
>>>> +	target_svm->snp_vmsa_gpa = 0;
>>>> +	target_svm->snp_vmsa_update_on_init = false;
>>>> +
>>>> +	/* Interrupt injection mode shouldn't change for AP creation */
>>>> +	if (request < SVM_VMGEXIT_AP_DESTROY) {
>>>> +		u64 sev_features;
>>>> +
>>>> +		sev_features = vcpu->arch.regs[VCPU_REGS_RAX];
>>>> +		sev_features ^= sev->sev_features;
>>>> +		if (sev_features & SVM_SEV_FEATURES_INT_INJ_MODES) {
>>>
>>> Why is only INT_INJ_MODES checked?  The new comment in sev_es_sync_vmsa() explicitly
>>> states that sev_features are the same for all vCPUs, but that's not enforced here.
>>> At a bare minimum I would expect this to sanity check SVM_SEV_FEATURES_SNP_ACTIVE.
>>
>> That's because we can't really enforce it. The SEV_FEATURES value is part
>> of the VMSA, of which the hypervisor has no insight into (its encrypted).
>>
>> The interrupt injection mechanism was specifically requested as a sanity
>> check type of thing during the GHCB review, and as there were no
>> objections, it was added (see the end of section 4.1.9).
>>
>> I can definitely add the check for the SNP_ACTIVE bit, but it isn't required.
> 
> I'm confused.  If we've no insight into what the guest is actually using, what's
> the point of the INT_INJ_MODES check?

As I said, it was requested and there were no objections, all with the 
knowledge that the guest could "lie" about it. Maybe it is just to ensure 
the proper behavior of an "honest" guest? It makes for a small bit of 
extra work, but, yes, isn't very useful from a KVM perspective.

Thanks,
Tom

>
diff mbox series

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 117e2e08d7ed..881e05b3f74e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -91,6 +91,7 @@ 
 #define KVM_REQ_MSR_FILTER_CHANGED	KVM_ARCH_REQ(29)
 #define KVM_REQ_UPDATE_CPU_DIRTY_LOGGING \
 	KVM_ARCH_REQ_FLAGS(30, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE	KVM_ARCH_REQ(31)
 
 #define CR0_RESERVED_BITS                                               \
 	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
@@ -1402,6 +1403,8 @@  struct kvm_x86_ops {
 
 	int (*handle_rmp_page_fault)(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
 			int level, u64 error_code);
+
+	void (*update_protected_guest_state)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 5e72faa00cf2..6634a952563e 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -220,6 +220,9 @@  struct __attribute__ ((__packed__)) vmcb_control_area {
 #define SVM_SEV_FEATURES_DEBUG_SWAP		BIT(5)
 #define SVM_SEV_FEATURES_PREVENT_HOST_IBS	BIT(6)
 #define SVM_SEV_FEATURES_BTB_ISOLATION		BIT(7)
+#define SVM_SEV_FEATURES_INT_INJ_MODES			\
+	(SVM_SEV_FEATURES_RESTRICTED_INJECTION |	\
+	 SVM_SEV_FEATURES_ALTERNATE_INJECTION)
 
 struct vmcb_seg {
 	u16 selector;
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index d8ad6dd58c87..95f5d25b4f08 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -582,6 +582,7 @@  static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
 
 static int sev_es_sync_vmsa(struct vcpu_svm *svm)
 {
+	struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info;
 	struct sev_es_save_area *save = svm->vmsa;
 
 	/* Check some debug related fields before encrypting the VMSA */
@@ -625,6 +626,12 @@  static int sev_es_sync_vmsa(struct vcpu_svm *svm)
 	if (sev_snp_guest(svm->vcpu.kvm))
 		save->sev_features |= SVM_SEV_FEATURES_SNP_ACTIVE;
 
+	/*
+	 * Save the VMSA synced SEV features. For now, they are the same for
+	 * all vCPUs, so just save each time.
+	 */
+	sev->sev_features = save->sev_features;
+
 	return 0;
 }
 
@@ -2682,6 +2689,10 @@  static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
 		if (!ghcb_sw_scratch_is_valid(ghcb))
 			goto vmgexit_err;
 		break;
+	case SVM_VMGEXIT_AP_CREATION:
+		if (!ghcb_rax_is_valid(ghcb))
+			goto vmgexit_err;
+		break;
 	case SVM_VMGEXIT_NMI_COMPLETE:
 	case SVM_VMGEXIT_AP_HLT_LOOP:
 	case SVM_VMGEXIT_AP_JUMP_TABLE:
@@ -3395,6 +3406,121 @@  static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 	return ret;
 }
 
+void sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	kvm_pfn_t pfn;
+
+	mutex_lock(&svm->snp_vmsa_mutex);
+
+	vcpu->arch.mp_state = KVM_MP_STATE_UNINITIALIZED;
+
+	/* Clear use of the VMSA in the sev_es_init_vmcb() path */
+	svm->vmsa_pa = 0;
+
+	/* Clear use of the VMSA from the VMCB */
+	svm->vmcb->control.vmsa_pa = 0;
+
+	/* Un-pin previous VMSA */
+	if (svm->snp_vmsa_pfn) {
+		kvm_release_pfn_dirty(svm->snp_vmsa_pfn);
+		svm->snp_vmsa_pfn = 0;
+	}
+
+	if (svm->snp_vmsa_gpa) {
+		/* Validate that the GPA is page aligned */
+		if (!PAGE_ALIGNED(svm->snp_vmsa_gpa))
+			goto e_unlock;
+
+		/*
+		 * The VMSA is referenced by thy hypervisor physical address,
+		 * so retrieve the PFN and pin it.
+		 */
+		pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(svm->snp_vmsa_gpa));
+		if (is_error_pfn(pfn))
+			goto e_unlock;
+
+		svm->snp_vmsa_pfn = pfn;
+
+		/* Use the new VMSA in the sev_es_init_vmcb() path */
+		svm->vmsa_pa = pfn_to_hpa(pfn);
+		svm->vmcb->control.vmsa_pa = svm->vmsa_pa;
+
+		vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+	} else {
+		vcpu->arch.pv.pv_unhalted = false;
+		vcpu->arch.mp_state = KVM_MP_STATE_UNINITIALIZED;
+	}
+
+e_unlock:
+	mutex_unlock(&svm->snp_vmsa_mutex);
+}
+
+static void sev_snp_ap_creation(struct vcpu_svm *svm)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info;
+	struct kvm_vcpu *vcpu = &svm->vcpu;
+	struct kvm_vcpu *target_vcpu;
+	struct vcpu_svm *target_svm;
+	unsigned int request;
+	unsigned int apic_id;
+	bool kick;
+
+	request = lower_32_bits(svm->vmcb->control.exit_info_1);
+	apic_id = upper_32_bits(svm->vmcb->control.exit_info_1);
+
+	/* Validate the APIC ID */
+	target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, apic_id);
+	if (!target_vcpu)
+		return;
+
+	target_svm = to_svm(target_vcpu);
+
+	kick = true;
+
+	mutex_lock(&target_svm->snp_vmsa_mutex);
+
+	target_svm->snp_vmsa_gpa = 0;
+	target_svm->snp_vmsa_update_on_init = false;
+
+	/* Interrupt injection mode shouldn't change for AP creation */
+	if (request < SVM_VMGEXIT_AP_DESTROY) {
+		u64 sev_features;
+
+		sev_features = vcpu->arch.regs[VCPU_REGS_RAX];
+		sev_features ^= sev->sev_features;
+		if (sev_features & SVM_SEV_FEATURES_INT_INJ_MODES) {
+			vcpu_unimpl(vcpu, "vmgexit: invalid AP injection mode [%#lx] from guest\n",
+				    vcpu->arch.regs[VCPU_REGS_RAX]);
+			goto out;
+		}
+	}
+
+	switch (request) {
+	case SVM_VMGEXIT_AP_CREATE_ON_INIT:
+		kick = false;
+		target_svm->snp_vmsa_update_on_init = true;
+		fallthrough;
+	case SVM_VMGEXIT_AP_CREATE:
+		target_svm->snp_vmsa_gpa = svm->vmcb->control.exit_info_2;
+		break;
+	case SVM_VMGEXIT_AP_DESTROY:
+		break;
+	default:
+		vcpu_unimpl(vcpu, "vmgexit: invalid AP creation request [%#x] from guest\n",
+			    request);
+		break;
+	}
+
+out:
+	mutex_unlock(&target_svm->snp_vmsa_mutex);
+
+	if (kick) {
+		kvm_make_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, target_vcpu);
+		kvm_vcpu_kick(target_vcpu);
+	}
+}
+
 int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -3523,6 +3649,11 @@  int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
 		ret = 1;
 		break;
 	}
+	case SVM_VMGEXIT_AP_CREATION:
+		sev_snp_ap_creation(svm);
+
+		ret = 1;
+		break;
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
 		vcpu_unimpl(vcpu,
 			    "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
@@ -3597,6 +3728,8 @@  void sev_es_create_vcpu(struct vcpu_svm *svm)
 	set_ghcb_msr(svm, GHCB_MSR_SEV_INFO(GHCB_VERSION_MAX,
 					    GHCB_VERSION_MIN,
 					    sev_enc_bit));
+
+	mutex_init(&svm->snp_vmsa_mutex);
 }
 
 void sev_es_prepare_guest_switch(struct vcpu_svm *svm, unsigned int cpu)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 74bc635c9608..078a569c85a8 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1304,7 +1304,10 @@  static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	svm->spec_ctrl = 0;
 	svm->virt_spec_ctrl = 0;
 
-	if (!init_event) {
+	if (init_event && svm->snp_vmsa_update_on_init) {
+		svm->snp_vmsa_update_on_init = false;
+		sev_snp_update_protected_guest_state(vcpu);
+	} else {
 		vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE |
 				       MSR_IA32_APICBASE_ENABLE;
 		if (kvm_vcpu_is_reset_bsp(vcpu))
@@ -4588,6 +4591,8 @@  static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.write_page_begin = sev_snp_write_page_begin,
 
 	.handle_rmp_page_fault = snp_handle_rmp_page_fault,
+
+	.update_protected_guest_state = sev_snp_update_protected_guest_state,
 };
 
 static struct kvm_x86_init_ops svm_init_ops __initdata = {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 285d9b97b4d2..f9d25d944f26 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -60,18 +60,26 @@  struct kvm_sev_info {
 	bool active;		/* SEV enabled guest */
 	bool es_active;		/* SEV-ES enabled guest */
 	bool snp_active;	/* SEV-SNP enabled guest */
+
 	unsigned int asid;	/* ASID used for this guest */
 	unsigned int handle;	/* SEV firmware handle */
 	int fd;			/* SEV device fd */
+
 	unsigned long pages_locked; /* Number of pages locked */
 	struct list_head regions_list;  /* List of registered regions */
+
 	u64 ap_jump_table;	/* SEV-ES AP Jump Table address */
+
 	struct kvm *enc_context_owner; /* Owner of copied encryption context */
+
 	struct misc_cg *misc_cg; /* For misc cgroup accounting */
+
 	void *snp_context;      /* SNP guest context page */
 	void *snp_resp_page;	/* SNP guest response page */
 	struct ratelimit_state snp_guest_msg_rs; /* Rate limit the SNP guest message */
 	void *snp_certs_data;
+
+	u64 sev_features;	/* Features set at VMSA creation */
 };
 
 struct kvm_svm {
@@ -192,6 +200,11 @@  struct vcpu_svm {
 	bool guest_state_loaded;
 
 	u64 ghcb_registered_gpa;
+
+	struct mutex snp_vmsa_mutex;
+	gpa_t snp_vmsa_gpa;
+	kvm_pfn_t snp_vmsa_pfn;
+	bool snp_vmsa_update_on_init;	/* SEV-SNP AP Creation on INIT-SIPI */
 };
 
 struct svm_cpu_data {
@@ -555,7 +568,7 @@  void svm_vcpu_unblocking(struct kvm_vcpu *vcpu);
 #define GHCB_VERSION_MAX	2ULL
 #define GHCB_VERSION_MIN	1ULL
 
-#define GHCB_HV_FT_SUPPORTED	GHCB_HV_FT_SNP
+#define GHCB_HV_FT_SUPPORTED	(GHCB_HV_FT_SNP | GHCB_HV_FT_SNP_AP_CREATION)
 
 extern unsigned int max_sev_asid;
 
@@ -584,6 +597,7 @@  int sev_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level);
 void sev_snp_write_page_begin(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn);
 int snp_handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn,
 			      int level, u64 error_code);
+void sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu);
 
 /* vmenter.S */
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1398b8021982..e9fd59913bc2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9279,6 +9279,14 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
 			static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
+
+		if (kvm_check_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, vcpu)) {
+			kvm_x86_ops.update_protected_guest_state(vcpu);
+			if (vcpu->arch.mp_state != KVM_MP_STATE_RUNNABLE) {
+				r = 1;
+				goto out;
+			}
+		}
 	}
 
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
@@ -11236,7 +11244,8 @@  static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
 	if (!list_empty_careful(&vcpu->async_pf.done))
 		return true;
 
-	if (kvm_apic_has_events(vcpu))
+	if (kvm_apic_has_events(vcpu) ||
+	    kvm_test_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, vcpu))
 		return true;
 
 	if (vcpu->arch.pv.pv_unhalted)