
[RFC] KVM: x86: Support KVM VMs sharing SEV context

Message ID 20210224085915.28751-1-natet@google.com (mailing list archive)
State New, archived

Commit Message

Nathan Tempelman Feb. 24, 2021, 8:59 a.m. UTC
Add a capability for userspace to mirror SEV encryption context from
one vm to another. On our side, this is intended to support a
Migration Helper vCPU, but it can also be used generically to support
other in-guest workloads scheduled by the host. The intention is for
the primary guest and the mirror to have nearly identical memslots.

The primary benefits of this are that:
1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
can't accidentally clobber each other.
2) The VMs can have different memory-views, which is necessary for post-copy
migration (the migration vCPUs on the target need to read and write to
pages, when the primary guest would VMEXIT).

This does not change the threat model for AMD SEV. Any memory involved
is still owned by the primary guest and its initial state is still
attested to through the normal SEV_LAUNCH_* flows. If userspace wanted
to circumvent SEV, they could achieve the same effect by simply attaching
a vCPU to the primary VM.
This patch deliberately leaves userspace in charge of the memslots for the
mirror, as it already has the power to mess with them in the primary guest.

This patch does not support SEV-ES (much less SNP), as it does not
handle handing off attested VMSAs to the mirror.

For additional context, we need a Migration Helper because SEV PSP migration
is far too slow for our live migration on its own. Using an in-guest
migrator lets us speed this up significantly.

Signed-off-by: Nathan Tempelman <natet@google.com>
---
 Documentation/virt/kvm/api.rst  | 17 +++++++
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/svm/sev.c          | 82 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c          |  2 +
 arch/x86/kvm/svm/svm.h          |  2 +
 arch/x86/kvm/x86.c              |  7 ++-
 include/linux/kvm_host.h        |  1 +
 include/uapi/linux/kvm.h        |  1 +
 virt/kvm/kvm_main.c             |  8 ++++
 9 files changed, 120 insertions(+), 1 deletion(-)
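
For illustration, a rough sketch of how userspace might enable the
proposed capability once both VM file descriptors exist. Only the
capability number and the vm-ioctl dispatch come from the patch itself;
the helper name and error handling here are assumptions.

	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	#ifndef KVM_CAP_VM_COPY_ENC_CONTEXT_TO
	#define KVM_CAP_VM_COPY_ENC_CONTEXT_TO 194
	#endif

	/*
	 * Hypothetical helper: ask KVM to copy the primary VM's SEV encryption
	 * context (its ASID) into an already-created mirror VM.  The ioctl is
	 * issued on the primary VM fd and args[0] carries the mirror's VM fd.
	 */
	static int copy_enc_context_to(int primary_vm_fd, int mirror_vm_fd)
	{
		struct kvm_enable_cap cap = {
			.cap = KVM_CAP_VM_COPY_ENC_CONTEXT_TO,
			.args[0] = (unsigned long long)mirror_vm_fd,
		};

		if (ioctl(primary_vm_fd, KVM_ENABLE_CAP, &cap) < 0) {
			perror("KVM_ENABLE_CAP(VM_COPY_ENC_CONTEXT_TO)");
			return -1;
		}
		return 0;
	}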

Comments

Paolo Bonzini Feb. 24, 2021, 9:18 a.m. UTC | #1
[CCing Nathaniel McCallum]

On 24/02/21 09:59, Nathan Tempelman wrote:
> 
> +7.23 KVM_CAP_VM_COPY_ENC_CONTEXT_TO
> +-----------------------------------
> +
> +Architectures: x86 SEV enabled
> +Type: system

vm ioctl, not system (/dev/kvm).  But, see below.

> +Parameters: args[0] is the fd of the kvm to mirror encryption context to
> +Returns: 0 on success; ENOTTY on error
> +
> +This capability enables userspace to copy encryption context from a primary
> +vm to the vm indicated by the fd.
> +
> +This is intended to support in-guest workloads scheduled by the host. This
> +allows the in-guest workload to maintain its own NPTs and keeps the two vms
> +from accidentally clobbering each other with interrupts and the like (separate
> +APIC/MSRs/etc).

From purely an API design standpoint, I think I'd prefer a "set context 
from" API (the other way round) to match the existing KVM_SEV_INIT.

Apart from this, the code is very nice and I would have no issues 
merging this in 5.12 even after the merge window.

As an aside, do you happen to have SEV selftests at Google?  I would 
gladly volunteer to write the selftest myself for this ioctl given the 
infrastructure.

Thanks,

Paolo
Sean Christopherson Feb. 24, 2021, 4:58 p.m. UTC | #2
+Marc and Wooky

On Wed, Feb 24, 2021, Paolo Bonzini wrote:
> [CCing Nathaniel McCallum]

Ah, I assume Enarx can use this to share an asid across multiple workloads?
 
> On 24/02/21 09:59, Nathan Tempelman wrote:
> > 
> > +7.23 KVM_CAP_VM_COPY_ENC_CONTEXT_TO
> > +-----------------------------------
> > +
> > +Architectures: x86 SEV enabled
> > +Type: system
> 
> vm ioctl, not system (/dev/kvm).  But, see below.
> 
> > +Parameters: args[0] is the fd of the kvm to mirror encryption context to
> > +Returns: 0 on success; ENOTTY on error
> > +
> > +This capability enables userspace to copy encryption context from a primary
> > +vm to the vm indicated by the fd.
> > +
> > +This is intended to support in-guest workloads scheduled by the host. This
> > +allows the in-guest workload to maintain its own NPTs and keeps the two vms
> > +from accidentally clobbering each other with interrupts and the like (separate
> > +APIC/MSRs/etc).
> 
> From purely an API design standpoint, I think I'd prefer a "set context
> from" API (the other way round) to match the existing KVM_SEV_INIT.
> 
> Apart from this, the code is very nice and I would have no issues merging
> this in 5.12 even after the merge window.

And I don't think it would prevent us from refcounting the asid itself in the
future, e.g. to avoid pinning the "parent" KVM, which was my biggest hesitation
with the approach.

That being said, is there a strong need to get this into 5.12?  AIUI, this hasn't
had any meaningful testing, selftests/kvm-unit-tests or otherwise.  Pushing out
to 5.13 might give us a good chance of getting some real testing before merging,
depending on the readiness of SEV testing support.

> As an aside, do you happen to have SEV selftests at Google?  I would gladly
> volunteer to write the selftest myself for this ioctl given the
> infrastructure.

No, but I believe someone(s) from AMD is working on selftests.

On a related topic, Marc and Wooky are working on adapting kvm-unit-tests to run
SEV guests.  IIUC, the last open before posting an RFC is how best to make
kvm-unit-tests play nice with UEFI.
Sean Christopherson Feb. 24, 2021, 5:37 p.m. UTC | #3
On Wed, Feb 24, 2021, Nathan Tempelman wrote:
>  static bool __sev_recycle_asids(int min_asid, int max_asid)
>  {
> @@ -1124,6 +1129,10 @@ int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
>  	if (copy_from_user(&sev_cmd, argp, sizeof(struct kvm_sev_cmd)))
>  		return -EFAULT;
>  
> +	/* enc_context_owner handles all memory enc operations */
> +	if (is_mirroring_enc_context(kvm))
> +		return -ENOTTY;

Is this strictly necessary?  Honest question, as I don't know the hardware/PSP
flows well enough to understand how asids are tied to the state managed by the
PSP.

> +
>  	mutex_lock(&kvm->lock);
>  
>  	switch (sev_cmd.id) {
> @@ -1186,6 +1195,10 @@ int svm_register_enc_region(struct kvm *kvm,
>  	if (!sev_guest(kvm))
>  		return -ENOTTY;
>  
> +	/* If kvm is mirroring encryption context it isn't responsible for it */
> +	if (is_mirroring_enc_context(kvm))

Hmm, preventing the mirror from pinning memory only works if the two VMs are in
the same address space (process), which isn't guaranteed/enforced by the ioctl().
Obviously we could check and enforce that, but do we really need to?

Part of me thinks it would be better to treat the new ioctl() as a variant of
sev_guest_init(), i.e. purely make this a way to share asids.

> +		return -ENOTTY;
> +
>  	if (range->addr > ULONG_MAX || range->size > ULONG_MAX)
>  		return -EINVAL;
>  
> @@ -1252,6 +1265,10 @@ int svm_unregister_enc_region(struct kvm *kvm,
>  	struct enc_region *region;
>  	int ret;
>  
> +	/* If kvm is mirroring encryption context it isn't responsible for it */
> +	if (is_mirroring_enc_context(kvm))
> +		return -ENOTTY;
> +
>  	mutex_lock(&kvm->lock);
>  
>  	if (!sev_guest(kvm)) {
> @@ -1282,6 +1299,65 @@ int svm_unregister_enc_region(struct kvm *kvm,
>  	return ret;
>  }
>  
> +int svm_vm_copy_asid_to(struct kvm *kvm, unsigned int mirror_kvm_fd)
> +{
> +	struct file *mirror_kvm_file;
> +	struct kvm *mirror_kvm;
> +	struct kvm_sev_info *mirror_kvm_sev;

What about using src and dst, e.g. src_kvm, dest_kvm_fd, dest_kvm, etc...?  For
my brain, the mirror terminology adds an extra layer of translation.

> +	unsigned int asid;
> +	int ret;
> +
> +	if (!sev_guest(kvm))
> +		return -ENOTTY;
> +
> +	mutex_lock(&kvm->lock);
> +
> +	/* Mirrors of mirrors should work, but let's not get silly */

Do we really care?

> +	if (is_mirroring_enc_context(kvm)) {
> +		ret = -ENOTTY;
> +		goto failed;
> +	}
> +
> +	mirror_kvm_file = fget(mirror_kvm_fd);
> +	if (!kvm_is_kvm(mirror_kvm_file)) {
> +		ret = -EBADF;
> +		goto failed;
> +	}
> +
> +	mirror_kvm = mirror_kvm_file->private_data;
> +
> +	if (mirror_kvm == kvm || is_mirroring_enc_context(mirror_kvm)) {

This is_mirroring_enc_context() check needs to be after mirror_kvm->lock is
acquired, else there's a TOCTOU race.

I also suspect there needs to be more checks on the destination.  E.g. what
happens if the destination already has vCPUs that are currently running?  Though
on that front, sev_guest_init() also doesn't guard against this.  Feels like
that flow and this one should check kvm->created_vcpus.

> +		ret = -ENOTTY;
> +		fput(mirror_kvm_file);

Nit, probably worth adding a second error label to handle this fput(), e.g. in
case additional checks are needed in the future.  Actually, I suspect that's
already needed to fix the TOCTOU bug.
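
Putting those two suggestions together, a rough sketch of the resulting
ordering (the error labels and the created_vcpus check are illustrative
only and are not part of the posted patch; identifiers reuse the patch's
naming):

	mirror_kvm = mirror_kvm_file->private_data;
	if (mirror_kvm == kvm) {
		ret = -ENOTTY;
		goto put_file;
	}

	mutex_unlock(&kvm->lock);
	mutex_lock(&mirror_kvm->lock);

	/*
	 * Check the destination only while its lock is held, so the result
	 * can't change between the check and the assignment below (closing
	 * the TOCTOU window), and reject a destination that already has
	 * vCPUs.
	 */
	if (is_mirroring_enc_context(mirror_kvm) || mirror_kvm->created_vcpus) {
		ret = -ENOTTY;
		goto unlock_mirror;
	}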

> +		goto failed;
> +	}
> +
> +	asid = *&to_kvm_svm(kvm)->sev_info.asid;

Don't think "*&" is necessary. :-)

> +
> +	/*
> +	 * The mirror_kvm holds an enc_context_owner ref so its asid can't
> +	 * disappear until we're done with it
> +	 */
> +	kvm_get_kvm(kvm);

Do we really need/want to take a reference to the source 'struct kvm'?  IMO,
the so called mirror should never be doing operations with its source context,
i.e. should not have easy access to 'struct kvm'.  We already have a reference
to the fd, any reason not to use that to ensure liveliness of the source?

> +
> +	mutex_unlock(&kvm->lock);
> +	mutex_lock(&mirror_kvm->lock);
> +
> +	/* Set enc_context_owner and copy its encryption context over */
> +	mirror_kvm_sev = &to_kvm_svm(mirror_kvm)->sev_info;
> +	mirror_kvm_sev->enc_context_owner = kvm;
> +	mirror_kvm_sev->asid = asid;
> +	mirror_kvm_sev->active = true;

I would prefer a prep patch to move "INIT_LIST_HEAD(&sev->regions_list);" from
sev_guest_init() to when the VM is instantiated.  Shaving a few cycles in that
flow is meaningless, and not initializing the list of regions is odd, and will
cause problems if mirrors are allowed to pin memory (or do PSP commands).

> +
> +	mutex_unlock(&mirror_kvm->lock);
> +	fput(mirror_kvm_file);
> +	return 0;
> +
> +failed:
> +	mutex_unlock(&kvm->lock);
> +	return ret;
> +}
> +
>  void sev_vm_destroy(struct kvm *kvm)
>  {
>  	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> @@ -1293,6 +1369,12 @@ void sev_vm_destroy(struct kvm *kvm)
>  
>  	mutex_lock(&kvm->lock);
>  
> +	/* If this is a mirror_kvm release the enc_context_owner and skip sev cleanup */
> +	if (is_mirroring_enc_context(kvm)) {
> +		kvm_put_kvm(sev->enc_context_owner);
> +		return;
> +	}
> +
>  	/*
>  	 * Ensure that all guest tagged cache entries are flushed before
>  	 * releasing the pages back to the system for use. CLFLUSH will
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 42d4710074a6..5308b7f8c11c 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4608,6 +4608,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>  	.mem_enc_reg_region = svm_register_enc_region,
>  	.mem_enc_unreg_region = svm_unregister_enc_region,
>  
> +	.vm_copy_enc_context_to = svm_vm_copy_asid_to,
> +
>  	.can_emulate_instruction = svm_can_emulate_instruction,
>  
>  	.apic_init_signal_blocked = svm_apic_init_signal_blocked,
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 39e071fdab0c..1e65c912552d 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -65,6 +65,7 @@ struct kvm_sev_info {
>  	unsigned long pages_locked; /* Number of pages locked */
>  	struct list_head regions_list;  /* List of registered regions */
>  	u64 ap_jump_table;	/* SEV-ES AP Jump Table address */
> +	struct kvm *enc_context_owner; /* Owner of copied encryption context */
>  };
>  
>  struct kvm_svm {
> @@ -561,6 +562,7 @@ int svm_register_enc_region(struct kvm *kvm,
>  			    struct kvm_enc_region *range);
>  int svm_unregister_enc_region(struct kvm *kvm,
>  			      struct kvm_enc_region *range);
> +int svm_vm_copy_asid_to(struct kvm *kvm, unsigned int child_fd);
>  void pre_sev_run(struct vcpu_svm *svm, int cpu);
>  void __init sev_hardware_setup(void);
>  void sev_hardware_teardown(void);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 3fa140383f5d..7bbcf37fcc2b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3753,6 +3753,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_X86_USER_SPACE_MSR:
>  	case KVM_CAP_X86_MSR_FILTER:
>  	case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
> +	case KVM_CAP_VM_COPY_ENC_CONTEXT_TO:
>  		r = 1;
>  		break;
>  	case KVM_CAP_XEN_HVM:
> @@ -4649,7 +4650,6 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
>  			kvm_update_pv_runtime(vcpu);
>  
>  		return 0;
> -
>  	default:
>  		return -EINVAL;
>  	}
> @@ -5321,6 +5321,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  			kvm->arch.bus_lock_detection_enabled = true;
>  		r = 0;
>  		break;
> +	case KVM_CAP_VM_COPY_ENC_CONTEXT_TO:
> +		r = -ENOTTY;
> +		if (kvm_x86_ops.vm_copy_enc_context_to)
> +			r = kvm_x86_ops.vm_copy_enc_context_to(kvm, cap->args[0]);

This can be a static call.

On a related topic, does this really need to be a separate ioctl()?  TDX can't
share encryption contexts, everything that KVM can do for a TDX guest requires
the per-VM context.  Unless there is a known non-x86 use case, it might be
better to make this a mem_enc_op, and then it can be named SEV_SHARE_ASID or
something.

> +		return r;
>  	default:
>  		r = -EINVAL;
>  		break;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index e126ebda36d0..18491638f070 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -637,6 +637,7 @@ void kvm_exit(void);
>  
>  void kvm_get_kvm(struct kvm *kvm);
>  void kvm_put_kvm(struct kvm *kvm);
> +bool kvm_is_kvm(struct file *file);
>  void kvm_put_kvm_no_destroy(struct kvm *kvm);
>  
>  static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id)
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 63f8f6e95648..5b6296772db9 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1077,6 +1077,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_SYS_HYPERV_CPUID 191
>  #define KVM_CAP_DIRTY_LOG_RING 192
>  #define KVM_CAP_X86_BUS_LOCK_EXIT 193
> +#define KVM_CAP_VM_COPY_ENC_CONTEXT_TO 194
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 001b9de4e727..5f31fcda4777 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -739,6 +739,8 @@ void __weak kvm_arch_pre_destroy_vm(struct kvm *kvm)
>  {
>  }
>  
> +static struct file_operations kvm_vm_fops;

I'd probably prefer to put the helper just below kvm_vm_fops instead of adding
a forward declaration.  IMO it's not all that important to add the helper close
to kvm_get/put_kvm().

> +
>  static struct kvm *kvm_create_vm(unsigned long type)
>  {
>  	struct kvm *kvm = kvm_arch_alloc_vm();
> @@ -903,6 +905,12 @@ void kvm_put_kvm(struct kvm *kvm)
>  }
>  EXPORT_SYMBOL_GPL(kvm_put_kvm);
>  
> +bool kvm_is_kvm(struct file *file)

Heh, maybe kvm_file_is_kvm()?  or just file_is_kvm()?

> +{
> +	return file && file->f_op == &kvm_vm_fops;
> +}
> +EXPORT_SYMBOL_GPL(kvm_is_kvm);
> +
>  /*
>   * Used to put a reference that was taken on behalf of an object associated
>   * with a user-visible file descriptor, e.g. a vcpu or device, if installation
> -- 
> 2.30.0.617.g56c4b15f3c-goog
>
Steve Rutherford Feb. 25, 2021, 3:44 a.m. UTC | #4
On Wed, Feb 24, 2021 at 1:00 AM Nathan Tempelman <natet@google.com> wrote:
>
> @@ -1186,6 +1195,10 @@ int svm_register_enc_region(struct kvm *kvm,
>         if (!sev_guest(kvm))
>                 return -ENOTTY;
>
> +       /* If kvm is mirroring encryption context it isn't responsible for it */
> +       if (is_mirroring_enc_context(kvm))
> +               return -ENOTTY;
> +

Is this necessary? Same for unregister. When we looked at
sev_pin_memory, I believe we concluded that double pinning was safe.
>
>         if (range->addr > ULONG_MAX || range->size > ULONG_MAX)
>                 return -EINVAL;
>
> @@ -1252,6 +1265,10 @@ int svm_unregister_enc_region(struct kvm *kvm,
>         struct enc_region *region;
>         int ret;
>
> +       /* If kvm is mirroring encryption context it isn't responsible for it */
> +       if (is_mirroring_enc_context(kvm))
> +               return -ENOTTY;
> +
>         mutex_lock(&kvm->lock);
>
>         if (!sev_guest(kvm)) {
> @@ -1282,6 +1299,65 @@ int svm_unregister_enc_region(struct kvm *kvm,
>         return ret;
>  }
>
> +int svm_vm_copy_asid_to(struct kvm *kvm, unsigned int mirror_kvm_fd)
> +{
> +       struct file *mirror_kvm_file;
> +       struct kvm *mirror_kvm;
> +       struct kvm_sev_info *mirror_kvm_sev;
> +       unsigned int asid;
> +       int ret;
> +
> +       if (!sev_guest(kvm))
> +               return -ENOTTY;

You definitely don't want this: this is the function that turns the vm
into an SEV guest (marks SEV as active).

(Not an issue with this patch, but a broader issue) I believe
sev_guest lacks the necessary acquire/release barriers on sev->active,
since it's called without the kvm lock. I mean, it's x86, so the only
one that's going to hose you is the compiler for this type of access.
There should be an smp_rmb() after the access in sev_guest and an
smp_wmb() before the access in SEV_GUEST_INIT and here.
>
> +
> +       mutex_lock(&kvm->lock);
> +
> +       /* Mirrors of mirrors should work, but let's not get silly */
> +       if (is_mirroring_enc_context(kvm)) {
> +               ret = -ENOTTY;
> +               goto failed;
> +       }
> +
> +       mirror_kvm_file = fget(mirror_kvm_fd);
> +       if (!kvm_is_kvm(mirror_kvm_file)) {
> +               ret = -EBADF;
> +               goto failed;
> +       }
> +
> +       mirror_kvm = mirror_kvm_file->private_data;
> +
> +       if (mirror_kvm == kvm || is_mirroring_enc_context(mirror_kvm)) {
Just check if the source is an sev_guest and that the destination is
not an sev_guest.

I reviewed earlier incarnations of this, and think the high-level idea
is sound. I'd like to see kvm-selftests for this patch, and plan on
collaborating with AMD to help make those happen.
Steve Rutherford Feb. 25, 2021, 3:55 a.m. UTC | #5
On Wed, Feb 24, 2021 at 9:37 AM Sean Christopherson <seanjc@google.com> wrote:
> > +     unsigned int asid;
> > +     int ret;
> > +
> > +     if (!sev_guest(kvm))
> > +             return -ENOTTY;
> > +
> > +     mutex_lock(&kvm->lock);
> > +
> > +     /* Mirrors of mirrors should work, but let's not get silly */
>
> Do we really care?
Yes, unless you reparent mirrors of mirrors to the original ASID
owner. If you don't do that, I think userspace could pump a chain of
mirrors to blow the kernel stack when it closes the leaf vm, since you
could build up a chain of sev_vm_destroys. Refcounting the ASIDs
directly would also fix this.

Nate's early implementation did the reparenting, but I pushed for the
simplification since it made the locking a bit hairy.
>
> > +     if (is_mirroring_enc_context(kvm)) {
> > +             ret = -ENOTTY;
> > +             goto failed;
> > +     }
> > +
Paolo Bonzini Feb. 25, 2021, 11:26 a.m. UTC | #6
On 24/02/21 17:58, Sean Christopherson wrote:
> That being said, is there a strong need to get this into 5.12?  AIUI, this hasn't
> had any meaningful testing, selftests/kvm-unit-tests or otherwise.  Pushing out
> to 5.13 might give us a good chance of getting some real testing before merging,
> depending on the readiness of SEV testing support.

Note that I don't mean including this in the merge window.  However, I 
know that there are multiple people working on alternative SEV live 
migration support, and as long as we are sure that the API is simple and 
useful, it should be okay to merge this for rc2 or rc3.

Paolo
Tom Lendacky Feb. 25, 2021, 2:57 p.m. UTC | #7
On 2/24/21 9:44 PM, Steve Rutherford wrote:
> On Wed, Feb 24, 2021 at 1:00 AM Nathan Tempelman <natet@google.com> wrote:
>>
>> @@ -1186,6 +1195,10 @@ int svm_register_enc_region(struct kvm *kvm,
>>          if (!sev_guest(kvm))
>>                  return -ENOTTY;
>>
>> +       /* If kvm is mirroring encryption context it isn't responsible for it */
>> +       if (is_mirroring_enc_context(kvm))
>> +               return -ENOTTY;
>> +
> 
> Is this necessary? Same for unregister. When we looked at
> sev_pin_memory, I believe we concluded that double pinning was safe.
>>
>>          if (range->addr > ULONG_MAX || range->size > ULONG_MAX)
>>                  return -EINVAL;
>>
>> @@ -1252,6 +1265,10 @@ int svm_unregister_enc_region(struct kvm *kvm,
>>          struct enc_region *region;
>>          int ret;
>>
>> +       /* If kvm is mirroring encryption context it isn't responsible for it */
>> +       if (is_mirroring_enc_context(kvm))
>> +               return -ENOTTY;
>> +
>>          mutex_lock(&kvm->lock);
>>
>>          if (!sev_guest(kvm)) {
>> @@ -1282,6 +1299,65 @@ int svm_unregister_enc_region(struct kvm *kvm,
>>          return ret;
>>   }
>>
>> +int svm_vm_copy_asid_to(struct kvm *kvm, unsigned int mirror_kvm_fd)
>> +{
>> +       struct file *mirror_kvm_file;
>> +       struct kvm *mirror_kvm;
>> +       struct kvm_sev_info *mirror_kvm_sev;
>> +       unsigned int asid;
>> +       int ret;
>> +
>> +       if (!sev_guest(kvm))
>> +               return -ENOTTY;
> 
> You definitely don't want this: this is the function that turns the vm
> into an SEV guest (marks SEV as active).

The sev_guest() function does not set sev->active, it only checks it. The 
sev_guest_init() function is where sev->active is set.

> 
> (Not an issue with this patch, but a broader issue) I believe
> sev_guest lacks the necessary acquire/release barriers on sev->active,

The svm_mem_enc_op() takes the kvm lock and that is the only way into the 
sev_guest_init() function where sev->active is set.

Thanks,
Tom

> since it's called without the kvm lock. I mean, it's x86, so the only
> one that's going to hose you is the compiler for this type of access.
> There should be an smp_rmb() after the access in sev_guest and an
> smp_wmb() before the access in SEV_GUEST_INIT and here.
>>
>> +
>> +       mutex_lock(&kvm->lock);
>> +
>> +       /* Mirrors of mirrors should work, but let's not get silly */
>> +       if (is_mirroring_enc_context(kvm)) {
>> +               ret = -ENOTTY;
>> +               goto failed;
>> +       }
>> +
>> +       mirror_kvm_file = fget(mirror_kvm_fd);
>> +       if (!kvm_is_kvm(mirror_kvm_file)) {
>> +               ret = -EBADF;
>> +               goto failed;
>> +       }
>> +
>> +       mirror_kvm = mirror_kvm_file->private_data;
>> +
>> +       if (mirror_kvm == kvm || is_mirroring_enc_context(mirror_kvm)) {
> Just check if the source is an sev_guest and that the destination is
> not an sev_guest.
> 
> I reviewed earlier incarnations of this, and think the high-level idea
> is sound. I'd like to see kvm-selftests for this patch, and plan on
> collaborating with AMD to help make those happen.
>
James Bottomley Feb. 25, 2021, 5:53 p.m. UTC | #8
> Add a capability for userspace to mirror SEV encryption context from
> one vm to another. On our side, this is intended to support a
> Migration Helper vCPU, but it can also be used generically to support
> other in-guest workloads scheduled by the host. The intention is for
> the primary guest and the mirror to have nearly identical memslots.

So this causes a cloned VM that you can boot up another CPU into but
the boot path must have been already present?  In essence we've already
been thinking about something like this to get migration running inside
OVMF:

https://lore.kernel.org/qemu-devel/8b824c44-6a51-c3a7-6596-921dc47fea39@linux.ibm.com/

It sounds like this mechanism can be used to boot a vCPU through a
mirror VM after the fact, which is very compatible with the above whose
mechanism is  simply to steal a VCPU to hold in reset until it's
activated.  However, you haven't published how you activate the entity
inside the VM ... do you have patches for this so we can see the
internal capture mechanism and mirror VM boot path?

> The primary benefits of this are that:
> 1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
> can't accidentally clobber each other.
> 2) The VMs can have different memory-views, which is necessary for
> post-copy migration (the migration vCPUs on the target need to read
> and write to pages, when the primary guest would VMEXIT).
> 
> This does not change the threat model for AMD SEV. Any memory
> involved is still owned by the primary guest and its initial state is
> still attested to through the normal SEV_LAUNCH_* flows. If userspace
> wanted to circumvent SEV, they could achieve the same effect by
> simply attaching a vCPU to the primary VM.
> This patch deliberately leaves userspace in charge of the memslots
> for the mirror, as it already has the power to mess with them in the
> primary guest.

Well it does alter the threat model in that previously the
configuration, including the CPU configuration, was fixed after launch
and attestation.  Now the CSP can alter the configuration via a mirror.
I'm not sure I have a threat for this, but it definitely alters the
model.

> This patch does not support SEV-ES (much less SNP), as it does not
> handle handing off attested VMSAs to the mirror.

One of the reasons for doing the sequestered vcpu is that -ES and -SNP
require the initial CPU state to be part of the attestation, so with
them you can't add CPU state after the fact.  I think you could use
this model if you declare the vCPU in the mirror in the initial
attested VMSA, but that's conjecture at this stage.

> For additional context, we need a Migration Helper because SEV PSP
> migration is far too slow for our live migration on its own. Using an
> in-guest migrator lets us speed this up significantly.

We have the same problem here at IBM, hence the RFC referred to above.

James
Kalra, Ashish Feb. 25, 2021, 6:18 p.m. UTC | #9
>> > For additional context, we need a Migration Helper because SEV PSP
>> > migration is far too slow for our live migration on its own. Using an
>> > in-guest migrator lets us speed this up significantly.
>> 
>> We have the same problem here at IBM, hence the RFC referred to above.
>> 

I do believe that some of these alternative SEV live migration support
or Migration helper (MH) solutions will still use SEV PSP migration for 
migrating the MH itself, therefore the SEV live migration patches
(currently v10 posted upstream) still make sense and will be used.

Thanks,
Ashish
Steve Rutherford Feb. 25, 2021, 6:49 p.m. UTC | #10
On Thu, Feb 25, 2021 at 6:57 AM Tom Lendacky <thomas.lendacky@amd.com> wrote:
> >> +int svm_vm_copy_asid_to(struct kvm *kvm, unsigned int mirror_kvm_fd)
> >> +{
> >> +       struct file *mirror_kvm_file;
> >> +       struct kvm *mirror_kvm;
> >> +       struct kvm_sev_info *mirror_kvm_sev;
> >> +       unsigned int asid;
> >> +       int ret;
> >> +
> >> +       if (!sev_guest(kvm))
> >> +               return -ENOTTY;
> >
> > You definitely don't want this: this is the function that turns the vm
> > into an SEV guest (marks SEV as active).
>
> The sev_guest() function does not set sev->active, it only checks it. The
> sev_guest_init() function is where sev->active is set.
Sorry, bad use of the english on my part: the "this" was referring to
svm_vm_copy_asid_to. Right now, you could only pass this sev_guest
check if you had already called sev_guest_init, which seems incorrect.
>
> >
> > (Not an issue with this patch, but a broader issue) I believe
> > sev_guest lacks the necessary acquire/release barriers on sev->active,
>
> The svm_mem_enc_op() takes the kvm lock and that is the only way into the
> sev_guest_init() function where sev->active is set.
There are a few places that check sev->active which don't have the kvm
lock, which is not problematic if we add in a few compiler barriers
(ala irqchip_split et al).
>
> Thanks,
> Tom

Thanks,
Steve
Paolo Bonzini Feb. 25, 2021, 8:33 p.m. UTC | #11
On 25/02/21 19:18, Ashish Kalra wrote:
> I do believe that some of these alternative SEV live migration support
> or Migration helper (MH) solutions will still use SEV PSP migration for
> migrating the MH itself, therefore the SEV live migration patches
> (currently v10 posted upstream) still make sense and will be used.

I think that since the migration helper (at least for SEV, not -ES) is 
part of the attested firmware, it can be started on the destination 
before the rest of the VM.

Paolo
Paolo Bonzini Feb. 25, 2021, 8:36 p.m. UTC | #12
On 25/02/21 18:53, James Bottomley wrote:
> 
> https://lore.kernel.org/qemu-devel/8b824c44-6a51-c3a7-6596-921dc47fea39@linux.ibm.com/
> 
> It sounds like this mechanism can be used to boot a vCPU through a
> mirror VM after the fact, which is very compatible with the above whose
> mechanism is  simply to steal a VCPU to hold in reset until it's
> activated.

Yes, and it's much cleaner because, for example, the extra vCPU need not 
participate in the ACPI hotplug stuff and can even use a simplified run 
loop.

Paolo
Kalra, Ashish Feb. 26, 2021, 1:30 p.m. UTC | #13
On Thu, Feb 25, 2021 at 09:33:09PM +0100, Paolo Bonzini wrote:
> On 25/02/21 19:18, Ashish Kalra wrote:
> > I do believe that some of these alternative SEV live migration support
> > or Migration helper (MH) solutions will still use SEV PSP migration for
> > migrating the MH itself, therefore the SEV live migration patches
> > (currently v10 posted upstream) still make sense and will be used.
> 
> I think that since the migration helper (at least for SEV, not -ES) is part
> of the attested firmware, it can be started on the destination before the
> rest of the VM.
> 

We may also need to transport VMSA(s) securely using the PSP migration
support for SEV-ES live migration with MH.

Thanks,
Ashish
Kalra, Ashish March 5, 2021, 2:04 p.m. UTC | #14
On Wed, Feb 24, 2021 at 08:59:15AM +0000, Nathan Tempelman wrote:
> Add a capability for userspace to mirror SEV encryption context from
> one vm to another. On our side, this is intended to support a
> Migration Helper vCPU, but it can also be used generically to support
> other in-guest workloads scheduled by the host. The intention is for
> the primary guest and the mirror to have nearly identical memslots.
> 
> The primary benefits of this are that:
> 1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
> can't accidentally clobber each other.
> 2) The VMs can have different memory-views, which is necessary for post-copy
> migration (the migration vCPUs on the target need to read and write to
> pages, when the primary guest would VMEXIT).
> 
> This does not change the threat model for AMD SEV. Any memory involved
> is still owned by the primary guest and its initial state is still
> attested to through the normal SEV_LAUNCH_* flows. If userspace wanted
> to circumvent SEV, they could achieve the same effect by simply attaching
> a vCPU to the primary VM.
> This patch deliberately leaves userspace in charge of the memslots for the
> mirror, as it already has the power to mess with them in the primary guest.
> 
> This patch does not support SEV-ES (much less SNP), as it does not
> handle handing off attested VMSAs to the mirror.
> 
> For additional context, we need a Migration Helper because SEV PSP migration
> is far too slow for our live migration on its own. Using an in-guest
> migrator lets us speed this up significantly.
> 
> Signed-off-by: Nathan Tempelman <natet@google.com>
> ---
>  Documentation/virt/kvm/api.rst  | 17 +++++++
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/svm/sev.c          | 82 +++++++++++++++++++++++++++++++++
>  arch/x86/kvm/svm/svm.c          |  2 +
>  arch/x86/kvm/svm/svm.h          |  2 +
>  arch/x86/kvm/x86.c              |  7 ++-
>  include/linux/kvm_host.h        |  1 +
>  include/uapi/linux/kvm.h        |  1 +
>  virt/kvm/kvm_main.c             |  8 ++++
>  9 files changed, 120 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 482508ec7cc4..438b647663c9 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6213,6 +6213,23 @@ the bus lock vm exit can be preempted by a higher priority VM exit, the exit
>  notifications to userspace can be KVM_EXIT_BUS_LOCK or other reasons.
>  KVM_RUN_BUS_LOCK flag is used to distinguish between them.
>  
> +7.23 KVM_CAP_VM_COPY_ENC_CONTEXT_TO
> +-----------------------------------
> +
> +Architectures: x86 SEV enabled
> +Type: system
> +Parameters: args[0] is the fd of the kvm to mirror encryption context to
> +Returns: 0 on success; ENOTTY on error
> +
> +This capability enables userspace to copy encryption context from a primary
> +vm to the vm indicated by the fd.
> +
> +This is intended to support in-guest workloads scheduled by the host. This
> +allows the in-guest workload to maintain its own NPTs and keeps the two vms
> +from accidentally clobbering each other with interrupts and the like (separate
> +APIC/MSRs/etc).
> +
> +
>  8. Other capabilities.
>  ======================
>  
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 84499aad01a4..b7636c009647 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1334,6 +1334,7 @@ struct kvm_x86_ops {
>  	int (*mem_enc_op)(struct kvm *kvm, void __user *argp);
>  	int (*mem_enc_reg_region)(struct kvm *kvm, struct kvm_enc_region *argp);
>  	int (*mem_enc_unreg_region)(struct kvm *kvm, struct kvm_enc_region *argp);
> +	int (*vm_copy_enc_context_to)(struct kvm *kvm, unsigned int child_fd);
>  
>  	int (*get_msr_feature)(struct kvm_msr_entry *entry);
>  
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 874ea309279f..2bad6cd2cb4c 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -66,6 +66,11 @@ static int sev_flush_asids(void)
>  	return ret;
>  }
>  
> +static inline bool is_mirroring_enc_context(struct kvm *kvm)
> +{
> +	return &to_kvm_svm(kvm)->sev_info.enc_context_owner;
> +}
> +
>  /* Must be called with the sev_bitmap_lock held */
>  static bool __sev_recycle_asids(int min_asid, int max_asid)
>  {
> @@ -1124,6 +1129,10 @@ int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
>  	if (copy_from_user(&sev_cmd, argp, sizeof(struct kvm_sev_cmd)))
>  		return -EFAULT;
>  
> +	/* enc_context_owner handles all memory enc operations */
> +	if (is_mirroring_enc_context(kvm))
> +		return -ENOTTY;
> +
>  	mutex_lock(&kvm->lock);
>  
>  	switch (sev_cmd.id) {
> @@ -1186,6 +1195,10 @@ int svm_register_enc_region(struct kvm *kvm,
>  	if (!sev_guest(kvm))
>  		return -ENOTTY;
>  
> +	/* If kvm is mirroring encryption context it isn't responsible for it */
> +	if (is_mirroring_enc_context(kvm))
> +		return -ENOTTY;
> +
>  	if (range->addr > ULONG_MAX || range->size > ULONG_MAX)
>  		return -EINVAL;
>  
> @@ -1252,6 +1265,10 @@ int svm_unregister_enc_region(struct kvm *kvm,
>  	struct enc_region *region;
>  	int ret;
>  
> +	/* If kvm is mirroring encryption context it isn't responsible for it */
> +	if (is_mirroring_enc_context(kvm))
> +		return -ENOTTY;
> +
>  	mutex_lock(&kvm->lock);
>  
>  	if (!sev_guest(kvm)) {
> @@ -1282,6 +1299,65 @@ int svm_unregister_enc_region(struct kvm *kvm,
>  	return ret;
>  }
>  
> +int svm_vm_copy_asid_to(struct kvm *kvm, unsigned int mirror_kvm_fd)
> +{
> +	struct file *mirror_kvm_file;
> +	struct kvm *mirror_kvm;
> +	struct kvm_sev_info *mirror_kvm_sev;
> +	unsigned int asid;
> +	int ret;
> +
> +	if (!sev_guest(kvm))
> +		return -ENOTTY;
> +
> +	mutex_lock(&kvm->lock);
> +
> +	/* Mirrors of mirrors should work, but let's not get silly */
> +	if (is_mirroring_enc_context(kvm)) {
> +		ret = -ENOTTY;
> +		goto failed;
> +	}

How will A->B->C->... type of live migration work if mirrors of
mirrors are not supported ?

Thanks,
Ashish
Paolo Bonzini March 5, 2021, 3:13 p.m. UTC | #15
On 05/03/21 15:04, Ashish Kalra wrote:
>> +	/* Mirrors of mirrors should work, but let's not get silly */
>> +	if (is_mirroring_enc_context(kvm)) {
>> +		ret = -ENOTTY;
>> +		goto failed;
>> +	}
> How will A->B->C->... type of live migration work if mirrors of
> mirrors are not supported ?

Each host would only run one main VM and one mirror, wouldn't it?

Paolo
Nathan Tempelman March 5, 2021, 8:43 p.m. UTC | #16
On Fri, Mar 5, 2021 at 7:13 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 05/03/21 15:04, Ashish Kalra wrote:
> >> +    /* Mirrors of mirrors should work, but let's not get silly */
> >> +    if (is_mirroring_enc_context(kvm)) {
> >> +            ret = -ENOTTY;
> >> +            goto failed;
> >> +    }
> > How will A->B->C->... type of live migration work if mirrors of
> > mirrors are not supported ?
>
> Each host would only run one main VM and one mirror, wouldn't it?

That's correct. You could create a second mirror vm of the original
(A->B, A->C) if you needed two in-guest workers, but I don't see a use
for a chain. If anyone can see one I can write it that way, but in the
interest of keeping it simple I've blocked it. Originally I'd built it
with that functionality, but allowing a chain like that smells like
recursion, and from what I understand we don't like recursion in the
kernel. There's also the fear, as Steve mentioned, that we could blow
the callstack with a long chain of destroys starting from the leaf.
Ideally we give userspace one less gun to shoot itself in the foot with.
Kalra, Ashish March 5, 2021, 10:36 p.m. UTC | #17
On Thu, Feb 25, 2021 at 10:49:00AM -0800, Steve Rutherford wrote:
> On Thu, Feb 25, 2021 at 6:57 AM Tom Lendacky <thomas.lendacky@amd.com> wrote:
> > >> +int svm_vm_copy_asid_to(struct kvm *kvm, unsigned int mirror_kvm_fd)
> > >> +{
> > >> +       struct file *mirror_kvm_file;
> > >> +       struct kvm *mirror_kvm;
> > >> +       struct kvm_sev_info *mirror_kvm_sev;
> > >> +       unsigned int asid;
> > >> +       int ret;
> > >> +
> > >> +       if (!sev_guest(kvm))
> > >> +               return -ENOTTY;
> > >
> > > You definitely don't want this: this is the function that turns the vm
> > > into an SEV guest (marks SEV as active).
> >
> > The sev_guest() function does not set sev->active, it only checks it. The
> > sev_guest_init() function is where sev->active is set.
> Sorry, bad use of the english on my part: the "this" was referring to
> svm_vm_copy_asid_to. Right now, you could only pass this sev_guest
> check if you had already called sev_guest_init, which seems incorrect.
> >
> > >
> > > (Not an issue with this patch, but a broader issue) I believe
> > > sev_guest lacks the necessary acquire/release barriers on sev->active,
> >
> > The svm_mem_enc_op() takes the kvm lock and that is the only way into the
> > sev_guest_init() function where sev->active is set.
> There are a few places that check sev->active which don't have the kvm
> lock, which is not problematic if we add in a few compiler barriers
> (ala irqchip_split et al).

Probably, sev->active accesses can be made safe using READ_ONCE() &
WRITE_ONCE().

Thanks,
Ashish
Sean Christopherson March 9, 2021, 5:45 p.m. UTC | #18
On Fri, Mar 05, 2021, Ashish Kalra wrote:
> On Thu, Feb 25, 2021 at 10:49:00AM -0800, Steve Rutherford wrote:
> > On Thu, Feb 25, 2021 at 6:57 AM Tom Lendacky <thomas.lendacky@amd.com> wrote:
> > > >> +int svm_vm_copy_asid_to(struct kvm *kvm, unsigned int mirror_kvm_fd)
> > > >> +{
> > > >> +       struct file *mirror_kvm_file;
> > > >> +       struct kvm *mirror_kvm;
> > > >> +       struct kvm_sev_info *mirror_kvm_sev;
> > > >> +       unsigned int asid;
> > > >> +       int ret;
> > > >> +
> > > >> +       if (!sev_guest(kvm))
> > > >> +               return -ENOTTY;
> > > >
> > > > You definitely don't want this: this is the function that turns the vm
> > > > into an SEV guest (marks SEV as active).
> > >
> > > The sev_guest() function does not set sev->active, it only checks it. The
> > > sev_guest_init() function is where sev->active is set.
> > Sorry, bad use of the english on my part: the "this" was referring to
> > svm_vm_copy_asid_to. Right now, you could only pass this sev_guest
> > check if you had already called sev_guest_init, which seems incorrect.
> > >
> > > >
> > > > (Not an issue with this patch, but a broader issue) I believe
> > > > sev_guest lacks the necessary acquire/release barriers on sev->active,
> > >
> > > The svm_mem_enc_op() takes the kvm lock and that is the only way into the
> > > sev_guest_init() function where sev->active is set.
> > There are a few places that check sev->active which don't have the kvm
> > lock, which is not problematic if we add in a few compiler barriers
> > (ala irqchip_split et al).

Eh, I don't see the point in taking on the complexity of barriers.  Ignoring the
vCPU behavior, the only existing call that isn't safe is svm_register_enc_region().
Fixing that is trivial and easy to understand.

As for the vCPU stuff, adding barriers will not make them safe.  E.g. a barrier
won't magically make init_vmcb() go back in time and set SVM_NESTED_CTL_SEV_ENABLE
if SEV is enabled after vCPUs are created.

> Probably, sev->active accesses can be made safe using READ_ONCE() &
> WRITE_ONCE().
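
For reference, the pattern being proposed would look roughly like the
following. This is a sketch only, based on the existing sev_guest()
helper and sev_guest_init() as discussed above; Sean argues that the
current locking makes it unnecessary for most paths.

	static bool sev_guest(struct kvm *kvm)
	{
	#ifdef CONFIG_KVM_AMD_SEV
		struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;

		/* Pairs with the WRITE_ONCE() in sev_guest_init() */
		return READ_ONCE(sev->active);
	#else
		return false;
	#endif
	}

	/* ... and in sev_guest_init(), where sev->active is currently set: */
		WRITE_ONCE(sev->active, true);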
Tobin Feldman-Fitzthum March 11, 2021, 3:30 p.m. UTC | #19
On 24/02/21 09:59, Nathan Tempelman wrote:

> Add a capability for userspace to mirror SEV encryption context from
> one vm to another. On our side, this is intended to support a
> Migration Helper vCPU, but it can also be used generically to support
> other in-guest workloads scheduled by the host. The intention is for
> the primary guest and the mirror to have nearly identical memslots.
>
> The primary benefits of this are that:
> 1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
> can't accidentally clobber each other.
> 2) The VMs can have different memory-views, which is necessary for post-copy
> migration (the migration vCPUs on the target need to read and write to
> pages, when the primary guest would VMEXIT).
>
> This does not change the threat model for AMD SEV. Any memory involved
> is still owned by the primary guest and its initial state is still
> attested to through the normal SEV_LAUNCH_* flows. If userspace wanted
> to circumvent SEV, they could achieve the same effect by simply attaching
> a vCPU to the primary VM.
> This patch deliberately leaves userspace in charge of the memslots for the
> mirror, as it already has the power to mess with them in the primary guest.
>
> This patch does not support SEV-ES (much less SNP), as it does not
> handle handing off attested VMSAs to the mirror.
>
> For additional context, we need a Migration Helper because SEV PSP migration
> is far too slow for our live migration on its own. Using an in-guest
> migrator lets us speed this up significantly.
Hello,

We've been thinking a lot about migrating confidential virtual machines 
at IBM. Maybe you've seen the approach that we (Dov Murik and myself) 
shared on the QEMU and OVMF mailing lists. In general, we have tried to 
implement migration without kernel support, which has some drawbacks. 
Mainly, it is difficult to dynamically start the migration handler 
without kernel support, which puts stress on OVMF. If there is momentum 
behind these KVM patches, we think they could go hand-in-hand with some 
of the work that we have done.

I'm not sure if you have patches for a migration handler/helper or 
hypervisor support. If you do, I'd be curious to see them. If not, maybe 
we should try to converge some of the work that has already happened. I 
think that no matter where the migration handler ends up running or how 
it is started, it will do more or less the same things: export pages to 
the HV and import pages from the HV. Similarly, the hypervisor is 
probably going to need similar mechanisms to ask the MH for encrypted 
pages. Given that we already have some of these things,  maybe there is 
a way to bring them together with this patch.

I also have a few specific questions about this patch.

I am not sure how the mirror VM will be supported in QEMU. Usually there 
is one QEMU process per-vm. Now we would need to run a second VM and 
communicate with it during migration. Is there a way to do this without 
adding significant complexity?

You say that SEV-ES is not supported. While there are challenges 
regarding setting the CPU state of the mirror, I think there may also be 
larger issues with using the mirror for -ES. With plain SEV, the 
migration handler only has to worry about guest memory. With SEV-ES the 
MH will probably need to set the CPU state of the guest as well. It 
seems difficult to do this with an MH that is in a separate VM entirely. 
Is there an expectation that the mirror-based approach will ever work 
with SEV-ES?

I am curious where you plan on putting the migration handler itself. We 
were drawn to OVMF because it is measured by the PSP. Do you have some 
alternate approach?

Do you plan to support consecutive migrations (target of first migration 
is source of second)? This is really just a question about the lifetime 
of the MH. Will the mirror VM be started and stopped dynamically or will 
it persist for the life of the guest on both source and target?

Finally, do you plan to use AMD PSP-based migration to migrate parts of 
the mirror VM or of the primary VM? The migration handler we've 
developed does not use PSP-based migration at all; instead it relies on 
secret injection to both source and target VMs to keep the migration 
keys secure. There are trade-offs either way.

-Tobin
Paolo Bonzini March 11, 2021, 4:29 p.m. UTC | #20
On 11/03/21 16:30, Tobin Feldman-Fitzthum wrote:
> I am not sure how the mirror VM will be supported in QEMU. Usually there 
> is one QEMU process per-vm. Now we would need to run a second VM and 
> communicate with it during migration. Is there a way to do this without 
> adding significant complexity?

I can answer this part.  I think this will actually be simpler than with 
auxiliary vCPUs.  There will be a separate pair of VM+vCPU file 
descriptors within the same QEMU process, and some code to set up the 
memory map using KVM_SET_USER_MEMORY_REGION.

However, the code to run this VM will be very small as the VM does not 
have to do MMIO, interrupts, live migration (of itself), etc.  It just 
starts up and communicates with QEMU using a mailbox at a predetermined 
address.

I also think (but I'm not 100% sure) that the auxiliary VM does not have 
to watch changes in the primary VM's memory map (e.g. mapping and 
unmapping of BARs).  In QEMU terms, the auxiliary VM's memory map tracks 
RAMBlocks, not MemoryRegions, which makes things much simpler.

There are already many examples of mini VMMs running special purpose VMs 
in the kernel's tools/testing/selftests/kvm directory, and I don't think 
the QEMU code would be any more complex than that.

Paolo
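
A minimal sketch of that shape, assuming the primary guest's RAM is
already mapped in the process and eliding all error handling (the helper
name and slot layout are assumptions, not an existing API):

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/*
	 * Create the auxiliary VM and give it a single memslot aliasing the
	 * primary guest's RAM at the same host virtual address.
	 */
	static int create_aux_vm(int kvm_fd, void *primary_ram,
				 unsigned long long ram_size, int *aux_vcpu_fd)
	{
		struct kvm_userspace_memory_region region = {
			.slot = 0,
			.guest_phys_addr = 0,
			.memory_size = ram_size,
			.userspace_addr = (unsigned long long)primary_ram,
		};
		int aux_vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);

		ioctl(aux_vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
		*aux_vcpu_fd = ioctl(aux_vm_fd, KVM_CREATE_VCPU, 0);

		/*
		 * The SEV context would then be copied into aux_vm_fd via the
		 * proposed capability, and the vCPU run in a simplified loop.
		 */
		return aux_vm_fd;
	}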
Nathan Tempelman March 12, 2021, 11:47 p.m. UTC | #21
On Wed, Feb 24, 2021 at 9:37 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Feb 24, 2021, Nathan Tempelman wrote:
> >  static bool __sev_recycle_asids(int min_asid, int max_asid)
> >  {
> > @@ -1124,6 +1129,10 @@ int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
> >       if (copy_from_user(&sev_cmd, argp, sizeof(struct kvm_sev_cmd)))
> >               return -EFAULT;
> >
> > +     /* enc_context_owner handles all memory enc operations */
> > +     if (is_mirroring_enc_context(kvm))
> > +             return -ENOTTY;
>
> Is this strictly necessary?  Honest question, as I don't know the hardware/PSP
> flows well enough to understand how asids are tied to the state managed by the
> PSP.
>
> > +
> >       mutex_lock(&kvm->lock);
> >
> >       switch (sev_cmd.id) {
> > @@ -1186,6 +1195,10 @@ int svm_register_enc_region(struct kvm *kvm,
> >       if (!sev_guest(kvm))
> >               return -ENOTTY;
> >
> > +     /* If kvm is mirroring encryption context it isn't responsible for it */
> > +     if (is_mirroring_enc_context(kvm))
>
> Hmm, preventing the mirror from pinning memory only works if the two VMs are in
> the same address space (process), which isn't guaranteed/enforced by the ioctl().
> Obviously we could check and enforce that, but do we really need to?
>
> Part of me thinks it would be better to treat the new ioctl() as a variant of
> sev_guest_init(), i.e. purely make this a way to share asids.
>
> > +             return -ENOTTY;
> > +
> >       if (range->addr > ULONG_MAX || range->size > ULONG_MAX)
> >               return -EINVAL;
> >
> > @@ -1252,6 +1265,10 @@ int svm_unregister_enc_region(struct kvm *kvm,
> >       struct enc_region *region;
> >       int ret;
> >
> > +     /* If kvm is mirroring encryption context it isn't responsible for it */
> > +     if (is_mirroring_enc_context(kvm))
> > +             return -ENOTTY;
> > +
> >       mutex_lock(&kvm->lock);
> >
> >       if (!sev_guest(kvm)) {
> > @@ -1282,6 +1299,65 @@ int svm_unregister_enc_region(struct kvm *kvm,
> >       return ret;
> >  }
> >
> > +int svm_vm_copy_asid_to(struct kvm *kvm, unsigned int mirror_kvm_fd)
> > +{
> > +     struct file *mirror_kvm_file;
> > +     struct kvm *mirror_kvm;
> > +     struct kvm_sev_info *mirror_kvm_sev;
>
> What about using src and dst, e.g. src_kvm, dest_kvm_fd, dest_kvm, etc...?  For
> my brain, the mirror terminology adds an extra layer of translation.

I like source, but I think I'll keep mirror. I think it captures the
current state of it better--this isn't its own full-featured SEV VM; in
a sense it's a reflection of the source.

Unless everyone found this confusing?

>
> > +     unsigned int asid;
> > +     int ret;
> > +
> > +     if (!sev_guest(kvm))
> > +             return -ENOTTY;
> > +
> > +     mutex_lock(&kvm->lock);
> > +
> > +     /* Mirrors of mirrors should work, but let's not get silly */
>
> Do we really care?
>
> > +     if (is_mirroring_enc_context(kvm)) {
> > +             ret = -ENOTTY;
> > +             goto failed;
> > +     }
> > +
> > +     mirror_kvm_file = fget(mirror_kvm_fd);
> > +     if (!kvm_is_kvm(mirror_kvm_file)) {
> > +             ret = -EBADF;
> > +             goto failed;
> > +     }
> > +
> > +     mirror_kvm = mirror_kvm_file->private_data;
> > +
> > +     if (mirror_kvm == kvm || is_mirroring_enc_context(mirror_kvm)) {
>
> This is_mirroring_enc_context() check needs to be after mirror_kvm->lock is
> acquired, else there's a TOCTOU race.

Nice. Yeah, I've flipped it around as per Paolo's point and this problem goes
away.

>
> I also suspect there needs to be more checks on the destination.  E.g. what
> happens if the destination already has vCPUs that are currently running?  Though
> on that front, sev_guest_init() also doesn't guard against this.  Feels like
> that flow and this one should check kvm->created_vcpus.
>
> > +             ret = -ENOTTY;
> > +             fput(mirror_kvm_file);
>
> Nit, probably worth adding a second error label to handle this fput(), e.g. in
> case additional checks are needed in the future.  Actually, I suspect that's
> already needed to fix the TOCTOU bug.
>
> > +             goto failed;
> > +     }
> > +
> > +     asid = *&to_kvm_svm(kvm)->sev_info.asid;
>
> Don't think "*&" is necessary. :-)

:')

>
> > +
> > +     /*
> > +      * The mirror_kvm holds an enc_context_owner ref so its asid can't
> > +      * disappear until we're done with it
> > +      */
> > +     kvm_get_kvm(kvm);
>
> Do we really need/want to take a reference to the source 'struct kvm'?  IMO,
> the so called mirror should never be doing operations with its source context,
> i.e. should not have easy access to 'struct kvm'.  We already have a reference
> to the fd, any reason not to use that to ensure liveliness of the source?

I agree the mirror should never be running operations on the source. I
don't know that holding the fd instead of the kvm makes that much better
though; are there advantages to that I'm not seeing?

>
> > +
> > +     mutex_unlock(&kvm->lock);
> > +     mutex_lock(&mirror_kvm->lock);
> > +
> > +     /* Set enc_context_owner and copy its encryption context over */
> > +     mirror_kvm_sev = &to_kvm_svm(mirror_kvm)->sev_info;
> > +     mirror_kvm_sev->enc_context_owner = kvm;
> > +     mirror_kvm_sev->asid = asid;
> > +     mirror_kvm_sev->active = true;
>
> I would prefer a prep patch to move "INIT_LIST_HEAD(&sev->regions_list);" from
> sev_guest_init() to when the VM is instantiated.  Shaving a few cycles in that
> flow is meaningless, and not initializing the list of regions is odd, and will
> cause problems if mirrors are allowed to pin memory (or do PSP commands).

It seems like we can keep this a lot simpler and easier to reason about by not
allowing mirrors to pin memory or do PSP commands. That was the intent. We
don't gain anything but complexity by allowing this to be a fully featured SEV
VM. Unless anyone can think of a good reason we'd want to have a mirror
VM be able to do more than this?

> > @@ -5321,6 +5321,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> >                       kvm->arch.bus_lock_detection_enabled = true;
> >               r = 0;
> >               break;
> > +     case KVM_CAP_VM_COPY_ENC_CONTEXT_TO:
> > +             r = -ENOTTY;
> > +             if (kvm_x86_ops.vm_copy_enc_context_to)
> > +                     r = kvm_x86_ops.vm_copy_enc_context_to(kvm, cap->args[0]);
>
> This can be a static call.
>
> On a related topic, does this really need to be a separate ioctl()?  TDX can't
> share encryption contexts, everything that KVM can do for a TDX guest requires
> the per-VM context.  Unless there is a known non-x86 use case, it might be
> better to make this a mem_enc_op, and then it can be named SEV_SHARE_ASID or
> something.

I'd prefer to leave this as a capability in the same way the
register_enc_region calls work. Moving it into mem_enc_ops means we'll
have to do some messy locking to avoid race conditions with the second
VM, since kvm gets locked in enc_ops. Also seems weird to me having this
hack grouped in with all the PSP commands. If I'm the only one that
thinks this is cleaner, I'll move it though.

Interesting about the platform, too. If you're sure we'll never need to
build this for any other platform, I'll at least rename it to be
AMD-specific. There's no non-SEV scenario anyone can think of that might
want to do this?


>
> > +             return r;
> >       default:
> >               r = -EINVAL;
> >               break;
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index e126ebda36d0..18491638f070 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -637,6 +637,7 @@ void kvm_exit(void);
> >
> >  void kvm_get_kvm(struct kvm *kvm);
> >  void kvm_put_kvm(struct kvm *kvm);
> > +bool kvm_is_kvm(struct file *file);
> >  void kvm_put_kvm_no_destroy(struct kvm *kvm);
> >
> >  static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id)
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 63f8f6e95648..5b6296772db9 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -1077,6 +1077,7 @@ struct kvm_ppc_resize_hpt {
> >  #define KVM_CAP_SYS_HYPERV_CPUID 191
> >  #define KVM_CAP_DIRTY_LOG_RING 192
> >  #define KVM_CAP_X86_BUS_LOCK_EXIT 193
> > +#define KVM_CAP_VM_COPY_ENC_CONTEXT_TO 194
> >
> >  #ifdef KVM_CAP_IRQ_ROUTING
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 001b9de4e727..5f31fcda4777 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -739,6 +739,8 @@ void __weak kvm_arch_pre_destroy_vm(struct kvm *kvm)
> >  {
> >  }
> >
> > +static struct file_operations kvm_vm_fops;
>
> I'd probably prefer to put the helper just below kvm_vm_fops instead of adding
> a forward declaration.  IMO it's not all that important to add the helper close
> to kvm_get/put_kvm().
>
> > +
> >  static struct kvm *kvm_create_vm(unsigned long type)
> >  {
> >       struct kvm *kvm = kvm_arch_alloc_vm();
> > @@ -903,6 +905,12 @@ void kvm_put_kvm(struct kvm *kvm)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_put_kvm);
> >
> > +bool kvm_is_kvm(struct file *file)
>
> Heh, maybe kvm_file_is_kvm()?  or just file_is_kvm()?
>
> > +{
> > +     return file && file->f_op == &kvm_vm_fops;
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_is_kvm);
> > +
> >  /*
> >   * Used to put a reference that was taken on behalf of an object associated
> >   * with a user-visible file descriptor, e.g. a vcpu or device, if installation
> > --
> > 2.30.0.617.g56c4b15f3c-goog
> >
Tobin Feldman-Fitzthum March 15, 2021, 5:05 p.m. UTC | #22
On 3/11/21 11:29 AM, Paolo Bonzini wrote:

> On 11/03/21 16:30, Tobin Feldman-Fitzthum wrote:
>> I am not sure how the mirror VM will be supported in QEMU. Usually 
>> there is one QEMU process per-vm. Now we would need to run a second 
>> VM and communicate with it during migration. Is there a way to do 
>> this without adding significant complexity?
>
> I can answer this part.  I think this will actually be simpler than 
> with auxiliary vCPUs.  There will be a separate pair of VM+vCPU file 
> descriptors within the same QEMU process, and some code to set up the 
> memory map using KVM_SET_USER_MEMORY_REGION.
>
> However, the code to run this VM will be very small as the VM does not 
> have to do MMIO, interrupts, live migration (of itself), etc.  It just 
> starts up and communicates with QEMU using a mailbox at a 
> predetermined address.

We've been starting up our Migration Handler via OVMF. I'm not sure if 
this would work with a minimal setup in QEMU.

-Tobin

>
> I also think (but I'm not 100% sure) that the auxiliary VM does not 
> have to watch changes in the primary VM's memory map (e.g. mapping and 
> unmapping of BARs).  In QEMU terms, the auxiliary VM's memory map 
> tracks RAMBlocks, not MemoryRegions, which makes things much simpler.
>
> There are already many examples of mini VMMs running special purpose 
> VMs in the kernel's tools/testing/selftests/kvm directory, and I don't 
> think the QEMU code would be any more complex than that.
>
> Paolo
>
Paolo Bonzini March 15, 2021, 5:29 p.m. UTC | #23
On 15/03/21 18:05, Tobin Feldman-Fitzthum wrote:
>>
>> I can answer this part.  I think this will actually be simpler than 
>> with auxiliary vCPUs.  There will be a separate pair of VM+vCPU file 
>> descriptors within the same QEMU process, and some code to set up the 
>> memory map using KVM_SET_USER_MEMORY_REGION.
>>
>> However, the code to run this VM will be very small as the VM does not 
>> have to do MMIO, interrupts, live migration (of itself), etc.  It just 
>> starts up and communicates with QEMU using a mailbox at a 
>> predetermined address.
> 
> We've been starting up our Migration Handler via OVMF. I'm not sure if 
> this would work with a minimal setup in QEMU.

Yeah, the way to start up the migration handler would be completely 
different; you'd have to do so very early (probably in SEC).

Paolo
Sean Christopherson March 16, 2021, 5:52 p.m. UTC | #24
On Fri, Mar 12, 2021, Nathan Tempelman wrote:
> On Wed, Feb 24, 2021 at 9:37 AM Sean Christopherson <seanjc@google.com> wrote:
> > > @@ -1282,6 +1299,65 @@ int svm_unregister_enc_region(struct kvm *kvm,
> > >       return ret;
> > >  }
> > >
> > > +int svm_vm_copy_asid_to(struct kvm *kvm, unsigned int mirror_kvm_fd)
> > > +{
> > > +     struct file *mirror_kvm_file;
> > > +     struct kvm *mirror_kvm;
> > > +     struct kvm_sev_info *mirror_kvm_sev;
> >
> > What about using src and dst, e.g. src_kvm, dest_kvm_fd, dest_kvm, etc...?  For
> > my brain, the mirror terminology adds an extra layer of translation.
> 
> I like source, but I think I'll keep mirror. I think it captures the current
> state of it better -- this isn't its own full-featured SEV VM; in a sense it's
> a reflection of the source.

The two things I dislike about "mirror" are that (for me) it's not clear whether
"mirror" is the source or the dest, and "mirror" implies that there is ongoing
synchronization.

> > > +
> > > +     /*
> > > +      * The mirror_kvm holds an enc_context_owner ref so its asid can't
> > > +      * disappear until we're done with it
> > > +      */
> > > +     kvm_get_kvm(kvm);
> >
> > Do we really need/want to take a reference to the source 'struct kvm'?  IMO,
> > the so called mirror should never be doing operations with its source context,
> > i.e. should not have easy access to 'struct kvm'.  We already have a reference
> > to the fd, any reason not to use that to ensure liveliness of the source?
> 
> I agree the mirror should never be running operations on the source. I don't
> know that holding the fd instead of the kvm makes that much better though,
> are there advantages to that I'm not seeing?

If there's no kvm pointer, it's much more difficult for someone to do the wrong
thing, and any such shenanigans stick out like a sore thumb in patches, which
makes reviewing future changes easier.

> > > +     mutex_unlock(&kvm->lock);
> > > +     mutex_lock(&mirror_kvm->lock);
> > > +
> > > +     /* Set enc_context_owner and copy its encryption context over */
> > > +     mirror_kvm_sev = &to_kvm_svm(mirror_kvm)->sev_info;
> > > +     mirror_kvm_sev->enc_context_owner = kvm;
> > > +     mirror_kvm_sev->asid = asid;
> > > +     mirror_kvm_sev->active = true;
> >
> > I would prefer a prep patch to move "INIT_LIST_HEAD(&sev->regions_list);" from
> > sev_guest_init() to when the VM is instantiated.  Shaving a few cycles in that
> > flow is meaningless, and not initializing the list of regions is odd, and will
> > cause problems if mirrors are allowed to pin memory (or do PSP commands).
> 
> It seems like we can keep this a lot simpler and easier to reason about by not
> allowing mirrors to pin memory or do PSP commands. That was the intent. We
> don't gain anything but complexity by allowing this to be a fully featured SEV
> VM. Unless anyone can think of a good reason we'd want to have a mirror VM be
> able to do more than this?

I suspect the migration helper will need to pin memory independent of the real
VM.

But, for me, that's largely orthogonal to initializing regions_list.  Leaving a
list uninitialized for no good reason is an unnecessary risk, as any related
bugs are all but guaranteed to crash the host.
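
For illustration, a minimal sketch of such a prep patch (assuming svm_vm_init(),
or whatever per-VM init hook the SVM code has, is the right place; the rest of
that function is omitted):

static int svm_vm_init(struct kvm *kvm)
{
	/*
	 * Initialize the SEV regions_list at VM creation instead of in
	 * sev_guest_init(), so the list is valid even for VMs that never
	 * issue KVM_SEV_INIT (e.g. mirrors).
	 */
	INIT_LIST_HEAD(&to_kvm_svm(kvm)->sev_info.regions_list);

	/* ... existing svm_vm_init() contents ... */
	return 0;
}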

> > > @@ -5321,6 +5321,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > >                       kvm->arch.bus_lock_detection_enabled = true;
> > >               r = 0;
> > >               break;
> > > +     case KVM_CAP_VM_COPY_ENC_CONTEXT_TO:
> > > +             r = -ENOTTY;
> > > +             if (kvm_x86_ops.vm_copy_enc_context_to)
> > > +                     r = kvm_x86_ops.vm_copy_enc_context_to(kvm, cap->args[0]);
> >
> > This can be a static call.
> >
> > On a related topic, does this really need to be a separate ioctl()?  TDX can't
> > share encryption contexts, everything that KVM can do for a TDX guest requires
> > the per-VM context.  Unless there is a known non-x86 use case, it might be
> > better to make this a mem_enc_op, and then it can be named SEV_SHARE_ASID or
> > something.
> 
> I'd prefer to leave this as a capability in the same way the
> register_enc_region calls work. Moving it into mem_enc_ops means we'll have
> to do some messy locking to avoid race conditions with the second VM, since
> kvm gets locked in enc_ops.

Eh, it's not that bad.

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 83e00e524513..0cb8a5022580 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -1124,6 +1124,9 @@ int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
        if (copy_from_user(&sev_cmd, argp, sizeof(struct kvm_sev_cmd)))
                return -EFAULT;

+       if (sev_cmd.id == SEV_SHARE_ASID)
+               return sev_shared_asid(kvm, &sev_cmd);
+
        mutex_lock(&kvm->lock);

        switch (sev_cmd.id) {

> Also seems weird to me having this hack grouped in with all the PSP commands.
> If I'm the only one who thinks this is cleaner, I'll move it, though.

Heh, IMO, that ship already sailed.  KVM_MEMORY_ENCRYPT_OP is quite the misnomer
given that most of the commands do way more than fiddle with memory encryption.
At least with this one, the ASID is directly tied to hardware's encryption of
memory.

> Interesting about the platform, too. If you're sure we'll never need to build
> this for any other platform, I'll at least rename it to be AMD-specific.
> There's no non-SEV scenario anyone can think of that might want to do this?
Paolo Bonzini March 16, 2021, 5:58 p.m. UTC | #25
On 16/03/21 18:52, Sean Christopherson wrote:
>> I don't
>> know that holding the fd instead of the kvm makes that much better though,
>> are there advantages to that I'm not seeing?
> If there's no kvm pointer, it's much more difficult for someone to do the wrong
> thing, and any such shenanigans stick out like a sore thumb in patches, which
> makes reviewing future changes easier.

On the other hand, holding the fd open complicates the code; reference 
counting rules are already hard enough.

I think we only need a replacement for "mirror"; what about "dependent"?
"is_dependent_enc_context" seems clear enough.

Paolo
Sean Christopherson March 16, 2021, 6:08 p.m. UTC | #26
On Tue, Mar 16, 2021, Paolo Bonzini wrote:
> On 16/03/21 18:52, Sean Christopherson wrote:
> > > I don't
> > > know that holding the fd instead of the kvm makes that much better though,
> > > are there advantages to that I'm not seeing?
> > If there's no kvm pointer, it's much more difficult for someone to do the wrong
> > thing, and any such shenanigans stick out like a sore thumb in patches, which
> > makes reviewing future changes easier.
> 
> On the other hand holding the fd open complicates the code, reference
> counting rules are already hard enough.

How so?  KVM already has to do "fget(source_kvm)"; can't we just hold onto
that instead of doing an additional kvm_get_kvm()?
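
As a sketch of that direction (illustrative only; enc_context_file is a made-up
field, and this assumes the fget()'d fd is the one that pins the context owner):

	/* keep the fget()'d file as the reference, no extra kvm_get_kvm() */
	dst_sev->enc_context_file = src_kvm_file;

	/* ... and on teardown of the dependent VM, in sev_vm_destroy(): */
	fput(sev->enc_context_file);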
Kalra, Ashish May 24, 2021, 9:29 p.m. UTC | #27
Hello Paolo,

I am working on prototype code in QEMU to start a mirror VM running in parallel to the primary VM. Initially I had the idea of running a completely parallel VM, for example using 
QEMU's microvm machine/platform, but the main issue with this idea is the difficulty of sharing the primary VM's memory with it.

Hence, I started exploring running an internal thread like the current per-vCPU thread(s) in QEMU. The main issue is that QEMU has a lot of global state, especially the KVMState 
structure, which is per-VM, and all the KVM vCPUs are very tightly tied into it. It does not make sense to add a completely new KVMState structure instance for the mirror VM,
as then the mirror VM would not remain lightweight at all. 

Hence, the mirror VM I am adding has to integrate with the current KVMState structure and the "global" KVM state in QEMU. This required adding some parallel KVM code in
QEMU, for example to do ioctls on the mirror VM, similar to the primary VM. Details below:

The mirror_vm_fd is added to the KVMState structure itself. 

The parallel code I mentioned is like the following:

#define kvm_mirror_vm_enable_cap(s, capability, cap_flags, ...)      \
    ({                                                               \
        struct kvm_enable_cap cap = {                                \
            .cap = capability,                                       \
            .flags = cap_flags,                                      \
        };                                                           \
        uint64_t args_tmp[] = { __VA_ARGS__ };                       \
        size_t n = MIN(ARRAY_SIZE(args_tmp), ARRAY_SIZE(cap.args));  \
        memcpy(cap.args, args_tmp, n * sizeof(cap.args[0]));         \
        kvm_mirror_vm_ioctl(s, KVM_ENABLE_CAP, &cap);                \
    })


+int kvm_mirror_vm_ioctl(KVMState *s, int type, ...)
+{
+    int ret;
+    void *arg;
+    va_list ap;
+
+    va_start(ap, type);
+    arg = va_arg(ap, void *);
+    va_end(ap);
+
+    trace_kvm_vm_ioctl(type, arg);
+    ret = ioctl(s->mirror_vm_fd, type, arg);
+    if (ret == -1) {
+        ret = -errno;
+    }
+    return ret;
+}
+

The vcpu ioctl code works as it is. 

kvm_arch_put_registers() also needed a mirror VM variant, kvm_arch_mirror_put_registers(): for example, saving MSRs on the mirror VM would require enabling
the in-kernel irqchip support on the mirror VM; otherwise, kvm_put_msrs() fails. Hence, kvm_arch_mirror_put_registers() keeps the mirror VM simpler by not saving
any MSRs and not needing the in-kernel irqchip support.
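
A minimal sketch of what such a variant might look like (assuming it lives in
target/i386/kvm.c next to kvm_arch_put_registers(), so the static helpers
kvm_getput_regs() and kvm_put_sregs() are visible; illustrative, not the actual
prototype code):

static int kvm_arch_mirror_put_registers(CPUState *cs, int level)
{
    X86CPU *x86_cpu = X86_CPU(cs);
    int ret;

    /* Put only GPRs and segment registers; skip MSRs and everything
     * that depends on the in-kernel irqchip. */
    ret = kvm_getput_regs(x86_cpu, 1);   /* 1 == set */
    if (ret < 0) {
        return ret;
    }
    return kvm_put_sregs(x86_cpu);
}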

I had a lot of issues dynamically adding a new vCPU, i.e. a new CPUState structure, due to QEMU's object model (QOM), which requires that every QEMU
structure/object contain the parent/base class/object and then all the derived classes after that. It was difficult to add a new CPU object dynamically, hence I had to reuse
one of the "-smp" CPUs passed on the QEMU command line as the mirror vCPU. This also helps in having the X86CPU "backing" structure for the mirror vCPU's CPU object, 
which allows using most of the KVM code in QEMU for the mirror vCPU. Also, the mirror vCPU's CPU object will have the CPUX86State structure embedded, which contains the 
CPU register state for the mirror vCPU. 

The mirror vCPU now runs a simpler KVM run loop; it does not have the in-kernel irqchip (interrupt controller) or any other interrupt controller, such as the kvmapic, supported
and enabled for it. As of now it still does both I/O and MMIO handling.

Looking fwd. to comments, feedback, thoughts on the above approach.

Thanks,
Ashish

Kalra, Ashish May 27, 2021, 3:51 p.m. UTC | #28
Looking at the kvm selftests in the kernel, I think the other alternative is to:

Maintain separate data structures like struct kvm_vm and struct vcpu for the mirror VM, but then quite a bit of 
the KVM code in QEMU for the mirror VM will be duplicated. 

For example, this will add separate and duplicated functionality for: 

vm_check_cap/vm_enable_cap,
vm_create,
vcpu_run,
vcpu_get_regs, vcpu_sregs_get/set,
vcpu_ioctl,
vm_ioctl, etc., etc.

Also, I think that once the mirror VM starts booting and running the UEFI code, it might be only during the PEI or DXE phase that
it will start actually running the MH code, so the mirror VM probably still needs to handle KVM_EXIT_IO when the SEC phase does I/O;
I can see PIC accesses and Debug Agent initialization stuff in the SEC startup code.
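
For illustration, a rough sketch of the kind of minimal run loop this implies
(plain KVM UAPI, QEMU glue omitted; handle_pio() is a hypothetical helper):

static int mirror_vcpu_run(int vcpu_fd, struct kvm_run *run)
{
    for (;;) {
        if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
            return -errno;
        }
        switch (run->exit_reason) {
        case KVM_EXIT_IO: {
            /* data for the access lives inside the shared kvm_run page */
            void *data = (uint8_t *)run + run->io.data_offset;

            handle_pio(run->io.port, data, run->io.direction,
                       run->io.size, run->io.count);
            break;
        }
        case KVM_EXIT_HLT:
            return 0;
        default:
            return -EINVAL;    /* nothing else expected from the MH */
        }
    }
}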

Thanks,
Ashish

Kalra, Ashish June 1, 2021, 8:26 a.m. UTC | #29
To add: using this alternative approach and handling KVM_EXIT_IO via kvm_handle_io() still requires the
CPUState structure and the backing X86CPU structure, for example as part of kvm_arch_post_run(),
to get the MemTxAttrs needed by kvm_handle_io().

The other option is to embed the CPUX86State as part of the struct vcpu for the mirror VM, and then
use it to get the MemTxAttrs.
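
A sketch of that second option (struct and field names are made up;
cpu_get_mem_attrs() is the same helper kvm_arch_post_run() uses):

struct mirror_vcpu {
    int fd;
    struct kvm_run *run;
    CPUX86State env;        /* register state of the mirror vCPU */
};

static MemTxAttrs mirror_vcpu_mem_attrs(struct mirror_vcpu *vcpu)
{
    return cpu_get_mem_attrs(&vcpu->env);
}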

So there is still quite a bit of KVM code duplication in QEMU with the alternative approach, but it is
probably still cleaner than the initial approach.

Thanks,
Ashish

diff mbox series

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 482508ec7cc4..438b647663c9 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6213,6 +6213,23 @@  the bus lock vm exit can be preempted by a higher priority VM exit, the exit
 notifications to userspace can be KVM_EXIT_BUS_LOCK or other reasons.
 KVM_RUN_BUS_LOCK flag is used to distinguish between them.
 
+7.23 KVM_CAP_VM_COPY_ENC_CONTEXT_TO
+-----------------------------------
+
+Architectures: x86 SEV enabled
+Type: system
+Parameters: args[0] is the fd of the kvm to mirror encryption context to
+Returns: 0 on success; ENOTTY on error
+
+This capability enables userspace to copy encryption context from a primary
+vm to the vm indicated by the fd.
+
+This is intended to support in-guest workloads scheduled by the host. This
+allows the in-guest workload to maintain its own NPTs and keeps the two vms
+from accidentally clobbering each other with interrupts and the like (separate
+APIC/MSRs/etc).
+
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 84499aad01a4..b7636c009647 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1334,6 +1334,7 @@  struct kvm_x86_ops {
 	int (*mem_enc_op)(struct kvm *kvm, void __user *argp);
 	int (*mem_enc_reg_region)(struct kvm *kvm, struct kvm_enc_region *argp);
 	int (*mem_enc_unreg_region)(struct kvm *kvm, struct kvm_enc_region *argp);
+	int (*vm_copy_enc_context_to)(struct kvm *kvm, unsigned int child_fd);
 
 	int (*get_msr_feature)(struct kvm_msr_entry *entry);
 
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 874ea309279f..2bad6cd2cb4c 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -66,6 +66,11 @@  static int sev_flush_asids(void)
 	return ret;
 }
 
+static inline bool is_mirroring_enc_context(struct kvm *kvm)
+{
+	return !!to_kvm_svm(kvm)->sev_info.enc_context_owner;
+}
+
 /* Must be called with the sev_bitmap_lock held */
 static bool __sev_recycle_asids(int min_asid, int max_asid)
 {
@@ -1124,6 +1129,10 @@  int svm_mem_enc_op(struct kvm *kvm, void __user *argp)
 	if (copy_from_user(&sev_cmd, argp, sizeof(struct kvm_sev_cmd)))
 		return -EFAULT;
 
+	/* enc_context_owner handles all memory enc operations */
+	if (is_mirroring_enc_context(kvm))
+		return -ENOTTY;
+
 	mutex_lock(&kvm->lock);
 
 	switch (sev_cmd.id) {
@@ -1186,6 +1195,10 @@  int svm_register_enc_region(struct kvm *kvm,
 	if (!sev_guest(kvm))
 		return -ENOTTY;
 
+	/* If kvm is mirroring encryption context it isn't responsible for it */
+	if (is_mirroring_enc_context(kvm))
+		return -ENOTTY;
+
 	if (range->addr > ULONG_MAX || range->size > ULONG_MAX)
 		return -EINVAL;
 
@@ -1252,6 +1265,10 @@  int svm_unregister_enc_region(struct kvm *kvm,
 	struct enc_region *region;
 	int ret;
 
+	/* If kvm is mirroring encryption context it isn't responsible for it */
+	if (is_mirroring_enc_context(kvm))
+		return -ENOTTY;
+
 	mutex_lock(&kvm->lock);
 
 	if (!sev_guest(kvm)) {
@@ -1282,6 +1299,65 @@  int svm_unregister_enc_region(struct kvm *kvm,
 	return ret;
 }
 
+int svm_vm_copy_asid_to(struct kvm *kvm, unsigned int mirror_kvm_fd)
+{
+	struct file *mirror_kvm_file;
+	struct kvm *mirror_kvm;
+	struct kvm_sev_info *mirror_kvm_sev;
+	unsigned int asid;
+	int ret;
+
+	if (!sev_guest(kvm))
+		return -ENOTTY;
+
+	mutex_lock(&kvm->lock);
+
+	/* Mirrors of mirrors should work, but let's not get silly */
+	if (is_mirroring_enc_context(kvm)) {
+		ret = -ENOTTY;
+		goto failed;
+	}
+
+	mirror_kvm_file = fget(mirror_kvm_fd);
+	if (!kvm_is_kvm(mirror_kvm_file)) {
+		ret = -EBADF;
+		goto failed;
+	}
+
+	mirror_kvm = mirror_kvm_file->private_data;
+
+	if (mirror_kvm == kvm || is_mirroring_enc_context(mirror_kvm)) {
+		ret = -ENOTTY;
+		fput(mirror_kvm_file);
+		goto failed;
+	}
+
+	asid = to_kvm_svm(kvm)->sev_info.asid;
+
+	/*
+	 * The mirror_kvm holds an enc_context_owner ref so its asid can't
+	 * disappear until we're done with it
+	 */
+	kvm_get_kvm(kvm);
+
+	mutex_unlock(&kvm->lock);
+	mutex_lock(&mirror_kvm->lock);
+
+	/* Set enc_context_owner and copy its encryption context over */
+	mirror_kvm_sev = &to_kvm_svm(mirror_kvm)->sev_info;
+	mirror_kvm_sev->enc_context_owner = kvm;
+	mirror_kvm_sev->asid = asid;
+	mirror_kvm_sev->active = true;
+
+	mutex_unlock(&mirror_kvm->lock);
+	fput(mirror_kvm_file);
+	return 0;
+
+failed:
+	mutex_unlock(&kvm->lock);
+	return ret;
+}
+
 void sev_vm_destroy(struct kvm *kvm)
 {
 	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
@@ -1293,6 +1369,13 @@  void sev_vm_destroy(struct kvm *kvm)
 
 	mutex_lock(&kvm->lock);
 
+	/* If this is a mirror_kvm release the enc_context_owner and skip sev cleanup */
+	if (is_mirroring_enc_context(kvm)) {
+		kvm_put_kvm(sev->enc_context_owner);
+		mutex_unlock(&kvm->lock);
+		return;
+	}
+
 	/*
 	 * Ensure that all guest tagged cache entries are flushed before
 	 * releasing the pages back to the system for use. CLFLUSH will
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 42d4710074a6..5308b7f8c11c 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4608,6 +4608,8 @@  static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.mem_enc_reg_region = svm_register_enc_region,
 	.mem_enc_unreg_region = svm_unregister_enc_region,
 
+	.vm_copy_enc_context_to = svm_vm_copy_asid_to,
+
 	.can_emulate_instruction = svm_can_emulate_instruction,
 
 	.apic_init_signal_blocked = svm_apic_init_signal_blocked,
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 39e071fdab0c..1e65c912552d 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -65,6 +65,7 @@  struct kvm_sev_info {
 	unsigned long pages_locked; /* Number of pages locked */
 	struct list_head regions_list;  /* List of registered regions */
 	u64 ap_jump_table;	/* SEV-ES AP Jump Table address */
+	struct kvm *enc_context_owner; /* Owner of copied encryption context */
 };
 
 struct kvm_svm {
@@ -561,6 +562,7 @@  int svm_register_enc_region(struct kvm *kvm,
 			    struct kvm_enc_region *range);
 int svm_unregister_enc_region(struct kvm *kvm,
 			      struct kvm_enc_region *range);
+int svm_vm_copy_asid_to(struct kvm *kvm, unsigned int child_fd);
 void pre_sev_run(struct vcpu_svm *svm, int cpu);
 void __init sev_hardware_setup(void);
 void sev_hardware_teardown(void);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3fa140383f5d..7bbcf37fcc2b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3753,6 +3753,7 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_X86_USER_SPACE_MSR:
 	case KVM_CAP_X86_MSR_FILTER:
 	case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
+	case KVM_CAP_VM_COPY_ENC_CONTEXT_TO:
 		r = 1;
 		break;
 	case KVM_CAP_XEN_HVM:
@@ -4649,7 +4650,6 @@  static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 			kvm_update_pv_runtime(vcpu);
 
 		return 0;
-
 	default:
 		return -EINVAL;
 	}
@@ -5321,6 +5321,11 @@  int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 			kvm->arch.bus_lock_detection_enabled = true;
 		r = 0;
 		break;
+	case KVM_CAP_VM_COPY_ENC_CONTEXT_TO:
+		r = -ENOTTY;
+		if (kvm_x86_ops.vm_copy_enc_context_to)
+			r = kvm_x86_ops.vm_copy_enc_context_to(kvm, cap->args[0]);
+		return r;
 	default:
 		r = -EINVAL;
 		break;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e126ebda36d0..18491638f070 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -637,6 +637,7 @@  void kvm_exit(void);
 
 void kvm_get_kvm(struct kvm *kvm);
 void kvm_put_kvm(struct kvm *kvm);
+bool kvm_is_kvm(struct file *file);
 void kvm_put_kvm_no_destroy(struct kvm *kvm);
 
 static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 63f8f6e95648..5b6296772db9 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1077,6 +1077,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_SYS_HYPERV_CPUID 191
 #define KVM_CAP_DIRTY_LOG_RING 192
 #define KVM_CAP_X86_BUS_LOCK_EXIT 193
+#define KVM_CAP_VM_COPY_ENC_CONTEXT_TO 194
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 001b9de4e727..5f31fcda4777 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -739,6 +739,8 @@  void __weak kvm_arch_pre_destroy_vm(struct kvm *kvm)
 {
 }
 
+static struct file_operations kvm_vm_fops;
+
 static struct kvm *kvm_create_vm(unsigned long type)
 {
 	struct kvm *kvm = kvm_arch_alloc_vm();
@@ -903,6 +905,12 @@  void kvm_put_kvm(struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(kvm_put_kvm);
 
+bool kvm_is_kvm(struct file *file)
+{
+	return file && file->f_op == &kvm_vm_fops;
+}
+EXPORT_SYMBOL_GPL(kvm_is_kvm);
+
 /*
  * Used to put a reference that was taken on behalf of an object associated
  * with a user-visible file descriptor, e.g. a vcpu or device, if installation