[RFC,00/13] Add support for Mirror VM.

Message ID cover.1629118207.git.ashish.kalra@amd.com (mailing list archive)

Message

Kalra, Ashish Aug. 16, 2021, 1:25 p.m. UTC
From: Ashish Kalra <ashish.kalra@amd.com>

This is an RFC series for Mirror VM support. Mirror VMs are
essentially secondary VMs that share the encryption context
(ASID) with a primary VM. The patch-set creates a new
VM and shares the primary VM's encryption context
with it using the KVM_CAP_VM_COPY_ENC_CONTEXT_FROM capability.
The mirror VM uses a separate pair of VM + vCPU file
descriptors and a simplified KVM run loop; for example,
it does not support any interrupt vmexits.
Currently the mirror VM shares the address space of the
primary VM.
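
(A minimal host-side sketch of what the context sharing looks like at
the raw KVM ioctl level; illustrative only, error handling omitted,
primary_vm_fd is assumed to be the already-created SEV-enabled VM:)

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int create_mirror_vm(int kvm_fd, int primary_vm_fd)
    {
        /* New VM with its own file descriptor... */
        int mirror_vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);

        /* ...but sharing the primary VM's encryption context (ASID). */
        struct kvm_enable_cap cap = {
            .cap     = KVM_CAP_VM_COPY_ENC_CONTEXT_FROM,
            .args[0] = primary_vm_fd,
        };
        ioctl(mirror_vm_fd, KVM_ENABLE_CAP, &cap);

        return mirror_vm_fd;   /* mirror vCPUs come from KVM_CREATE_VCPU */
    }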

The mirror VM can be used for running an in-guest migration 
helper (MH). It also might have future uses for other in-guest
operations.

The mirror VM support is enabled by adding a mirrorvcpus=N
suboption to -smp, which designates N of the guest's vcpus
(normally 1) as mirror vcpus.

Example usage for starting a 4-vcpu guest, of which 1 vcpu is marked as
a mirror vcpu:

    qemu-system-x86_64 -smp 4,mirrorvcpus=1 ...

Ashish Kalra (7):
  kvm: Add Mirror VM ioctl and enable cap interfaces.
  kvm: Add Mirror VM support.
  kvm: create Mirror VM and share primary VM's encryption context.
  softmmu/cpu: Skip mirror vcpu's for pause, resume and synchronization.
  kvm/apic: Disable in-kernel APIC support for mirror vcpu's.
  hw/acpi: disable modern CPU hotplug interface for mirror vcpu's
  hw/i386/pc: reduce fw_cfg boot cpu count taking into account mirror
    vcpu's.

Dov Murik (5):
  machine: Add mirrorvcpus=N suboption to -smp
  hw/boards: Add mirror_vcpu flag to CPUArchId
  hw/i386: Mark mirror vcpus in possible_cpus
  cpu: Add boolean mirror_vcpu field to CPUState
  hw/i386: Set CPUState.mirror_vcpu=true for mirror vcpus

Tobin Feldman-Fitzthum (1):
  hw/acpi: Don't include mirror vcpus in ACPI tables

 accel/kvm/kvm-accel-ops.c |  45 ++++++-
 accel/kvm/kvm-all.c       | 244 +++++++++++++++++++++++++++++++++++++-
 accel/kvm/kvm-cpus.h      |   2 +
 hw/acpi/cpu.c             |  21 +++-
 hw/core/cpu-common.c      |   1 +
 hw/core/machine.c         |   7 ++
 hw/i386/acpi-build.c      |   5 +
 hw/i386/acpi-common.c     |   5 +
 hw/i386/kvm/apic.c        |  15 +++
 hw/i386/pc.c              |  10 ++
 hw/i386/x86.c             |  11 +-
 include/hw/acpi/cpu.h     |   1 +
 include/hw/boards.h       |   3 +
 include/hw/core/cpu.h     |   3 +
 include/hw/i386/x86.h     |   3 +-
 include/sysemu/kvm.h      |  15 +++
 qapi/machine.json         |   5 +-
 softmmu/cpus.c            |  27 +++++
 softmmu/vl.c              |   3 +
 target/i386/kvm/kvm.c     |  42 +++++++
 20 files changed, 459 insertions(+), 9 deletions(-)

Comments

Claudio Fontana Aug. 16, 2021, 2:01 p.m. UTC | #1
On 8/16/21 3:25 PM, Ashish Kalra wrote:
> From: Ashish Kalra <ashish.kalra@amd.com>
> 
> This is an RFC series for Mirror VM support that are 
> essentially secondary VMs sharing the encryption context 
> (ASID) with a primary VM. The patch-set creates a new 
> VM and shares the primary VM's encryption context
> with it using the KVM_CAP_VM_COPY_ENC_CONTEXT_FROM capability.
> The mirror VM uses a separate pair of VM + vCPU file 
> descriptors and also uses a simplified KVM run loop, 
> for example, it does not support any interrupt vmexit's. etc.
> Currently the mirror VM shares the address space of the
> primary VM. 

Hi,

I'd expect some entry in docs/ ?

Thanks,

Claudio

> 
> The mirror VM can be used for running an in-guest migration 
> helper (MH). It also might have future uses for other in-guest
> operations.
> 
> The mirror VM support is enabled by adding a mirrorvcpus=N
> suboption to -smp, which also designates a few vcpus (normally 1)
> to the mirror VM.
> 
> Example usage for starting a 4-vcpu guest, of which 1 vcpu is marked as
> mirror vcpu.
> 
>     qemu-system-x86_64 -smp 4,mirrorvcpus=1 ...
> 
> Ashish Kalra (7):
>   kvm: Add Mirror VM ioctl and enable cap interfaces.
>   kvm: Add Mirror VM support.
>   kvm: create Mirror VM and share primary VM's encryption context.
>   softmmu/cpu: Skip mirror vcpu's for pause, resume and synchronization.
>   kvm/apic: Disable in-kernel APIC support for mirror vcpu's.
>   hw/acpi: disable modern CPU hotplug interface for mirror vcpu's
>   hw/i386/pc: reduce fw_cfg boot cpu count taking into account mirror
>     vcpu's.
> 
> Dov Murik (5):
>   machine: Add mirrorvcpus=N suboption to -smp
>   hw/boards: Add mirror_vcpu flag to CPUArchId
>   hw/i386: Mark mirror vcpus in possible_cpus
>   cpu: Add boolean mirror_vcpu field to CPUState
>   hw/i386: Set CPUState.mirror_vcpu=true for mirror vcpus
> 
> Tobin Feldman-Fitzthum (1):
>   hw/acpi: Don't include mirror vcpus in ACPI tables
> 
>  accel/kvm/kvm-accel-ops.c |  45 ++++++-
>  accel/kvm/kvm-all.c       | 244 +++++++++++++++++++++++++++++++++++++-
>  accel/kvm/kvm-cpus.h      |   2 +
>  hw/acpi/cpu.c             |  21 +++-
>  hw/core/cpu-common.c      |   1 +
>  hw/core/machine.c         |   7 ++
>  hw/i386/acpi-build.c      |   5 +
>  hw/i386/acpi-common.c     |   5 +
>  hw/i386/kvm/apic.c        |  15 +++
>  hw/i386/pc.c              |  10 ++
>  hw/i386/x86.c             |  11 +-
>  include/hw/acpi/cpu.h     |   1 +
>  include/hw/boards.h       |   3 +
>  include/hw/core/cpu.h     |   3 +
>  include/hw/i386/x86.h     |   3 +-
>  include/sysemu/kvm.h      |  15 +++
>  qapi/machine.json         |   5 +-
>  softmmu/cpus.c            |  27 +++++
>  softmmu/vl.c              |   3 +
>  target/i386/kvm/kvm.c     |  42 +++++++
>  20 files changed, 459 insertions(+), 9 deletions(-)
>
Paolo Bonzini Aug. 16, 2021, 2:15 p.m. UTC | #2
On 16/08/21 15:25, Ashish Kalra wrote:
> From: Ashish Kalra<ashish.kalra@amd.com>
> 
> This is an RFC series for Mirror VM support that are
> essentially secondary VMs sharing the encryption context
> (ASID) with a primary VM. The patch-set creates a new
> VM and shares the primary VM's encryption context
> with it using the KVM_CAP_VM_COPY_ENC_CONTEXT_FROM capability.
> The mirror VM uses a separate pair of VM + vCPU file
> descriptors and also uses a simplified KVM run loop,
> for example, it does not support any interrupt vmexit's. etc.
> Currently the mirror VM shares the address space of the
> primary VM.
> 
> The mirror VM can be used for running an in-guest migration
> helper (MH). It also might have future uses for other in-guest
> operations.

Hi,

first of all, thanks for posting this work and starting the discussion.

However, I am not sure if the in-guest migration helper vCPUs should use 
the existing KVM support code.  For example, they probably can just 
always work with host CPUID (copied directly from 
KVM_GET_SUPPORTED_CPUID), and they do not need to interface with QEMU's 
MMIO logic.  They would just sit on a "HLT" instruction and communicate 
with the main migration loop using some kind of standardized ring buffer 
protocol; the migration loop then executes KVM_RUN in order to start the 
processing of pages, and expects a KVM_EXIT_HLT when the VM has nothing 
to do or requires processing on the host.
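
As a rough sketch, that simplified loop boils down to (names other than
the KVM ioctls and exit codes are made up; error handling and the
mp_state/kick handling needed to re-enter after HLT are omitted):

    /* run is the mmap'ed struct kvm_run of the mirror vCPU fd */
    static void mh_run_loop(int mirror_vcpu_fd, struct kvm_run *run)
    {
        for (;;) {
            ioctl(mirror_vcpu_fd, KVM_RUN, 0);
            if (run->exit_reason == KVM_EXIT_HLT) {
                /* helper is idle or needs the host: service the
                 * shared ring buffer, then re-enter the guest */
                mh_service_ring_buffer();      /* hypothetical */
            } else {
                abort();     /* no other exits are expected/supported */
            }
        }
    }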

The migration helper can then also use its own address space, for 
example operating directly on ram_addr_t values with the helper running 
at very high virtual addresses.  Migration code can use a 
RAMBlockNotifier to invoke KVM_SET_USER_MEMORY_REGION on the mirror VM 
(and never enable dirty memory logging on the mirror VM, too, which has 
better performance).
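
A sketch of the per-RAMBlock mapping such a notifier could install on
the mirror VM (slot and mirror-GPA management omitted; note that flags
stays 0, i.e. KVM_MEM_LOG_DIRTY_PAGES is never set on the mirror):

    static void mirror_map_block(int mirror_vm_fd, uint32_t slot,
                                 uint64_t mirror_gpa, void *host_addr,
                                 uint64_t size)
    {
        struct kvm_userspace_memory_region region = {
            .slot            = slot,
            .flags           = 0,              /* no dirty logging */
            .guest_phys_addr = mirror_gpa,
            .memory_size     = size,
            .userspace_addr  = (uint64_t)(uintptr_t)host_addr,
        };
        ioctl(mirror_vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
    }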

With this implementation, the number of mirror vCPUs does not even have 
to be indicated on the command line.  The VM and its vCPUs can simply be 
created when migration starts.  In the SEV-ES case, the guest can even 
provide the VMSA that starts the migration helper.

The disadvantage is that, as you point out, in the future some of the 
infrastructure you introduce might be useful for VMPL0 operation on 
SEV-SNP.  My proposal above might require some code duplication. 
However, it might even be that VMPL0 operation works best with a model 
more similar to my sketch of the migration helper; it's really too early 
to say.

Paolo
Daniel P. Berrangé Aug. 16, 2021, 2:23 p.m. UTC | #3
On Mon, Aug 16, 2021 at 04:15:46PM +0200, Paolo Bonzini wrote:
> On 16/08/21 15:25, Ashish Kalra wrote:
> > From: Ashish Kalra<ashish.kalra@amd.com>
> > 
> > This is an RFC series for Mirror VM support that are
> > essentially secondary VMs sharing the encryption context
> > (ASID) with a primary VM. The patch-set creates a new
> > VM and shares the primary VM's encryption context
> > with it using the KVM_CAP_VM_COPY_ENC_CONTEXT_FROM capability.
> > The mirror VM uses a separate pair of VM + vCPU file
> > descriptors and also uses a simplified KVM run loop,
> > for example, it does not support any interrupt vmexit's. etc.
> > Currently the mirror VM shares the address space of the
> > primary VM.
> > 
> > The mirror VM can be used for running an in-guest migration
> > helper (MH). It also might have future uses for other in-guest
> > operations.
> 

snip

> With this implementation, the number of mirror vCPUs does not even have to
> be indicated on the command line.  The VM and its vCPUs can simply be
> created when migration starts.  In the SEV-ES case, the guest can even
> provide the VMSA that starts the migration helper.

I don't think management apps will accept that approach when pinning
guests. They will want control over how many extra vCPU threads are
created, what host pCPUs they are pinned to, and what scheduler
policies might be applied to them.

Regards,
Daniel
Kalra, Ashish Aug. 16, 2021, 2:44 p.m. UTC | #4
Hello Paolo,

On Mon, Aug 16, 2021 at 04:15:46PM +0200, Paolo Bonzini wrote:
> On 16/08/21 15:25, Ashish Kalra wrote:
> > From: Ashish Kalra<ashish.kalra@amd.com>
> > 
> > This is an RFC series for Mirror VM support that are
> > essentially secondary VMs sharing the encryption context
> > (ASID) with a primary VM. The patch-set creates a new
> > VM and shares the primary VM's encryption context
> > with it using the KVM_CAP_VM_COPY_ENC_CONTEXT_FROM capability.
> > The mirror VM uses a separate pair of VM + vCPU file
> > descriptors and also uses a simplified KVM run loop,
> > for example, it does not support any interrupt vmexit's. etc.
> > Currently the mirror VM shares the address space of the
> > primary VM.
> > 
> > The mirror VM can be used for running an in-guest migration
> > helper (MH). It also might have future uses for other in-guest
> > operations.
> 
> Hi,
> 
> first of all, thanks for posting this work and starting the discussion.
> 
> However, I am not sure if the in-guest migration helper vCPUs should use the
> existing KVM support code.  For example, they probably can just always work
> with host CPUID (copied directly from KVM_GET_SUPPORTED_CPUID), and they do
> not need to interface with QEMU's MMIO logic.  They would just sit on a
> "HLT" instruction and communicate with the main migration loop using some
> kind of standardized ring buffer protocol; the migration loop then executes
> KVM_RUN in order to start the processing of pages, and expects a
> KVM_EXIT_HLT when the VM has nothing to do or requires processing on the
> host.
> 

I am not sure we can really do without QEMU's MMIO logic. I think that once the
mirror VM starts booting and running the UEFI code, it might be only during
the PEI or DXE phase that it will actually start running the MH code,
so the mirror VM probably still needs to handle KVM_EXIT_IO when the SEC phase
does I/O; I can see PIC accesses and Debug Agent initialization stuff in the
SEC startup code.

Additionally, this still requires the CPUState{..} structure and the backing
"X86CPU" structure, for example as part of kvm_arch_post_run() to get
the MemTxAttrs needed by kvm_handle_io().
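
(For reference, a minimal sketch of the PIO handling such a run loop
would need; mh_pio_read/mh_pio_write are hypothetical, the kvm_run io
layout is the standard KVM one:)

    static void mirror_handle_io(struct kvm_run *run)
    {
        uint8_t *data = (uint8_t *)run + run->io.data_offset;
        size_t len = (size_t)run->io.size * run->io.count;

        if (run->io.direction == KVM_EXIT_IO_OUT) {
            mh_pio_write(run->io.port, data, len);
        } else {
            mh_pio_read(run->io.port, data, len);
        }
    }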

Thanks,
Ashish

> The migration helper can then also use its own address space, for example
> operating directly on ram_addr_t values with the helper running at very high
> virtual addresses.  Migration code can use a RAMBlockNotifier to invoke
> KVM_SET_USER_MEMORY_REGION on the mirror VM (and never enable dirty memory
> logging on the mirror VM, too, which has better performance).
> 
> With this implementation, the number of mirror vCPUs does not even have to
> be indicated on the command line.  The VM and its vCPUs can simply be
> created when migration starts.  In the SEV-ES case, the guest can even
> provide the VMSA that starts the migration helper.
> 
> The disadvantage is that, as you point out, in the future some of the
> infrastructure you introduce might be useful for VMPL0 operation on SEV-SNP.
> My proposal above might require some code duplication. However, it might
> even be that VMPL0 operation works best with a model more similar to my
> sketch of the migration helper; it's really too early to say.
> 
> Paolo
>
Paolo Bonzini Aug. 16, 2021, 2:58 p.m. UTC | #5
On 16/08/21 16:44, Ashish Kalra wrote:
> I think that once the mirror VM starts booting and running the UEFI
> code, it might be only during the PEI or DXE phase where it will
> start actually running the MH code, so mirror VM probably still need
> to handles KVM_EXIT_IO when SEC phase does I/O, I can see PIC
> accesses and Debug Agent initialization stuff in SEC startup code.

That may be a design of the migration helper code that you were working
with, but it's not necessary.

The migration helper can be just some code that the guest "donates" to
the host.  The entry point need not be the usual 0xfffffff0; it can be
booted directly in 64-bit mode with a CR3 and EIP that the guest
provides to the host---for example with a UEFI GUIDed structure.

In fact, the migration helper can run even before the guest has booted
and while the guest is paused, so I don't think that it is possible to
make use of any device emulation code in it.
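
For plain SEV, where the host can still set register state, that direct
64-bit entry could look roughly like this (mh_cr3/mh_entry are
assumptions, data segments and error handling omitted; for SEV-ES the
equivalent state would instead come from the guest-provided VMSA):

    static void start_mh_vcpu(int vcpu_fd, uint64_t mh_cr3,
                              uint64_t mh_entry)
    {
        struct kvm_sregs sregs;
        struct kvm_regs regs = { .rip = mh_entry, .rflags = 0x2 };

        ioctl(vcpu_fd, KVM_GET_SREGS, &sregs);
        sregs.cr3   = mh_cr3;                       /* guest-provided */
        sregs.cr4  |= 1u << 5;                      /* PAE */
        sregs.efer |= (1u << 8) | (1u << 10);       /* LME | LMA */
        sregs.cr0  |= (1u << 0) | (1ull << 31);     /* PE | PG */
        sregs.cs.base  = 0;
        sregs.cs.limit = 0xffffffff;
        sregs.cs.l     = 1;                         /* 64-bit code seg */
        sregs.cs.db    = 0;
        ioctl(vcpu_fd, KVM_SET_SREGS, &sregs);
        ioctl(vcpu_fd, KVM_SET_REGS, &regs);
    }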

Paolo
Paolo Bonzini Aug. 16, 2021, 3 p.m. UTC | #6
On 16/08/21 16:23, Daniel P. Berrangé wrote:
> snip
> 
>> With this implementation, the number of mirror vCPUs does not even have to
>> be indicated on the command line.  The VM and its vCPUs can simply be
>> created when migration starts.  In the SEV-ES case, the guest can even
>> provide the VMSA that starts the migration helper.
>
> I don't think management apps will accept that approach when pinning
> guests. They will want control over how many extra vCPU threads are
> created, what host pCPUs they are pinned to, and what schedular
> policies might be applied to them.

That doesn't require creating the migration threads at startup, or 
making them vCPU threads, does it?

The migration helper is guest code that is run within the context of the
migration thread(s) in order to decrypt/re-encrypt the code (using a
different tweak that is based on e.g. the ram_addr_t rather than the
HPA).  How does libvirt pin migration thread(s) currently?

Paolo
Kalra, Ashish Aug. 16, 2021, 3:13 p.m. UTC | #7
Hello Paolo,

On Mon, Aug 16, 2021 at 04:58:02PM +0200, Paolo Bonzini wrote:
> On 16/08/21 16:44, Ashish Kalra wrote:
> > I think that once the mirror VM starts booting and running the UEFI
> > code, it might be only during the PEI or DXE phase where it will
> > start actually running the MH code, so mirror VM probably still need
> > to handles KVM_EXIT_IO when SEC phase does I/O, I can see PIC
> > accesses and Debug Agent initialization stuff in SEC startup code.
> 
> That may be a design of the migration helper code that you were working
> with, but it's not necessary.
> 
Actually my comments are about a more generic MH code.

> The migration helper can be just some code that the guest "donates" to
> the host.  The entry point need not be the usual 0xfffffff0; it can be
> booted directly in 64-bit mode with a CR3 and EIP that the guest
> provides to the guest---for example with a UEFI GUIDed structure.

Yes, this is consistent with the MH code we are currently testing; it
boots directly into 64-bit mode. This is what Tobin's response also
points out.

Thanks,
Ashish
> 
> In fact, the migration helper can run even before the guest has booted
> and while the guest is paused, so I don't think that it is possible to
> make use of any device emulation code in it.
> 
> Paolo
>
Daniel P. Berrangé Aug. 16, 2021, 3:16 p.m. UTC | #8
On Mon, Aug 16, 2021 at 05:00:21PM +0200, Paolo Bonzini wrote:
> On 16/08/21 16:23, Daniel P. Berrangé wrote:
> > snip
> > 
> > > With this implementation, the number of mirror vCPUs does not even have to
> > > be indicated on the command line.  The VM and its vCPUs can simply be
> > > created when migration starts.  In the SEV-ES case, the guest can even
> > > provide the VMSA that starts the migration helper.
> > 
> > I don't think management apps will accept that approach when pinning
> > guests. They will want control over how many extra vCPU threads are
> > created, what host pCPUs they are pinned to, and what schedular
> > policies might be applied to them.
> 
> That doesn't require creating the migration threads at startup, or making
> them vCPU threads, does it?
> 
> The migration helper is guest code that is run within the context of the
> migration helper in order to decrypt/re-encrypt the code (using a different
> tweak that is based on e.g. the ram_addr_t rather than the HPA).  How does
> libvirt pin migration thread(s) currently?

I don't think we do explicit pinning of migration related threads right
now, which means they'll inherit characteristics of whichever thread
spawns the transient migration thread.  If the mgmt app has pinned the
emulator threads to a single CPU, then creating many migration threads
is a waste of time as they'll compete with each other.

We wouldn't need to create migration threads at startup - we should
just think about how we would identify and control them when created
at runtime. The complexity here is a trust issue - once guest code has
been run, we can't trust what QMP tells us - so I'm not sure how we
would learn what PIDs are associated with the transiently created
migration threads, in order to set their properties.


Regards,
Daniel
Paolo Bonzini Aug. 16, 2021, 3:35 p.m. UTC | #9
On 16/08/21 17:16, Daniel P. Berrangé wrote:
> I woudn't be needed to create migration threads at startup - we should
> just think about how we would identify and control them when created
> at runtime. The complexity here is a trust issue - once guest code has
> been run, we can't trust what QMP tells us - so I'm not sure how we
> would learn what PIDs are associated with the transiently created
> migration threads, in order to set their properties.

That would apply anyway to any kind of thread though.  It doesn't matter 
whether the migration thread runs host or (mostly) guest code.

Paolo
Paolo Bonzini Aug. 16, 2021, 3:38 p.m. UTC | #10
On 16/08/21 17:13, Ashish Kalra wrote:
>>> I think that once the mirror VM starts booting and running the UEFI
>>> code, it might be only during the PEI or DXE phase where it will
>>> start actually running the MH code, so mirror VM probably still need
>>> to handles KVM_EXIT_IO when SEC phase does I/O, I can see PIC
>>> accesses and Debug Agent initialization stuff in SEC startup code.
>> That may be a design of the migration helper code that you were working
>> with, but it's not necessary.
>>
> Actually my comments are about a more generic MH code.

I don't think that would be a good idea; designing QEMU's migration 
helper interface to be as constrained as possible is a good thing.  The 
migration helper is extremely security sensitive code, so it should not 
expose itself to the attack surface of the whole of QEMU.

Paolo
Dr. David Alan Gilbert Aug. 16, 2021, 3:48 p.m. UTC | #11
* Paolo Bonzini (pbonzini@redhat.com) wrote:
> On 16/08/21 17:13, Ashish Kalra wrote:
> > > > I think that once the mirror VM starts booting and running the UEFI
> > > > code, it might be only during the PEI or DXE phase where it will
> > > > start actually running the MH code, so mirror VM probably still need
> > > > to handles KVM_EXIT_IO when SEC phase does I/O, I can see PIC
> > > > accesses and Debug Agent initialization stuff in SEC startup code.
> > > That may be a design of the migration helper code that you were working
> > > with, but it's not necessary.
> > > 
> > Actually my comments are about a more generic MH code.
> 
> I don't think that would be a good idea; designing QEMU's migration helper
> interface to be as constrained as possible is a good thing.  The migration
> helper is extremely security sensitive code, so it should not expose itself
> to the attack surface of the whole of QEMU.

It's also odd in that it's provided by the guest and acting on behalf of
the migration code; that's an unusually trusting relationship.

Dave

> Paolo
>
Dr. David Alan Gilbert Aug. 16, 2021, 5:23 p.m. UTC | #12
* Paolo Bonzini (pbonzini@redhat.com) wrote:
> On 16/08/21 15:25, Ashish Kalra wrote:
> > From: Ashish Kalra<ashish.kalra@amd.com>
> > 
> > This is an RFC series for Mirror VM support that are
> > essentially secondary VMs sharing the encryption context
> > (ASID) with a primary VM. The patch-set creates a new
> > VM and shares the primary VM's encryption context
> > with it using the KVM_CAP_VM_COPY_ENC_CONTEXT_FROM capability.
> > The mirror VM uses a separate pair of VM + vCPU file
> > descriptors and also uses a simplified KVM run loop,
> > for example, it does not support any interrupt vmexit's. etc.
> > Currently the mirror VM shares the address space of the
> > primary VM.
> > 
> > The mirror VM can be used for running an in-guest migration
> > helper (MH). It also might have future uses for other in-guest
> > operations.
> 
> Hi,
> 
> first of all, thanks for posting this work and starting the discussion.
> 
> However, I am not sure if the in-guest migration helper vCPUs should use the
> existing KVM support code.  For example, they probably can just always work
> with host CPUID (copied directly from KVM_GET_SUPPORTED_CPUID),

Doesn't at least one form of SEV have some masking of CPUID that's
visible to the guest, so perhaps we have to match the main vCPUs' idea of
CPUIDs?

>  and they do
> not need to interface with QEMU's MMIO logic.  They would just sit on a
> "HLT" instruction and communicate with the main migration loop using some
> kind of standardized ring buffer protocol; the migration loop then executes
> KVM_RUN in order to start the processing of pages, and expects a
> KVM_EXIT_HLT when the VM has nothing to do or requires processing on the
> host.
> 
> The migration helper can then also use its own address space, for example
> operating directly on ram_addr_t values with the helper running at very high
> virtual addresses.  Migration code can use a RAMBlockNotifier to invoke
> KVM_SET_USER_MEMORY_REGION on the mirror VM (and never enable dirty memory
> logging on the mirror VM, too, which has better performance).

How does the use of a very high virtual address help ?

> With this implementation, the number of mirror vCPUs does not even have to
> be indicated on the command line.  The VM and its vCPUs can simply be
> created when migration starts.  In the SEV-ES case, the guest can even
> provide the VMSA that starts the migration helper.
> 
> The disadvantage is that, as you point out, in the future some of the
> infrastructure you introduce might be useful for VMPL0 operation on SEV-SNP.
> My proposal above might require some code duplication. However, it might
> even be that VMPL0 operation works best with a model more similar to my
> sketch of the migration helper; it's really too early to say.
> 

Dave

> Paolo
>
Steve Rutherford Aug. 16, 2021, 11:53 p.m. UTC | #13
On Mon, Aug 16, 2021 at 6:37 AM Ashish Kalra <Ashish.Kalra@amd.com> wrote:
>
> From: Ashish Kalra <ashish.kalra@amd.com>
>
> This is an RFC series for Mirror VM support that are
> essentially secondary VMs sharing the encryption context
> (ASID) with a primary VM. The patch-set creates a new
> VM and shares the primary VM's encryption context
> with it using the KVM_CAP_VM_COPY_ENC_CONTEXT_FROM capability.
> The mirror VM uses a separate pair of VM + vCPU file
> descriptors and also uses a simplified KVM run loop,
> for example, it does not support any interrupt vmexit's. etc.
> Currently the mirror VM shares the address space of the
> primary VM.
Sharing an address space is incompatible with post-copy migration via
UFFD on the target side. I'll be honest and say I'm not deeply
familiar with QEMU's implementation of post-copy, but I imagine there
must be a mapping of guest memory that doesn't fault: on the target
side (or on both sides), the migration helper will need to have its
view of guest memory go through that mapping, or a similar mapping.

Separately, I'm a little wary of leaving the migration helper mapped
into the shared address space as writable. Since the migration threads
will be executing guest-owned code, the guest could use these threads
to do whatever it pleases (including getting free cycles). The
migration helper's code needs to be trusted by both the host and the
guest. Making it non-writable, sourced by the host, and attested by
the hardware would mitigate these concerns. The host could also try to
monitor for malicious use of migration threads, but that would be
pretty finicky.  The host could competitively schedule the migration
helper vCPUs with the guest vCPUs, but I'd imagine that wouldn't be
the best for guest performance.


--Steve
Michael S. Tsirkin Aug. 17, 2021, 7:05 a.m. UTC | #14
On Mon, Aug 16, 2021 at 04:53:17PM -0700, Steve Rutherford wrote:
> Separately, I'm a little weary of leaving the migration helper mapped
> into the shared address space as writable. Since the migration threads
> will be executing guest-owned code, the guest could use these threads
> to do whatever it pleases (including getting free cycles). The
> migration helper's code needs to be trusted by both the host and the
> guest. Making it non-writable, sourced by the host, and attested by
> the hardware would mitigate these concerns.

Well it's an ABI to maintain against *both* guest and host then.

And a separate attestation isn't making things easier to manage.

I feel the guest risks much more than the hypervisor here;
the hypervisor at worst is giving out free cycles and that can
be mitigated, so it makes sense to have the guest be in control.

How about we source it from guest but write-protect it on the
hypervisor side? It could include a signature that hypervisor could
verify, which would be more flexible than hardware attestation.
Dr. David Alan Gilbert Aug. 17, 2021, 8:38 a.m. UTC | #15
* Steve Rutherford (srutherford@google.com) wrote:
> On Mon, Aug 16, 2021 at 6:37 AM Ashish Kalra <Ashish.Kalra@amd.com> wrote:
> >
> > From: Ashish Kalra <ashish.kalra@amd.com>
> >
> > This is an RFC series for Mirror VM support that are
> > essentially secondary VMs sharing the encryption context
> > (ASID) with a primary VM. The patch-set creates a new
> > VM and shares the primary VM's encryption context
> > with it using the KVM_CAP_VM_COPY_ENC_CONTEXT_FROM capability.
> > The mirror VM uses a separate pair of VM + vCPU file
> > descriptors and also uses a simplified KVM run loop,
> > for example, it does not support any interrupt vmexit's. etc.
> > Currently the mirror VM shares the address space of the
> > primary VM.
> Sharing an address space is incompatible with post-copy migration via
> UFFD on the target side. I'll be honest and say I'm not deeply
> familiar with QEMU's implementation of post-copy, but I imagine there
> must be a mapping of guest memory that doesn't fault: on the target
> side (or on both sides), the migration helper will need to have it's
> view of guest memory go through that mapping, or a similar mapping.

Ignoring SEV, our postcopy currently has a single mapping which is
guarded by UFFD. There is no 'no-fault' mapping.  We use the uffd ioctl
to 'place' a page into that space when we receive it.
But yes, I guess that can't work with SEV; as you say, if the helper
has to do the write, it'll have to do it into a shadow that it can write
to, even though the rest of the guest must UF on access.
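
(For reference, that "place" step is the UFFDIO_COPY ioctl on the
userfaultfd; stripped-down sketch, errors ignored:)

    #include <linux/userfaultfd.h>

    static void place_page(int uffd, uint64_t guest_addr,
                           void *src_page, size_t page_size)
    {
        struct uffdio_copy copy = {
            .dst = guest_addr,
            .src = (uint64_t)(uintptr_t)src_page,
            .len = page_size,
        };
        ioctl(uffd, UFFDIO_COPY, &copy);
    }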

> Separately, I'm a little weary of leaving the migration helper mapped
> into the shared address space as writable. Since the migration threads
> will be executing guest-owned code, the guest could use these threads
> to do whatever it pleases (including getting free cycles)

Agreed.

> . The
> migration helper's code needs to be trusted by both the host and the
> guest. 


> Making it non-writable, sourced by the host, and attested by
> the hardware would mitigate these concerns.

Some people worry about having the host supply the guest firmware,
because they worry they'll be railroaded into using something they
don't actually trust, and if their aim of using SEV etc is to avoid
trusting the cloud owner, that breaks that.

So for them, I think they want the migration helper to be trusted by the
guest and tolerated by the host.

> The host could also try to
> monitor for malicious use of migration threads, but that would be
> pretty finicky.  The host could competitively schedule the migration
> helper vCPUs with the guest vCPUs, but I'd imagine that wouldn't be
> the best for guest performance.

The CPU usage has to go somewhere.

Dave

> 
> --Steve
>
Kalra, Ashish Aug. 17, 2021, 2:08 p.m. UTC | #16
Hello Dave, Steve,

On Tue, Aug 17, 2021 at 09:38:24AM +0100, Dr. David Alan Gilbert wrote:
> * Steve Rutherford (srutherford@google.com) wrote:
> > On Mon, Aug 16, 2021 at 6:37 AM Ashish Kalra <Ashish.Kalra@amd.com> wrote:
> > >
> > > From: Ashish Kalra <ashish.kalra@amd.com>
> > >
> > > This is an RFC series for Mirror VM support that are
> > > essentially secondary VMs sharing the encryption context
> > > (ASID) with a primary VM. The patch-set creates a new
> > > VM and shares the primary VM's encryption context
> > > with it using the KVM_CAP_VM_COPY_ENC_CONTEXT_FROM capability.
> > > The mirror VM uses a separate pair of VM + vCPU file
> > > descriptors and also uses a simplified KVM run loop,
> > > for example, it does not support any interrupt vmexit's. etc.
> > > Currently the mirror VM shares the address space of the
> > > primary VM.
> > Sharing an address space is incompatible with post-copy migration via
> > UFFD on the target side. I'll be honest and say I'm not deeply
> > familiar with QEMU's implementation of post-copy, but I imagine there
> > must be a mapping of guest memory that doesn't fault: on the target
> > side (or on both sides), the migration helper will need to have it's
> > view of guest memory go through that mapping, or a similar mapping.
> 
> Ignoring SEV, our postcopy currently has a single mapping which is
> guarded by UFFD. There is no 'no-fault' mapping.  We use the uffd ioctl
> to 'place' a page into that space when we receive it.
> But yes, I guess that can't work with SEV; as you say; if the helper
> has to do the write, it'll have to do it into a shadow that it can write
> to, even though the rest of th e guest must UF on access.
> 

I assume that the MH will be sharing the address space of the source VM;
this is compatible with host-based live migration, where the source
VM runs in the context of the qemu process (with all its vcpu
threads and the migration thread).

Surely sharing the address space on the target side will be incompatible
with post-copy migration, as post-copy migration will need to set up UFFD
mappings to start running the target VM while post-copy migration is
active.

But will the UFFD mappings only be set up while post-copy migration is
active? Won't page mappings similar to the source VM's be set up on the
target VM after the migration process is complete?

Thanks,
Ashish

> > Separately, I'm a little weary of leaving the migration helper mapped
> > into the shared address space as writable. Since the migration threads
> > will be executing guest-owned code, the guest could use these threads
> > to do whatever it pleases (including getting free cycles)a
> 
> Agreed.
> 
> > . The
> > migration helper's code needs to be trusted by both the host and the
> > guest. 
> 
> 
> > Making it non-writable, sourced by the host, and attested by
> > the hardware would mitigate these concerns.
> 
> Some people worry about having the host supply the guest firmware,
> because they worry they'll be railroaded into using something they
> don't actually trust, and if there aim of using SEV etc is to avoid
> trusting the cloud owner, that breaks that.
> 
> So for them, I think they want the migration helper to be trusted by the
> guest and tolerated by the host.
> 
> > The host could also try to
> > monitor for malicious use of migration threads, but that would be
> > pretty finicky.  The host could competitively schedule the migration
> > helper vCPUs with the guest vCPUs, but I'd imagine that wouldn't be
> > the best for guest performance.
> 
> The CPU usage has to go somewhere.
> 
> Dave
> 
> > 
> > --Steve
> > 
> -- 
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
Paolo Bonzini Aug. 17, 2021, 4:32 p.m. UTC | #17
On 17/08/21 01:53, Steve Rutherford wrote:
> Separately, I'm a little weary of leaving the migration helper mapped
> into the shared address space as writable.

A related question here is what the API should be for how the migration 
helper sees the memory in both physical and virtual address.

First of all, I would like the addresses passed to and from the 
migration helper to *not* be guest physical addresses (this is what I 
referred to as QEMU's ram_addr_t in other messages).  The reason is that 
some unmapped memory regions, such as virtio-mem hotplugged memory, 
would still have to be transferred and could be encrypted.  While the 
guest->host hypercall interface uses guest physical addresses to 
communicate which pages are encrypted, the host can do the 
GPA->ram_addr_t conversion and remember the encryption status of 
currently-unmapped regions.

This poses a problem, in that the guest needs to prepare the page tables 
for the migration helper and those need to use the migration helper's 
physical address space.

There's three possibilities for this:

1) the easy one: the bottom 4G of guest memory are mapped in the mirror 
VM 1:1.  The ram_addr_t-based addresses are shifted by either 4G or a 
huge value such as 2^42 (MAXPHYADDR - physical address reduction - 1). 
This even lets the migration helper reuse the OVMF runtime services 
memory map (but be careful about thread safety...).

2) the more future-proof one.  Here, the migration helper tells QEMU 
which area to copy from the guest to the mirror VM, as a (main GPA, 
length, mirror GPA) tuple.  This could happen for example the first time 
the guest writes 1 to MSR_KVM_MIGRATION_CONTROL.  When migration starts, 
QEMU uses this information to issue KVM_SET_USER_MEMORY_REGION 
accordingly.  The page tables are built for this (usually very high) 
mirror GPA and the migration helper operates in a completely separate 
address space.  However, the backing memory would still be shared 
between the main and mirror VMs.  I am saying this is more future proof 
because we have more flexibility in setting up the physical address 
space of the mirror VM.

3) the paranoid one, which I think is what you hint at above: this is an 
extension of (2), where userspace invokes the PSP send/receive API to 
copy the small requested area of the main VM into the mirror VM.  The 
mirror VM code and data are completely separate from the main VM.  All 
that the mirror VM shares is the ram_addr_t data.  Though I am not even 
sure it is possible to use the send/receive API this way...

What do you think?

Paolo
Tobin Feldman-Fitzthum Aug. 17, 2021, 8:50 p.m. UTC | #18
On 8/17/21 12:32 PM, Paolo Bonzini wrote:
> On 17/08/21 01:53, Steve Rutherford wrote:
>> Separately, I'm a little weary of leaving the migration helper mapped
>> into the shared address space as writable.
>
> A related question here is what the API should be for how the 
> migration helper sees the memory in both physical and virtual address.
>
> First of all, I would like the addresses passed to and from the 
> migration helper to *not* be guest physical addresses (this is what I 
> referred to as QEMU's ram_addr_t in other messages).  The reason is 
> that some unmapped memory regions, such as virtio-mem hotplugged 
> memory, would still have to be transferred and could be encrypted.  
> While the guest->host hypercall interface uses guest physical 
> addresses to communicate which pages are encrypted, the host can do 
> the GPA->ram_addr_t conversion and remember the encryption status of 
> currently-unmapped regions.
>
> This poses a problem, in that the guest needs to prepare the page 
> tables for the migration helper and those need to use the migration 
> helper's physical address space.
>
> There's three possibilities for this:
>
> 1) the easy one: the bottom 4G of guest memory are mapped in the 
> mirror VM 1:1.  The ram_addr_t-based addresses are shifted by either 
> 4G or a huge value such as 2^42 (MAXPHYADDR - physical address 
> reduction - 1). This even lets the migration helper reuse the OVMF 
> runtime services memory map (but be careful about thread safety...).

This is essentially what we do in our prototype, although we have an 
even simpler approach. We have a 1:1 mapping that maps an address to 
itself with the cbit set. During Migration QEMU asks the migration 
handler to import/export encrypted pages and provides the GPA for said 
page. Since the migration handler only exports/imports encrypted pages, 
we can have the cbit set for every page in our mapping. We can still use 
OVMF functions with these mappings because they are on encrypted pages. 
The MH does need to use a few shared pages (to communicate with QEMU, 
for instance), so we have another mapping without the cbit that is at a 
large offset.
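
(Purely illustrative: the "identity plus C-bit" mapping boils down to
page table entries of roughly this shape, shown here for 2MB pages; the
C-bit position comes from CPUID 0x8000001F:)

    #define PTE_P   (1ULL << 0)
    #define PTE_RW  (1ULL << 1)
    #define PTE_PS  (1ULL << 7)               /* 2MB page */

    static uint64_t mh_encrypted_pde(uint64_t gpa, unsigned int c_bit)
    {
        return (gpa & ~0x1fffffULL) | (1ULL << c_bit)
               | PTE_P | PTE_RW | PTE_PS;
    }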

I think this is basically equivalent to what you suggest. As you point 
out above, this approach does require that any page that will be 
exported/imported by the MH is mapped in the guest. Is this a bad 
assumption? The VMSA for SEV-ES is one example of a region that is 
encrypted but not mapped in the guest (the PSP handles it directly). We 
have been planning to map the VMSA into the guest to support migration 
with SEV-ES (along with other changes).

> 2) the more future-proof one.  Here, the migration helper tells QEMU 
> which area to copy from the guest to the mirror VM, as a (main GPA, 
> length, mirror GPA) tuple.  This could happen for example the first 
> time the guest writes 1 to MSR_KVM_MIGRATION_CONTROL.  When migration 
> starts, QEMU uses this information to issue KVM_SET_USER_MEMORY_REGION 
> accordingly.  The page tables are built for this (usually very high) 
> mirror GPA and the migration helper operates in a completely separate 
> address space.  However, the backing memory would still be shared 
> between the main and mirror VMs.  I am saying this is more future 
> proof because we have more flexibility in setting up the physical 
> address space of the mirror VM.

The Migration Handler in OVMF is not a contiguous region of memory. The 
MH uses OVMF helper functions that are allocated in various regions of 
runtime memory. I guess I can see how separating the memory of the MH 
and the guest OS could be positive. On the other hand, since the MH is 
in OVMF, it is fundamentally designed to coexist with the guest OS.

What do you envision in terms of future changes to the mirror address space?

> 3) the paranoid one, which I think is what you hint at above: this is 
> an extension of (2), where userspace invokes the PSP send/receive API 
> to copy the small requested area of the main VM into the mirror VM.  
> The mirror VM code and data are completely separate from the main VM.  
> All that the mirror VM shares is the ram_addr_t data. Though I am not 
> even sure it is possible to use the send/receive API this way...

Yeah not sure if you could use the PSP for this.

-Tobin

>
> What do you think?
>
> Paolo
>
>
Steve Rutherford Aug. 17, 2021, 9:54 p.m. UTC | #19
On Tue, Aug 17, 2021 at 9:32 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 17/08/21 01:53, Steve Rutherford wrote:
> > Separately, I'm a little weary of leaving the migration helper mapped
> > into the shared address space as writable.
>
> A related question here is what the API should be for how the migration
> helper sees the memory in both physical and virtual address.
>
> First of all, I would like the addresses passed to and from the
> migration helper to *not* be guest physical addresses (this is what I
> referred to as QEMU's ram_addr_t in other messages).  The reason is that
> some unmapped memory regions, such as virtio-mem hotplugged memory,
> would still have to be transferred and could be encrypted.  While the
> guest->host hypercall interface uses guest physical addresses to
> communicate which pages are encrypted, the host can do the
> GPA->ram_addr_t conversion and remember the encryption status of
> currently-unmapped regions.
>
> This poses a problem, in that the guest needs to prepare the page tables
> for the migration helper and those need to use the migration helper's
> physical address space.
>
> There's three possibilities for this:
>
> 1) the easy one: the bottom 4G of guest memory are mapped in the mirror
> VM 1:1.  The ram_addr_t-based addresses are shifted by either 4G or a
> huge value such as 2^42 (MAXPHYADDR - physical address reduction - 1).
> This even lets the migration helper reuse the OVMF runtime services
> memory map (but be careful about thread safety...).
If I understand what you are proposing, this would only work for
SEV/SEV-ES, since the RMP prevents these remapping games. This makes
me less enthusiastic about this (but I suspect that's why you call
this less future proof).
>
> 2) the more future-proof one.  Here, the migration helper tells QEMU
> which area to copy from the guest to the mirror VM, as a (main GPA,
> length, mirror GPA) tuple.  This could happen for example the first time
> the guest writes 1 to MSR_KVM_MIGRATION_CONTROL.  When migration starts,
> QEMU uses this information to issue KVM_SET_USER_MEMORY_REGION
> accordingly.  The page tables are built for this (usually very high)
> mirror GPA and the migration helper operates in a completely separate
> address space.  However, the backing memory would still be shared
> between the main and mirror VMs.  I am saying this is more future proof
> because we have more flexibility in setting up the physical address
> space of the mirror VM.
>
> 3) the paranoid one, which I think is what you hint at above: this is an
> extension of (2), where userspace invokes the PSP send/receive API to
> copy the small requested area of the main VM into the mirror VM.  The
> mirror VM code and data are completely separate from the main VM.  All
> that the mirror VM shares is the ram_addr_t data.  Though I am not even
> sure it is possible to use the send/receive API this way...
More so, what I was hinting at was treating the MH's code and data the
way firmware is treated, i.e. initializing it via LAUNCH_UPDATE_DATA.
Getting the guest to trust host supplied code (i.e. firmware) needs to
happen regardless.

>
> What do you think?

My intuition for this leans more on the host, but matches some of the
bits you've mentioned in (2)/(3). My intuition would be to put the
migration helper incredibly high in gPA space, so that it does not
collide with the rest of the guest (and can then stay in the same
place for a fairly long period of time without needing to poke a hole
in the guest). Then you can leave the ram_addr_t-based addresses
mapped normally (without the offsetting). All this together allows the
migration helper to be orthogonal to the normal guest and normal
firmware.

In this case, since the migration helper has a somewhat stable base
address, you can have a prebaked entry point and page tables
(determined at build time). The shared communication pages can come
from neighboring high-memory. The migration helper can support a
straightforward halt loop (or PIO loop, or whatever) where it reads
from a predefined page to find what work needs to be done (perhaps
with that page depending on which CPU it is, so you can support
multithreading of the migration helper). Additionally, having it high
in memory makes it quite easy to assess who owns which addresses: high
mem is under the purview of the migration helper and does not need to
be dirty tracked. Only "low" memory can and needs to be encrypted for
transport to the target side.
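
A guest-side sketch of that halt/mailbox loop (the mailbox layout and
helper names are made up; how the host kicks the vCPU out of halt, e.g.
via mp_state or an injected event, is left out):

    struct mh_mailbox {
        volatile uint32_t cmd;          /* 0 = idle, written by the host */
        volatile uint64_t gpa;
        volatile uint64_t len;
    };

    static void mh_main(struct mh_mailbox *mb)
    {
        for (;;) {
            while (mb->cmd == 0) {
                __asm__ volatile("hlt");  /* surfaces as KVM_EXIT_HLT */
            }
            mh_handle_command(mb);        /* hypothetical: wrap/unwrap pages */
            mb->cmd = 0;
        }
    }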

--Steve
>
> Paolo
>
Steve Rutherford Aug. 17, 2021, 10:04 p.m. UTC | #20
On Tue, Aug 17, 2021 at 1:50 PM Tobin Feldman-Fitzthum
<tobin@linux.ibm.com> wrote:
>
>
> On 8/17/21 12:32 PM, Paolo Bonzini wrote:
> > There's three possibilities for this:
> >
> > 1) the easy one: the bottom 4G of guest memory are mapped in the
> > mirror VM 1:1.  The ram_addr_t-based addresses are shifted by either
> > 4G or a huge value such as 2^42 (MAXPHYADDR - physical address
> > reduction - 1). This even lets the migration helper reuse the OVMF
> > runtime services memory map (but be careful about thread safety...).
>
> This is essentially what we do in our prototype, although we have an
> even simpler approach. We have a 1:1 mapping that maps an address to
> itself with the cbit set. During Migration QEMU asks the migration
> handler to import/export encrypted pages and provides the GPA for said
> page. Since the migration handler only exports/imports encrypted pages,
> we can have the cbit set for every page in our mapping. We can still use
> OVMF functions with these mappings because they are on encrypted pages.
> The MH does need to use a few shared pages (to communicate with QEMU,
> for instance), so we have another mapping without the cbit that is at a
> large offset.
>
> I think this is basically equivalent to what you suggest. As you point
> out above, this approach does require that any page that will be
> exported/imported by the MH is mapped in the guest. Is this a bad
> assumption? The VMSA for SEV-ES is one example of a region that is
> encrypted but not mapped in the guest (the PSP handles it directly). We
> have been planning to map the VMSA into the guest to support migration
> with SEV-ES (along with other changes).

Ahh, it sounds like you are looking into sidestepping the existing
AMD-SP flows for migration. I assume the idea is to spin up a VM on
the target side, and have the two VMs attest to each other. How do the
two sides know if the other is legitimate? I take it that the source
is directing the LAUNCH flows?


--Steve
Paolo Bonzini Aug. 17, 2021, 10:37 p.m. UTC | #21
On Tue, Aug 17, 2021 at 11:54 PM Steve Rutherford
<srutherford@google.com> wrote:
> > 1) the easy one: the bottom 4G of guest memory are mapped in the mirror
> > VM 1:1.  The ram_addr_t-based addresses are shifted by either 4G or a
> > huge value such as 2^42 (MAXPHYADDR - physical address reduction - 1).
> > This even lets the migration helper reuse the OVMF runtime services
> > memory map (but be careful about thread safety...).
>
> If I understand what you are proposing, this would only work for
> SEV/SEV-ES, since the RMP prevents these remapping games. This makes
> me less enthusiastic about this (but I suspect that's why you call
> this less future proof).

I called it less future proof because it allows the migration helper
to rely more on OVMF details, but those may not apply in the future.

However you're right about SNP; the same page cannot be mapped twice
at different GPAs by a single ASID (which includes the VM and the
migration helper). :( That does throw a wrench in the idea of mapping
pages by ram_addr_t(*), and this applies to both schemes.

Migrating RAM in PCI BARs is a mess anyway for SNP, because PCI BARs
can be moved and every time they do the migration helper needs to wait
for validation to happen. :(

Paolo

(*) ram_addr_t is not a GPA; it is constant throughout the life of the
guest and independent of e.g. PCI BARs. Internally, when QEMU
retrieves the dirty page bitmap from KVM it stores the bits indexed by
ram_addr_t (shifted right by PAGE_SHIFT).

> > 2) the more future-proof one.  Here, the migration helper tells QEMU
> > which area to copy from the guest to the mirror VM, as a (main GPA,
> > length, mirror GPA) tuple.  This could happen for example the first time
> > the guest writes 1 to MSR_KVM_MIGRATION_CONTROL.  When migration starts,
> > QEMU uses this information to issue KVM_SET_USER_MEMORY_REGION
> > accordingly.  The page tables are built for this (usually very high)
> > mirror GPA and the migration helper operates in a completely separate
> > address space.  However, the backing memory would still be shared
> > between the main and mirror VMs.  I am saying this is more future proof
> > because we have more flexibility in setting up the physical address
> > space of the mirror VM.
>
> My intuition for this leans more on the host, but matches some of the
> bits you've mentioned in (2)/(3). My intuition would be to put the
> migration helper incredibly high in gPA space, so that it does not
> collide with the rest of the guest (and can then stay in the same
> place for a fairly long period of time without needing to poke a hole
> in the guest). Then you can leave the ram_addr_t-based addresses
> mapped normally (without the offsetting). All this together allows the
> migration helper to be orthogonal to the normal guest and normal
> firmware.
>
> In this case, since the migration helper has a somewhat stable base
> address, you can have a prebaked entry point and page tables
> (determined at build time). The shared communication pages can come
> from neighboring high-memory. The migration helper can support a
> straightforward halt loop (or PIO loop, or whatever) where it reads
> from a predefined page to find what work needs to be done (perhaps
> with that page depending on which CPU it is, so you can support
> multithreading of the migration helper). Additionally, having it high
> in memory makes it quite easy to assess who owns which addresses: high
> mem is under the purview of the migration helper and does not need to
> be dirty tracked. Only "low" memory can and needs to be encrypted for
> transport to the target side.
>
> --Steve
> >
> > Paolo
> >
>
James Bottomley Aug. 17, 2021, 10:57 p.m. UTC | #22
On Wed, 2021-08-18 at 00:37 +0200, Paolo Bonzini wrote:
> On Tue, Aug 17, 2021 at 11:54 PM Steve Rutherford
> <srutherford@google.com> wrote:
> > > 1) the easy one: the bottom 4G of guest memory are mapped in the
> > > mirror
> > > VM 1:1.  The ram_addr_t-based addresses are shifted by either 4G
> > > or a
> > > huge value such as 2^42 (MAXPHYADDR - physical address reduction
> > > - 1).
> > > This even lets the migration helper reuse the OVMF runtime
> > > services
> > > memory map (but be careful about thread safety...).
> > 
> > If I understand what you are proposing, this would only work for
> > SEV/SEV-ES, since the RMP prevents these remapping games. This
> > makes
> > me less enthusiastic about this (but I suspect that's why you call
> > this less future proof).
> 
> I called it less future proof because it allows the migration helper
> to rely more on OVMF details, but those may not apply in the future.
> 
> However you're right about SNP; the same page cannot be mapped twice
> at different GPAs by a single ASID (which includes the VM and the
> migration helper). :( That does throw a wrench in the idea of mapping
> pages by ram_addr_t(*), and this applies to both schemes.

Right, but in the current IBM approach, since we use the same mapping
for guest and mirror, we have the same GPA in both and it should work
with -SNP.

> Migrating RAM in PCI BARs is a mess anyway for SNP, because PCI BARs
> can be moved and every time they do the migration helper needs to
> wait for validation to happen. :(

Realistically, migration is becoming a royal pain, not just for
confidential computing, but for virtual functions in general.  I really
think we should look at S3 suspend, where we shut down the drivers and
then reattach on S3 resume as the potential pathway to getting
migration working both for virtual functions and this use case.

James
Steve Rutherford Aug. 17, 2021, 11:10 p.m. UTC | #23
On Tue, Aug 17, 2021 at 3:57 PM James Bottomley <jejb@linux.ibm.com> wrote:
> Realistically, migration is becoming a royal pain, not just for
> confidential computing, but for virtual functions in general.  I really
> think we should look at S3 suspend, where we shut down the drivers and
> then reattach on S3 resume as the potential pathway to getting
> migration working both for virtual functions and this use case.

This type of migration seems a little bit less "live", which makes me
concerned about its performance characteristics.

Steve
Paolo Bonzini Aug. 17, 2021, 11:20 p.m. UTC | #24
On Tue, Aug 17, 2021 at 10:51 PM Tobin Feldman-Fitzthum
<tobin@linux.ibm.com> wrote:
> This is essentially what we do in our prototype, although we have an
> even simpler approach. We have a 1:1 mapping that maps an address to
> itself with the cbit set. During Migration QEMU asks the migration
> handler to import/export encrypted pages and provides the GPA for said
> page. Since the migration handler only exports/imports encrypted pages,
> we can have the cbit set for every page in our mapping. We can still use
> OVMF functions with these mappings because they are on encrypted pages.
> The MH does need to use a few shared pages (to communicate with QEMU,
> for instance), so we have another mapping without the cbit that is at a
> large offset.
>
> I think this is basically equivalent to what you suggest. As you point
> out above, this approach does require that any page that will be
> exported/imported by the MH is mapped in the guest. Is this a bad
> assumption?

It should work well enough in the common case; and with SNP it looks
like it is a necessary assumption anyway. *shrug*

It would be a bit ugly because QEMU has to constantly convert
ram_addr_t's to guest physical addresses and back. But that's not a
performance problem.

The only important bit is that the encryption status bitmap be indexed
by ram_addr_t. This lets QEMU detect the problem of a ram_addr_t that
is marked encrypted but is not currently mapped, and abort migration.

> The Migration Handler in OVMF is not a contiguous region of memory. The
> MH uses OVMF helper functions that are allocated in various regions of
> runtime memory. I guess I can see how separating the memory of the MH
> and the guest OS could be positive. On the other hand, since the MH is
> in OVMF, it is fundamentally designed to coexist with the guest OS.

IIRC runtime services are not SMP-safe, so the migration helper cannot
coexist with the guest OS without extra care. I checked quickly and
CryptoPkg/Library/BaseCryptLib/SysCall/RuntimeMemAllocation.c does not
use any lock, so it would be bad if both OS-invoked runtime services
and the MH invoked the CryptoPkg malloc at the same time.

Paolo
James Bottomley Aug. 18, 2021, 2:49 a.m. UTC | #25
On Tue, 2021-08-17 at 16:10 -0700, Steve Rutherford wrote:
> On Tue, Aug 17, 2021 at 3:57 PM James Bottomley <jejb@linux.ibm.com>
> wrote:
> > Realistically, migration is becoming a royal pain, not just for
> > confidential computing, but for virtual functions in general.  I
> > really think we should look at S3 suspend, where we shut down the
> > drivers and then reattach on S3 resume as the potential pathway to
> > getting migration working both for virtual functions and this use
> > case.
> 
> This type of migration seems a little bit less "live", which makes me
> concerned about its performance characteristics.

Well, there are too many scenarios we just fail at migration today.  We
need help from the guest to quiesce or shut down the interior devices,
and S3 suspend seems to be the machine signal for that.  I think in
most clouds guests would accept some loss of "liveness" for a gain in
reliability as long as we keep them within the SLA ... which is 5
minutes a year for 5 nines.  Most failed migrations also instantly fail
SLAs because of the recovery times involved so I don't see what's to be
achieved by keeping the current "we can migrate sometimes" approach.

James
Kalra, Ashish Aug. 18, 2021, 10:31 a.m. UTC | #26
Hello Paolo,

On Mon, Aug 16, 2021 at 05:38:55PM +0200, Paolo Bonzini wrote:
> On 16/08/21 17:13, Ashish Kalra wrote:
> > > > I think that once the mirror VM starts booting and running the UEFI
> > > > code, it might be only during the PEI or DXE phase where it will
> > > > start actually running the MH code, so mirror VM probably still need
> > > > to handles KVM_EXIT_IO when SEC phase does I/O, I can see PIC
> > > > accesses and Debug Agent initialization stuff in SEC startup code.
> > > That may be a design of the migration helper code that you were working
> > > with, but it's not necessary.
> > > 
> > Actually my comments are about a more generic MH code.
> 
> I don't think that would be a good idea; designing QEMU's migration helper
> interface to be as constrained as possible is a good thing.  The migration
> helper is extremely security sensitive code, so it should not expose itself
> to the attack surface of the whole of QEMU.
> 
> 
One question I have here is: where exactly will the MH code exist in
QEMU?

I assume it will be x86 platform-specific code only; we probably will
never support it on other platforms?

So it will probably exist in hw/i386, something similar to the "microvm"
support and using the same TYPE_X86_MACHINE?

Also, if we are not going to use the existing KVM support code and are
adding some duplicate KVM interface code, do we need to interface with
this added KVM code via the QEMU accelerator framework, or simply invoke
this KVM code statically?

Thanks,
Ashish
James Bottomley Aug. 18, 2021, 11:25 a.m. UTC | #27
On Wed, 2021-08-18 at 10:31 +0000, Ashish Kalra wrote:
> Hello Paolo,
> 
> On Mon, Aug 16, 2021 at 05:38:55PM +0200, Paolo Bonzini wrote:
> > On 16/08/21 17:13, Ashish Kalra wrote:
> > > > > I think that once the mirror VM starts booting and running
> > > > > the UEFI code, it might be only during the PEI or DXE phase
> > > > > where it will start actually running the MH code, so mirror
> > > > > VM probably still need to handles KVM_EXIT_IO when SEC phase
> > > > > does I/O, I can see PIC accesses and Debug Agent
> > > > > initialization stuff in SEC startup code.
> > > > That may be a design of the migration helper code that you were
> > > > working with, but it's not necessary.
> > > > 
> > > Actually my comments are about a more generic MH code.
> > 
> > I don't think that would be a good idea; designing QEMU's migration
> > helper interface to be as constrained as possible is a good
> > thing.  The migration helper is extremely security sensitive code,
> > so it should not expose itself to the attack surface of the whole
> > of QEMU.

The attack surface of the MH in the guest is simply the API.  The API
needs to do two things:

   1. validate a correct endpoint and negotiate a wrapping key
   2. When requested by QEMU, wrap a section of guest encrypted memory
      with the wrapping key and return it.

The big security risk is in 1: if the MH can be tricked into
communicating with the wrong endpoint, it will leak the entire guest.
If we can lock that down, I don't see any particular security problem
with 2. So, provided we get the security properties of the API correct,
I think we won't have to worry overmuch about exposure of the API.
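
As a purely illustrative sketch of what 2. could boil down to (every
name below is made up for illustration; nothing in this series defines
such a structure), the shared request could be as small as a GPA/length
pair plus an output buffer that QEMU treats as untrusted:

    #include <stdint.h>

    /* Hypothetical layout of one wrap request shared between QEMU and
     * the migration helper. */
    struct mh_wrap_request {
        uint64_t gpa;            /* start of the guest-encrypted range */
        uint64_t len;            /* length in bytes, page aligned      */
        uint8_t  iv[16];         /* per-request IV chosen by the MH    */
        uint8_t  wrapped[4096];  /* ciphertext under the wrapping key  */
        uint32_t status;         /* 0 = ok, nonzero = refused / error  */
    };

QEMU would only ever write gpa/len and only ever read wrapped/status,
which keeps the exposed surface down to exactly the two operations
above.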

> One question i have here, is that where exactly will the MH code
> exist in QEMU ?
> 
> I assume it will be only x86 platform specific code, we probably will
> never support it on other platforms ?
> 
> So it will probably exist in hw/i386, something similar to "microvm"
> support and using the same TYPE_X86_MACHINE ?

I don't think it should be x86 only.  The migration handler receiver
should be completely CPU agnostic.  It's possible other CPUs will grow
an encrypted memory confidential computing capability (Power already
has one and ARM is "thinking" about it, but even if it doesn't, there's
a similar problem if you want to use trustzone isolation in VMs).  I
would envisage migration working substantially similarly on all of them
(need to ask an agent in the guest to wrap an encrypted page for
transport) so I think we should add this capability to the generic QEMU
migration code and let other architectures take advantage of it as they
grow the facility.

> Also if we are not going to use the existing KVM support code and
> adding some duplicate KVM interface code, do we need to interface
> with this added KVM code via the QEMU accelerator framework, or
> simply invoke this KVM code statically ?

I think we need to design the interface as cleanly as possible, so it
just depends what's easiest.  We certainly need some KVM support for
the mirror CPUs, I think, but it's not clear to me yet what the
simplest way to do the interface is.

James
Kalra, Ashish Aug. 18, 2021, 2:06 p.m. UTC | #28
On Wed, Aug 18, 2021 at 12:37:32AM +0200, Paolo Bonzini wrote:
> On Tue, Aug 17, 2021 at 11:54 PM Steve Rutherford
> <srutherford@google.com> wrote:
> > > 1) the easy one: the bottom 4G of guest memory are mapped in the mirror
> > > VM 1:1.  The ram_addr_t-based addresses are shifted by either 4G or a
> > > huge value such as 2^42 (MAXPHYADDR - physical address reduction - 1).
> > > This even lets the migration helper reuse the OVMF runtime services
> > > memory map (but be careful about thread safety...).
> >
> > If I understand what you are proposing, this would only work for
> > SEV/SEV-ES, since the RMP prevents these remapping games. This makes
> > me less enthusiastic about this (but I suspect that's why you call
> > this less future proof).
> 
> I called it less future proof because it allows the migration helper
> to rely more on OVMF details, but those may not apply in the future.
> 
> However you're right about SNP; the same page cannot be mapped twice
> at different GPAs by a single ASID (which includes the VM and the
> migration helper). :( That does throw a wrench in the idea of mapping
> pages by ram_addr_t(*), and this applies to both schemes.
> 
> Migrating RAM in PCI BARs is a mess anyway for SNP, because PCI BARs
> can be moved and every time they do the migration helper needs to wait
> for validation to happen. :(
> 
> Paolo
> 
> (*) ram_addr_t is not a GPA; it is constant throughout the life of the
> guest and independent of e.g. PCI BARs. Internally, when QEMU
> retrieves the dirty page bitmap from KVM it stores the bits indexed by
> ram_addr_t (shifted right by PAGE_SHIFT).

With reference to SNP here, the mirror VM model seems to have a nice
fit with SNP:

SNP will support the separate address spaces for main VM and mirror VMs
implicitly, with the MH/MA running in VMPL0. 

Additionally, vTOM can be used to separate mirror VM and main VM memory,
with private mirror VM memory below vTOM and all the shared stuff with
main VM setup above vTOM. 

The design here should probably base itself on this model to allow an
easy future port to SNP and also make it more future-proof.

Thanks,
Ashish

> > > 2) the more future-proof one.  Here, the migration helper tells QEMU
> > > which area to copy from the guest to the mirror VM, as a (main GPA,
> > > length, mirror GPA) tuple.  This could happen for example the first time
> > > the guest writes 1 to MSR_KVM_MIGRATION_CONTROL.  When migration starts,
> > > QEMU uses this information to issue KVM_SET_USER_MEMORY_REGION
> > > accordingly.  The page tables are built for this (usually very high)
> > > mirror GPA and the migration helper operates in a completely separate
> > > address space.  However, the backing memory would still be shared
> > > between the main and mirror VMs.  I am saying this is more future proof
> > > because we have more flexibility in setting up the physical address
> > > space of the mirror VM.
> >
> > My intuition for this leans more on the host, but matches some of the
> > bits you've mentioned in (2)/(3). My intuition would be to put the
> > migration helper incredibly high in gPA space, so that it does not
> > collide with the rest of the guest (and can then stay in the same
> > place for a fairly long period of time without needing to poke a hole
> > in the guest). Then you can leave the ram_addr_t-based addresses
> > mapped normally (without the offsetting). All this together allows the
> > migration helper to be orthogonal to the normal guest and normal
> > firmware.
> >
> > In this case, since the migration helper has a somewhat stable base
> > address, you can have a prebaked entry point and page tables
> > (determined at build time). The shared communication pages can come
> > from neighboring high-memory. The migration helper can support a
> > straightforward halt loop (or PIO loop, or whatever) where it reads
> > from a predefined page to find what work needs to be done (perhaps
> > with that page depending on which CPU it is, so you can support
> > multithreading of the migration helper). Additionally, having it high
> > in memory makes it quite easy to assess who owns which addresses: high
> > mem is under the purview of the migration helper and does not need to
> > be dirty tracked. Only "low" memory can and needs to be encrypted for
> > transport to the target side.
> >
> > --Steve
> > >
> > > Paolo
> > >
> >
>
Dr. David Alan Gilbert Aug. 18, 2021, 3:31 p.m. UTC | #29
* James Bottomley (jejb@linux.ibm.com) wrote:
> On Wed, 2021-08-18 at 10:31 +0000, Ashish Kalra wrote:
> > Hello Paolo,
> > 
> > On Mon, Aug 16, 2021 at 05:38:55PM +0200, Paolo Bonzini wrote:
> > > On 16/08/21 17:13, Ashish Kalra wrote:
> > > > > > I think that once the mirror VM starts booting and running
> > > > > > the UEFI code, it might be only during the PEI or DXE phase
> > > > > > where it will start actually running the MH code, so mirror
> > > > > > VM probably still need to handles KVM_EXIT_IO when SEC phase
> > > > > > does I/O, I can see PIC accesses and Debug Agent
> > > > > > initialization stuff in SEC startup code.
> > > > > That may be a design of the migration helper code that you were
> > > > > working with, but it's not necessary.
> > > > > 
> > > > Actually my comments are about a more generic MH code.
> > > 
> > > I don't think that would be a good idea; designing QEMU's migration
> > > helper interface to be as constrained as possible is a good
> > > thing.  The migration helper is extremely security sensitive code,
> > > so it should not expose itself to the attack surface of the whole
> > > of QEMU.
> 
> The attack surface of the MH in the guest is simply the API.  The API
> needs to do two things:
> 
>    1. validate a correct endpoint and negotiate a wrapping key
>    2. When requested by QEMU, wrap a section of guest encrypted memory
>       with the wrapping key and return it.
> 
> The big security risk is in 1. if the MH can be tricked into
> communicating with the wrong endpoint it will leak the entire guest. 
> If we can lock that down, I don't see any particular security problem
> with 2. So, provided we get the security properties of the API correct,
> I think we won't have to worry over much about exposure of the API.

Well, we'd have to make sure it only does stuff on behalf of qemu; if
the guest can ever write to MH's memory it could do something that the
guest shouldn't be able to.

Dave

> > One question i have here, is that where exactly will the MH code
> > exist in QEMU ?
> > 
> > I assume it will be only x86 platform specific code, we probably will
> > never support it on other platforms ?
> > 
> > So it will probably exist in hw/i386, something similar to "microvm"
> > support and using the same TYPE_X86_MACHINE ?
> 
> I don't think it should be x86 only.  The migration handler receiver
> should be completely CPU agnostic.  It's possible other CPUs will grow
> an encrypted memory confidential computing capability (Power already
> has one and ARM is "thinking" about it, but even if it doesn't, there's
> a similar problem if you want to use trustzone isolation in VMs).  I
> would envisage migration working substantially similarly on all of them
> (need to ask an agent in the guest to wrap an encrypted page for
> transport) so I think we should add this capability to the generic QEMU
> migration code and let other architectures take advantage of it as they
> grow the facility.
> 
> > Also if we are not going to use the existing KVM support code and
> > adding some duplicate KVM interface code, do we need to interface
> > with this added KVM code via the QEMU accelerator framework, or
> > simply invoke this KVM code statically ?
> 
> I think we need to design the interface as cleanly as possible, so it
> just depends what's easiest.  We certainly need some KVM support for
> the mirror CPUs, I think but it's not clear to me yet what the simplest
> way to do the interface is.
> 
> James
> 
>
Tobin Feldman-Fitzthum Aug. 18, 2021, 3:32 p.m. UTC | #30
On 8/17/21 6:04 PM, Steve Rutherford wrote:
> On Tue, Aug 17, 2021 at 1:50 PM Tobin Feldman-Fitzthum
> <tobin@linux.ibm.com> wrote:
>> This is essentially what we do in our prototype, although we have an
>> even simpler approach. We have a 1:1 mapping that maps an address to
>> itself with the cbit set. During Migration QEMU asks the migration
>> handler to import/export encrypted pages and provides the GPA for said
>> page. Since the migration handler only exports/imports encrypted pages,
>> we can have the cbit set for every page in our mapping. We can still use
>> OVMF functions with these mappings because they are on encrypted pages.
>> The MH does need to use a few shared pages (to communicate with QEMU,
>> for instance), so we have another mapping without the cbit that is at a
>> large offset.
>>
>> I think this is basically equivalent to what you suggest. As you point
>> out above, this approach does require that any page that will be
>> exported/imported by the MH is mapped in the guest. Is this a bad
>> assumption? The VMSA for SEV-ES is one example of a region that is
>> encrypted but not mapped in the guest (the PSP handles it directly). We
>> have been planning to map the VMSA into the guest to support migration
>> with SEV-ES (along with other changes).
> Ahh, It sounds like you are looking into sidestepping the existing
> AMD-SP flows for migration. I assume the idea is to spin up a VM on
> the target side, and have the two VMs attest to each other. How do the
> two sides know if the other is legitimate? I take it that the source
> is directing the LAUNCH flows?

Yeah we don't use PSP migration flows at all. We don't need to send the 
MH code from the source to the target because the MH lives in firmware, 
which is common between the two. We start the target like a normal VM 
rather than waiting for an incoming migration. The plan is to treat the 
target like a normal VM for attestation as well. The guest owner will 
attest the target VM just like they would any other VM that is started 
on their behalf. Secret injection can be used to establish a shared key 
for the source and target.

-Tobin

>
> --Steve
>
James Bottomley Aug. 18, 2021, 3:35 p.m. UTC | #31
On Wed, 2021-08-18 at 16:31 +0100, Dr. David Alan Gilbert wrote:
> * James Bottomley (jejb@linux.ibm.com) wrote:
> > On Wed, 2021-08-18 at 10:31 +0000, Ashish Kalra wrote:
> > > Hello Paolo,
> > > 
> > > On Mon, Aug 16, 2021 at 05:38:55PM +0200, Paolo Bonzini wrote:
> > > > On 16/08/21 17:13, Ashish Kalra wrote:
> > > > > > > I think that once the mirror VM starts booting and
> > > > > > > running the UEFI code, it might be only during the PEI or
> > > > > > > DXE phase where it will start actually running the MH
> > > > > > > code, so mirror VM probably still need to handles
> > > > > > > KVM_EXIT_IO when SEC phase does I/O, I can see PIC
> > > > > > > accesses and Debug Agent initialization stuff in SEC
> > > > > > > startup code.
> > > > > > That may be a design of the migration helper code that you
> > > > > > were working with, but it's not necessary.
> > > > > > 
> > > > > Actually my comments are about a more generic MH code.
> > > > 
> > > > I don't think that would be a good idea; designing QEMU's
> > > > migration helper interface to be as constrained as possible is
> > > > a good thing.  The migration helper is extremely security
> > > > sensitive code, so it should not expose itself to the attack
> > > > surface of the whole of QEMU.
> > 
> > The attack surface of the MH in the guest is simply the API.  The
> > API needs to do two things:
> > 
> >    1. validate a correct endpoint and negotiate a wrapping key
> >    2. When requested by QEMU, wrap a section of guest encrypted
> > memory
> >       with the wrapping key and return it.
> > 
> > The big security risk is in 1. if the MH can be tricked into
> > communicating with the wrong endpoint it will leak the entire
> > guest.  If we can lock that down, I don't see any particular
> > security problem with 2. So, provided we get the security
> > properties of the API correct, I think we won't have to worry over
> > much about exposure of the API.
> 
> Well, we'd have to make sure it only does stuff on behalf of qemu; if
> the guest can ever write to MH's memory it could do something that
> the guest shouldn't be able to.

Given the lack of SMI, we can't guarantee that with plain SEV and -ES. 
Once we move to -SNP, we can use VMPLs to achieve this.

But realistically, given the above API, even if the guest is malicious,
what can it do?  I think all it can do is return bogus pages that cause
a crash on start after migration, which doesn't look like a huge risk to
the cloud to me (it's more a self-destructive act on behalf of the
guest).

James
Dr. David Alan Gilbert Aug. 18, 2021, 3:43 p.m. UTC | #32
* James Bottomley (jejb@linux.ibm.com) wrote:
> On Wed, 2021-08-18 at 16:31 +0100, Dr. David Alan Gilbert wrote:
> > * James Bottomley (jejb@linux.ibm.com) wrote:
> > > On Wed, 2021-08-18 at 10:31 +0000, Ashish Kalra wrote:
> > > > Hello Paolo,
> > > > 
> > > > On Mon, Aug 16, 2021 at 05:38:55PM +0200, Paolo Bonzini wrote:
> > > > > On 16/08/21 17:13, Ashish Kalra wrote:
> > > > > > > > I think that once the mirror VM starts booting and
> > > > > > > > running the UEFI code, it might be only during the PEI or
> > > > > > > > DXE phase where it will start actually running the MH
> > > > > > > > code, so mirror VM probably still need to handles
> > > > > > > > KVM_EXIT_IO when SEC phase does I/O, I can see PIC
> > > > > > > > accesses and Debug Agent initialization stuff in SEC
> > > > > > > > startup code.
> > > > > > > That may be a design of the migration helper code that you
> > > > > > > were working with, but it's not necessary.
> > > > > > > 
> > > > > > Actually my comments are about a more generic MH code.
> > > > > 
> > > > > I don't think that would be a good idea; designing QEMU's
> > > > > migration helper interface to be as constrained as possible is
> > > > > a good thing.  The migration helper is extremely security
> > > > > sensitive code, so it should not expose itself to the attack
> > > > > surface of the whole of QEMU.
> > > 
> > > The attack surface of the MH in the guest is simply the API.  The
> > > API needs to do two things:
> > > 
> > >    1. validate a correct endpoint and negotiate a wrapping key
> > >    2. When requested by QEMU, wrap a section of guest encrypted
> > > memory
> > >       with the wrapping key and return it.
> > > 
> > > The big security risk is in 1. if the MH can be tricked into
> > > communicating with the wrong endpoint it will leak the entire
> > > guest.  If we can lock that down, I don't see any particular
> > > security problem with 2. So, provided we get the security
> > > properties of the API correct, I think we won't have to worry over
> > > much about exposure of the API.
> > 
> > Well, we'd have to make sure it only does stuff on behalf of qemu; if
> > the guest can ever write to MH's memory it could do something that
> > the guest shouldn't be able to.
> 
> Given the lack of SMI, we can't guarantee that with plain SEV and -ES. 
> Once we move to -SNP, we can use VMPLs to achieve this.

Doesn't the MH have access to different slots and run on separate
vCPUs, so it's still got some separation?

> But realistically, given the above API, even if the guest is malicious,
> what can it do?  I think it's simply return bogus pages that cause a
> crash on start after migration, which doesn't look like a huge risk to
> the cloud to me (it's more a self destructive act on behalf of the
> guest).

I'm a bit worried about the data structures that are shared between the
migration code in qemu and the MH; the code in qemu is going to have to
be paranoid about not trusting anything coming from the MH.

Dave

> James
> 
>
James Bottomley Aug. 18, 2021, 4:28 p.m. UTC | #33
On Wed, 2021-08-18 at 16:43 +0100, Dr. David Alan Gilbert wrote:
> * James Bottomley (jejb@linux.ibm.com) wrote:
[...]
> > Given the lack of SMI, we can't guarantee that with plain SEV and
> > -ES. Once we move to -SNP, we can use VMPLs to achieve this.
> 
> Doesn't the MH have access to different slots and running on separate
> vCPUs; so it's still got some separation?

Remember that the OVMF code is provided by the host, but it's attested
to and run by the guest.  Once the guest takes control (i.e. after OVMF
boots the next thing), we can't guarantee that it won't overwrite the MH
code, so the host must treat the MH as untrusted.

> > But realistically, given the above API, even if the guest is
> > malicious, what can it do?  I think it's simply return bogus pages
> > that cause a crash on start after migration, which doesn't look
> > like a huge risk to the cloud to me (it's more a self destructive
> > act on behalf of the guest).
> 
> I'm a bit worried about the data structures that are shared between
> the migration code in qemu and the MH; the code in qemu is going to
> have to be paranoid about not trusting anything coming from the MH.

Given that we have to treat the host MH structure as untrusted, this is
definitely something we have to do.  The primary API is simply "here's a
buffer, please fill it", so there's not much checking to do; we just
have to be careful that we don't expose any more of the buffer than the
guest needs to write to ... and, obviously, clean it before exposing it
to the guest.
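
A sketch of that hygiene on the QEMU side (mh_request_wrap() is a
made-up placeholder for however the request actually gets kicked off;
none of this is defined by the series):

    #include <string.h>
    #include <errno.h>

    int mh_request_wrap(void *buf, size_t len);   /* hypothetical */

    /* Scrub the shared buffer around every round trip and copy out no
     * more than the length that was requested. */
    static int fetch_wrapped_page(void *shared_buf, size_t buf_size,
                                  void *out, size_t want)
    {
        if (want > buf_size) {
            return -EINVAL;
        }
        memset(shared_buf, 0, buf_size);   /* don't expose stale host data */
        if (mh_request_wrap(shared_buf, want) < 0) {
            return -EIO;
        }
        memcpy(out, shared_buf, want);     /* take only what was asked for */
        memset(shared_buf, 0, buf_size);   /* clean before the next use    */
        return 0;
    }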

James
Kalra, Ashish Aug. 18, 2021, 5:07 p.m. UTC | #34
On Wed, Aug 18, 2021 at 02:06:25PM +0000, Ashish Kalra wrote:
> On Wed, Aug 18, 2021 at 12:37:32AM +0200, Paolo Bonzini wrote:
> > On Tue, Aug 17, 2021 at 11:54 PM Steve Rutherford
> > <srutherford@google.com> wrote:
> > > > 1) the easy one: the bottom 4G of guest memory are mapped in the mirror
> > > > VM 1:1.  The ram_addr_t-based addresses are shifted by either 4G or a
> > > > huge value such as 2^42 (MAXPHYADDR - physical address reduction - 1).
> > > > This even lets the migration helper reuse the OVMF runtime services
> > > > memory map (but be careful about thread safety...).
> > >
> > > If I understand what you are proposing, this would only work for
> > > SEV/SEV-ES, since the RMP prevents these remapping games. This makes
> > > me less enthusiastic about this (but I suspect that's why you call
> > > this less future proof).
> > 
> > I called it less future proof because it allows the migration helper
> > to rely more on OVMF details, but those may not apply in the future.
> > 
> > However you're right about SNP; the same page cannot be mapped twice
> > at different GPAs by a single ASID (which includes the VM and the
> > migration helper). :( That does throw a wrench in the idea of mapping
> > pages by ram_addr_t(*), and this applies to both schemes.
> > 
> > Migrating RAM in PCI BARs is a mess anyway for SNP, because PCI BARs
> > can be moved and every time they do the migration helper needs to wait
> > for validation to happen. :(
> > 
> > Paolo
> > 
> > (*) ram_addr_t is not a GPA; it is constant throughout the life of the
> > guest and independent of e.g. PCI BARs. Internally, when QEMU
> > retrieves the dirty page bitmap from KVM it stores the bits indexed by
> > ram_addr_t (shifted right by PAGE_SHIFT).
> 
> With reference to SNP here, the mirror VM model seems to have a nice
> fit with SNP:
> 
> SNP will support the separate address spaces for main VM and mirror VMs
> implicitly, with the MH/MA running in VMPL0. 
> 

I need to correct this statement: there is no separate address space as
such; there is basically page-level permission/protection between VMPL
levels.

> Additionally, vTOM can be used to separate mirror VM and main VM memory,
> with private mirror VM memory below vTOM and all the shared stuff with
> main VM setup above vTOM. 
> 

I need to take back the above statement: memory above vTOM is basically
decrypted memory.

Thanks,
Ashish

> The design here should probably base itself on this model to probably
> allow an easy future port to SNP and also make it more futurer-proof.

> > > > 2) the more future-proof one.  Here, the migration helper tells QEMU
> > > > which area to copy from the guest to the mirror VM, as a (main GPA,
> > > > length, mirror GPA) tuple.  This could happen for example the first time
> > > > the guest writes 1 to MSR_KVM_MIGRATION_CONTROL.  When migration starts,
> > > > QEMU uses this information to issue KVM_SET_USER_MEMORY_REGION
> > > > accordingly.  The page tables are built for this (usually very high)
> > > > mirror GPA and the migration helper operates in a completely separate
> > > > address space.  However, the backing memory would still be shared
> > > > between the main and mirror VMs.  I am saying this is more future proof
> > > > because we have more flexibility in setting up the physical address
> > > > space of the mirror VM.
> > >
> > > My intuition for this leans more on the host, but matches some of the
> > > bits you've mentioned in (2)/(3). My intuition would be to put the
> > > migration helper incredibly high in gPA space, so that it does not
> > > collide with the rest of the guest (and can then stay in the same
> > > place for a fairly long period of time without needing to poke a hole
> > > in the guest). Then you can leave the ram_addr_t-based addresses
> > > mapped normally (without the offsetting). All this together allows the
> > > migration helper to be orthogonal to the normal guest and normal
> > > firmware.
> > >
> > > In this case, since the migration helper has a somewhat stable base
> > > address, you can have a prebaked entry point and page tables
> > > (determined at build time). The shared communication pages can come
> > > from neighboring high-memory. The migration helper can support a
> > > straightforward halt loop (or PIO loop, or whatever) where it reads
> > > from a predefined page to find what work needs to be done (perhaps
> > > with that page depending on which CPU it is, so you can support
> > > multithreading of the migration helper). Additionally, having it high
> > > in memory makes it quite easy to assess who owns which addresses: high
> > > mem is under the purview of the migration helper and does not need to
> > > be dirty tracked. Only "low" memory can and needs to be encrypted for
> > > transport to the target side.
> > >
> > > --Steve
> > > >
> > > > Paolo
> > > >
> > >
> >
Dr. David Alan Gilbert Aug. 18, 2021, 5:30 p.m. UTC | #35
* James Bottomley (jejb@linux.ibm.com) wrote:
> On Wed, 2021-08-18 at 16:43 +0100, Dr. David Alan Gilbert wrote:
> > * James Bottomley (jejb@linux.ibm.com) wrote:
> [...]
> > > Given the lack of SMI, we can't guarantee that with plain SEV and
> > > -ES. Once we move to -SNP, we can use VMPLs to achieve this.
> > 
> > Doesn't the MH have access to different slots and running on separate
> > vCPUs; so it's still got some separation?
> 
> Remember that the OVMF code is provided by the host, but its attested
> to and run by the guest.  Once the guest takes control (i.e. after OVMF
> boots the next thing), we can't guarantee that it wont overwrite the MH
> code, so the host must treat the MH as untrusted.

Yeh; if it's in a ROM image I guess we could write-protect it?
(Not that I'd trust it still)

> > > But realistically, given the above API, even if the guest is
> > > malicious, what can it do?  I think it's simply return bogus pages
> > > that cause a crash on start after migration, which doesn't look
> > > like a huge risk to the cloud to me (it's more a self destructive
> > > act on behalf of the guest).
> > 
> > I'm a bit worried about the data structures that are shared between
> > the migration code in qemu and the MH; the code in qemu is going to
> > have to be paranoid about not trusting anything coming from the MH.
> 
> Given that we have to treat the host MH structure as untrusted, this is
> definitely something we have to do.  Although the primary API is simply
> "here's a buffer, please fill it", so there's not much checking to do,
> we just have to be careful that we don't expose any more of the buffer
> than the guest needs to write to ... and, obviously, clean it before
> exposing it to the guest.

I was assuming life got a bit more complicated than that: we'd have to
have a list of pages we were requesting and a list of pages that were
cooked, with the qemu thread and the helper thread all working in
parallel.  So I'm guessing there is some list or bookkeeping that we
need to be very careful of.
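
For illustration only, the kind of shared bookkeeping being described
might look like the ring below (all names are hypothetical; the point is
that QEMU re-validates every field the helper writes):

    #include <stdint.h>
    #include <errno.h>

    #define MH_RING_SLOTS 64
    #define MH_PAGE_SIZE  4096

    /* One slot in a shared request ring: QEMU fills gpa and sets state
     * to "requested"; the helper fills len and flips state to "cooked". */
    struct mh_slot {
        uint64_t gpa;
        uint32_t state;   /* 0 = free, 1 = requested, 2 = cooked     */
        uint32_t len;     /* helper-written, must be <= MH_PAGE_SIZE */
    };

    static int qemu_take_cooked(struct mh_slot *ring, unsigned int idx)
    {
        if (idx >= MH_RING_SLOTS) {
            return -EINVAL;               /* index came from the helper  */
        }
        if (ring[idx].state != 2 || ring[idx].len > MH_PAGE_SIZE) {
            return -EINVAL;               /* don't trust helper fields   */
        }
        /* ... copy the wrapped payload out of shared memory here ... */
        ring[idx].state = 0;              /* hand the slot back          */
        return 0;
    }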

Dave

> James
> 
>
James Bottomley Aug. 18, 2021, 6:51 p.m. UTC | #36
On Wed, 2021-08-18 at 18:30 +0100, Dr. David Alan Gilbert wrote:
> * James Bottomley (jejb@linux.ibm.com) wrote:
> > On Wed, 2021-08-18 at 16:43 +0100, Dr. David Alan Gilbert wrote:
> > > * James Bottomley (jejb@linux.ibm.com) wrote:
> > [...]
> > > > Given the lack of SMI, we can't guarantee that with plain SEV
> > > > and -ES. Once we move to -SNP, we can use VMPLs to achieve
> > > > this.
> > > 
> > > Doesn't the MH have access to different slots and running on
> > > separate vCPUs; so it's still got some separation?
> > 
> > Remember that the OVMF code is provided by the host, but its
> > attested to and run by the guest.  Once the guest takes control
> > (i.e. after OVMF boots the next thing), we can't guarantee that it
> > wont overwrite the MH code, so the host must treat the MH as
> > untrusted.
> 
> Yeh; if it's in a romimage I guess we could write protect it?
> (Not that I'd trust it still)

Yes, but unfortunately OVMF (and edk2 in general) has another pitfall
for you: the initial pflash may be a read-only ROM image, but it
uncompresses itself to low RAM and executes itself out of there.
Anything in either PEI or DXE (which is where the migration handler
lies) is RAM-based after decompression.

> > > > But realistically, given the above API, even if the guest is
> > > > malicious, what can it do?  I think it's simply return bogus
> > > > pages that cause a crash on start after migration, which
> > > > doesn't look like a huge risk to the cloud to me (it's more a
> > > > self destructive act on behalf of the guest).
> > > 
> > > I'm a bit worried about the data structures that are shared
> > > between the migration code in qemu and the MH; the code in qemu
> > > is going to have to be paranoid about not trusting anything
> > > coming from the MH.
> > 
> > Given that we have to treat the host MH structure as untrusted,
> > this is definitely something we have to do.  Although the primary
> > API is simply "here's a buffer, please fill it", so there's not
> > much checking to do, we just have to be careful that we don't
> > expose any more of the buffer than the guest needs to write to ...
> > and, obviously, clean it before exposing it to the guest.
> 
> I was assuming life got a bit more complicated than that; and we had
> to have lists of pages we were requesting, and a list of pages that
> were cooked and the qemu thread and the helper thread all had to work
> in parallel.  So I'm guessing some list or bookkeeeping that we need
> to be very careful of.

I was more or less imagining a GPA and length, so range-based, but it
could be we need something more sophisticated ... Tobin will look after
that part.  However, either way, we just need to be careful.

Regards,

James
Dr. David Alan Gilbert Aug. 18, 2021, 7:04 p.m. UTC | #37
* Tobin Feldman-Fitzthum (tobin@linux.ibm.com) wrote:
> On 8/17/21 6:04 PM, Steve Rutherford wrote:
> > On Tue, Aug 17, 2021 at 1:50 PM Tobin Feldman-Fitzthum
> > <tobin@linux.ibm.com> wrote:
> > > This is essentially what we do in our prototype, although we have an
> > > even simpler approach. We have a 1:1 mapping that maps an address to
> > > itself with the cbit set. During Migration QEMU asks the migration
> > > handler to import/export encrypted pages and provides the GPA for said
> > > page. Since the migration handler only exports/imports encrypted pages,
> > > we can have the cbit set for every page in our mapping. We can still use
> > > OVMF functions with these mappings because they are on encrypted pages.
> > > The MH does need to use a few shared pages (to communicate with QEMU,
> > > for instance), so we have another mapping without the cbit that is at a
> > > large offset.
> > > 
> > > I think this is basically equivalent to what you suggest. As you point
> > > out above, this approach does require that any page that will be
> > > exported/imported by the MH is mapped in the guest. Is this a bad
> > > assumption? The VMSA for SEV-ES is one example of a region that is
> > > encrypted but not mapped in the guest (the PSP handles it directly). We
> > > have been planning to map the VMSA into the guest to support migration
> > > with SEV-ES (along with other changes).
> > Ahh, It sounds like you are looking into sidestepping the existing
> > AMD-SP flows for migration. I assume the idea is to spin up a VM on
> > the target side, and have the two VMs attest to each other. How do the
> > two sides know if the other is legitimate? I take it that the source
> > is directing the LAUNCH flows?
> 
> Yeah we don't use PSP migration flows at all. We don't need to send the MH
> code from the source to the target because the MH lives in firmware, which
> is common between the two.

Are you relying on the target firmware to be *identical* or purely for
it to be *compatible* ?  It's normal for a migration to be the result of
wanting to do an upgrade; and that means the destination build of OVMF
might be newer (or older, or ...).

Dave


> We start the target like a normal VM rather than
> waiting for an incoming migration. The plan is to treat the target like a
> normal VM for attestation as well. The guest owner will attest the target VM
> just like they would any other VM that is started on their behalf. Secret
> injection can be used to establish a shared key for the source and target.
> 
> -Tobin
> 
> > 
> > --Steve
> > 
>
Tobin Feldman-Fitzthum Aug. 18, 2021, 9:42 p.m. UTC | #38
On 8/18/21 3:04 PM, Dr. David Alan Gilbert wrote:
> * Tobin Feldman-Fitzthum (tobin@linux.ibm.com) wrote:
>> On 8/17/21 6:04 PM, Steve Rutherford wrote:
>>> Ahh, It sounds like you are looking into sidestepping the existing
>>> AMD-SP flows for migration. I assume the idea is to spin up a VM on
>>> the target side, and have the two VMs attest to each other. How do the
>>> two sides know if the other is legitimate? I take it that the source
>>> is directing the LAUNCH flows?
>> Yeah we don't use PSP migration flows at all. We don't need to send the MH
>> code from the source to the target because the MH lives in firmware, which
>> is common between the two.
> Are you relying on the target firmware to be *identical* or purely for
> it to be *compatible* ?  It's normal for a migration to be the result of
> wanting to do an upgrade; and that means the destination build of OVMF
> might be newer (or older, or ...).
>
> Dave

This is a good point. The migration handler on the source and target 
must have the same memory footprint or bad things will happen. Using the 
same firmware on the source and target is an easy way to guarantee this. 
Since the MH in OVMF is not a contiguous region of memory, but a group 
of functions scattered around OVMF, it is a bit difficult to guarantee 
that the memory footprint is the same if the build is different.

-Tobin

>
>
>> We start the target like a normal VM rather than
>> waiting for an incoming migration. The plan is to treat the target like a
>> normal VM for attestation as well. The guest owner will attest the target VM
>> just like they would any other VM that is started on their behalf. Secret
>> injection can be used to establish a shared key for the source and target.
>>
>> -Tobin
>>
>>> --Steve
>>>
Dr. David Alan Gilbert Aug. 19, 2021, 8:22 a.m. UTC | #39
* Tobin Feldman-Fitzthum (tobin@linux.ibm.com) wrote:
> On 8/18/21 3:04 PM, Dr. David Alan Gilbert wrote:
> > * Tobin Feldman-Fitzthum (tobin@linux.ibm.com) wrote:
> > > On 8/17/21 6:04 PM, Steve Rutherford wrote:
> > > > Ahh, It sounds like you are looking into sidestepping the existing
> > > > AMD-SP flows for migration. I assume the idea is to spin up a VM on
> > > > the target side, and have the two VMs attest to each other. How do the
> > > > two sides know if the other is legitimate? I take it that the source
> > > > is directing the LAUNCH flows?
> > > Yeah we don't use PSP migration flows at all. We don't need to send the MH
> > > code from the source to the target because the MH lives in firmware, which
> > > is common between the two.
> > Are you relying on the target firmware to be *identical* or purely for
> > it to be *compatible* ?  It's normal for a migration to be the result of
> > wanting to do an upgrade; and that means the destination build of OVMF
> > might be newer (or older, or ...).
> > 
> > Dave
> 
> This is a good point. The migration handler on the source and target must
> have the same memory footprint or bad things will happen. Using the same
> firmware on the source and target is an easy way to guarantee this. Since
> the MH in OVMF is not a contiguous region of memory, but a group of
> functions scattered around OVMF, it is a bit difficult to guarantee that the
> memory footprint is the same if the build is different.

Can you explain what the 'memory footprint' consists of? Can't it just
be the whole of the OVMF ROM space if you have no way of nudging the MH
into its own chunk?

I think it really does have to cope with migration to a new version of
the host.

Dave

> -Tobin
> 
> > 
> > 
> > > We start the target like a normal VM rather than
> > > waiting for an incoming migration. The plan is to treat the target like a
> > > normal VM for attestation as well. The guest owner will attest the target VM
> > > just like they would any other VM that is started on their behalf. Secret
> > > injection can be used to establish a shared key for the source and target.
> > > 
> > > -Tobin
> > > 
> > > > --Steve
> > > > 
>
James Bottomley Aug. 19, 2021, 2:06 p.m. UTC | #40
On Thu, 2021-08-19 at 09:22 +0100, Dr. David Alan Gilbert wrote:
> * Tobin Feldman-Fitzthum (tobin@linux.ibm.com) wrote:
> > On 8/18/21 3:04 PM, Dr. David Alan Gilbert wrote:
> > > * Tobin Feldman-Fitzthum (tobin@linux.ibm.com) wrote:
> > > > On 8/17/21 6:04 PM, Steve Rutherford wrote:
> > > > > Ahh, It sounds like you are looking into sidestepping the
> > > > > existing AMD-SP flows for migration. I assume the idea is to
> > > > > spin up a VM on the target side, and have the two VMs attest
> > > > > to each other. How do the two sides know if the other is
> > > > > legitimate? I take it that the source is directing the LAUNCH
> > > > > flows?
> > > >  
> > > > Yeah we don't use PSP migration flows at all. We don't need to
> > > > send the MH code from the source to the target because the MH
> > > > lives in firmware, which is common between the two.
> > >  
> > > Are you relying on the target firmware to be *identical* or
> > > purely for it to be *compatible* ?  It's normal for a migration
> > > to be the result of wanting to do an upgrade; and that means the
> > > destination build of OVMF might be newer (or older, or ...).
> > > 
> > > Dave
> > 
> > This is a good point. The migration handler on the source and
> > target must have the same memory footprint or bad things will
> > happen. Using the same firmware on the source and target is an easy
> > way to guarantee this. Since the MH in OVMF is not a contiguous
> > region of memory, but a group of functions scattered around OVMF,
> > it is a bit difficult to guarantee that the memory footprint is the
> > same if the build is different.
> 
> Can you explain what the 'memory footprint' consists of? Can't it
> just be the whole of the OVMF rom space if you have no way of nudging
> the MH into it's own chunk?

It might be possible depending on how we link it. At the moment it's
using the core OVMF libraries, but it is possible to retool the OVMF
build to copy those libraries into the MH DXE.

> I think it really does have to cope with migration to a new version
> of host.

Well, you're thinking of OVMF as belonging to the host because of the
way it is supplied, but think about the way it works in practice now,
forgetting about confidential computing: OVMF is RAM-resident in
ordinary guests, so when you migrate them, the whole of OVMF (or at
least what's left at runtime) goes with the migration; thus it's not
possible to change the guest OVMF by migration.  The above is really
just an extension of that principle, the only difference for
confidential computing being that you have to have an image of the
current OVMF ROM in the target to seed migration.

Technically, the problem is that we can't overwrite running code, and
once the guest is re-sited to the target, the OVMF there has to match
exactly what was on the source for the runtime services to still
function.  Once the migration has run, the OVMF on the target must be
identical to what was on the source (including internally allocated
OVMF memory), and if we can't copy the MH code, we have to rely on the
target image providing this identical code and we copy the rest.

James
Tobin Feldman-Fitzthum Aug. 19, 2021, 2:07 p.m. UTC | #41
On 8/19/21 4:22 AM, Dr. David Alan Gilbert wrote:
> * Tobin Feldman-Fitzthum (tobin@linux.ibm.com) wrote:
>> On 8/18/21 3:04 PM, Dr. David Alan Gilbert wrote:
>>
>>> Are you relying on the target firmware to be *identical* or purely for
>>> it to be *compatible* ?  It's normal for a migration to be the result of
>>> wanting to do an upgrade; and that means the destination build of OVMF
>>> might be newer (or older, or ...).
>>>
>>> Dave
>> This is a good point. The migration handler on the source and target must
>> have the same memory footprint or bad things will happen. Using the same
>> firmware on the source and target is an easy way to guarantee this. Since
>> the MH in OVMF is not a contiguous region of memory, but a group of
>> functions scattered around OVMF, it is a bit difficult to guarantee that the
>> memory footprint is the same if the build is different.
> Can you explain what the 'memory footprint' consists of? Can't it just
> be the whole of the OVMF rom space if you have no way of nudging the MH
> into it's own chunk?

The footprint is not massive. It is mainly ConfidentialMigrationDxe and
the OVMF crypto support. It might be feasible to copy these components
to a fixed location that would be the same across firmware builds. It
might also be feasible to pin these components to certain addresses.
OVMF sort of supports doing this. We can raise the question in that
community.

It also might work to protect the entirety of OVMF as you suggest.
Currently we don't copy any of the OVMF ROM (as in flash0) over. That
said, the MH doesn't run from the ROM, so we would need to protect the
memory used by OVMF as well. In some ways it might seem easier to
protect all of the OVMF memory rather than just a couple of packages,
but there are some complexities. For one thing, we would only want to
protect EFI runtime memory, as boot memory may be in use by the OS and
would need to be migrated. The MH could check whether each page is EFI
runtime memory and skip any pages that are. Runtime memory won't be a
contiguous blob, however, so for this approach the layout of the
runtime memory would need to be the same on the source and target.
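
As a sketch of that per-page check (assuming a copy of the UEFI memory
map is saved while boot services are still available; SavedMap,
SavedMapSize and SavedDescSize are assumed globals, not part of the
current prototype):

    // Assumed globals, filled earlier from gBS->GetMemoryMap():
    //   EFI_MEMORY_DESCRIPTOR *SavedMap;
    //   UINTN                 SavedMapSize, SavedDescSize;

    STATIC
    BOOLEAN
    GpaIsRuntimeMemory (
      IN EFI_PHYSICAL_ADDRESS  Gpa
      )
    {
      EFI_MEMORY_DESCRIPTOR  *Desc;
      UINTN                  Offset;

      for (Offset = 0; Offset < SavedMapSize; Offset += SavedDescSize) {
        Desc = (EFI_MEMORY_DESCRIPTOR *)((UINT8 *)SavedMap + Offset);
        if ((Gpa >= Desc->PhysicalStart) &&
            (Gpa < Desc->PhysicalStart +
                     EFI_PAGES_TO_SIZE ((UINTN)Desc->NumberOfPages))) {
          return (BOOLEAN)(Desc->Type == EfiRuntimeServicesCode ||
                           Desc->Type == EfiRuntimeServicesData ||
                           (Desc->Attribute & EFI_MEMORY_RUNTIME) != 0);
        }
      }
      return FALSE;
    }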

We can sidestep these issues entirely by using identical firmware 
images. That said, there are a number of strategies for developing 
compatible OVMF images and I definitely see the value of doing so.

-Tobin

>
> I think it really does have to cope with migration to a new version of
> host.
>
> Dave
>
>> -Tobin
>>
>>>
>>>> We start the target like a normal VM rather than
>>>> waiting for an incoming migration. The plan is to treat the target like a
>>>> normal VM for attestation as well. The guest owner will attest the target VM
>>>> just like they would any other VM that is started on their behalf. Secret
>>>> injection can be used to establish a shared key for the source and target.
>>>>
>>>> -Tobin
>>>>
>>>>> --Steve
>>>>>
Dr. David Alan Gilbert Aug. 19, 2021, 2:28 p.m. UTC | #42
* James Bottomley (jejb@linux.ibm.com) wrote:
> On Thu, 2021-08-19 at 09:22 +0100, Dr. David Alan Gilbert wrote:
> > * Tobin Feldman-Fitzthum (tobin@linux.ibm.com) wrote:
> > > On 8/18/21 3:04 PM, Dr. David Alan Gilbert wrote:
> > > > * Tobin Feldman-Fitzthum (tobin@linux.ibm.com) wrote:
> > > > > On 8/17/21 6:04 PM, Steve Rutherford wrote:
> > > > > > Ahh, It sounds like you are looking into sidestepping the
> > > > > > existing AMD-SP flows for migration. I assume the idea is to
> > > > > > spin up a VM on the target side, and have the two VMs attest
> > > > > > to each other. How do the two sides know if the other is
> > > > > > legitimate? I take it that the source is directing the LAUNCH
> > > > > > flows?
> > > > >  
> > > > > Yeah we don't use PSP migration flows at all. We don't need to
> > > > > send the MH code from the source to the target because the MH
> > > > > lives in firmware, which is common between the two.
> > > >  
> > > > Are you relying on the target firmware to be *identical* or
> > > > purely for it to be *compatible* ?  It's normal for a migration
> > > > to be the result of wanting to do an upgrade; and that means the
> > > > destination build of OVMF might be newer (or older, or ...).
> > > > 
> > > > Dave
> > > 
> > > This is a good point. The migration handler on the source and
> > > target must have the same memory footprint or bad things will
> > > happen. Using the same firmware on the source and target is an easy
> > > way to guarantee this. Since the MH in OVMF is not a contiguous
> > > region of memory, but a group of functions scattered around OVMF,
> > > it is a bit difficult to guarantee that the memory footprint is the
> > > same if the build is different.
> > 
> > Can you explain what the 'memory footprint' consists of? Can't it
> > just be the whole of the OVMF rom space if you have no way of nudging
> > the MH into it's own chunk?
> 
> It might be possible depending on how we link it. At the moment it's
> using the core OVMF libraries, but it is possible to retool the OVMF
> build to copy those libraries into the MH DXE.
> 
> > I think it really does have to cope with migration to a new version
> > of host.
> 
> Well, you're thinking of OVMF as belonging to the host because of the
> way it is supplied, but think about the way it works in practice now,
> forgetting about confidential computing: OVMF is RAM resident in
> ordinary guests, so when you migrate them, the whole of OVMF (or at
> least what's left at runtime) goes with the migration, thus it's not
> possible to change the guest OVMF by migration.  The above is really
> just an extension of that principle, the only difference for
> confidential computing being you have to have an image of the current
> OVMF ROM in the target to seed migration.
> 
> Technically, the problem is we can't overwrite running code and once
> the guest is re-sited to the target, the OVMF there has to match
> exactly what was on the source for the RT to still function.   Once the
> migration has run, the OVMF on the target must be identical to what was
> on the source (including internally allocated OVMF memory), and if we
> can't copy the MH code, we have to rely on the target image providing
> this identical code and we copy the rest.

I'm OK with the OVMF now being part of the guest image, and having to
exist on both; it's a bit delicate though unless we have a way to check
it (is there an attestation of the destination happening here?)

Dave

> James
> 
>
James Bottomley Aug. 19, 2021, 10:10 p.m. UTC | #43
On Thu, 2021-08-19 at 15:28 +0100, Dr. David Alan Gilbert wrote:
> * James Bottomley (jejb@linux.ibm.com) wrote:
> > On Thu, 2021-08-19 at 09:22 +0100, Dr. David Alan Gilbert wrote:
[...]
> > > I think it really does have to cope with migration to a new
> > > version of host.
> > 
> > Well, you're thinking of OVMF as belonging to the host because of
> > the way it is supplied, but think about the way it works in
> > practice now, forgetting about confidential computing: OVMF is RAM
> > resident in ordinary guests, so when you migrate them, the whole of
> > OVMF (or at least what's left at runtime) goes with the migration,
> > thus it's not possible to change the guest OVMF by migration.  The
> > above is really just an extension of that principle, the only
> > difference for confidential computing being you have to have an
> > image of the current OVMF ROM in the target to seed migration.
> > 
> > Technically, the problem is we can't overwrite running code and
> > once the guest is re-sited to the target, the OVMF there has to
> > match exactly what was on the source for the RT to still
> > function.   Once the migration has run, the OVMF on the target must
> > be identical to what was on the source (including internally
> > allocated OVMF memory), and if we can't copy the MH code, we have
> > to rely on the target image providing this identical code and we
> > copy the rest.
> 
> I'm OK with the OVMF now being part of the guest image, and having to
> exist on both; it's a bit delicate though unless we have a way to
> check it (is there an attest of the destination happening here?)

There will be in the final version.  The attestations of the source and
target, being the hash of the OVMF (with the registers in the -ES
case), should be the same (modulo any firmware updates to the PSP,
whose firmware version is also hashed) to guarantee the OVMF is the
same on both sides.  We'll definitely take an action to get QEMU to
verify this ... made a lot easier now we have signed attestations ...

James
Dr. David Alan Gilbert Aug. 23, 2021, 12:26 p.m. UTC | #44
* James Bottomley (jejb@linux.ibm.com) wrote:
> On Thu, 2021-08-19 at 15:28 +0100, Dr. David Alan Gilbert wrote:
> > * James Bottomley (jejb@linux.ibm.com) wrote:
> > > On Thu, 2021-08-19 at 09:22 +0100, Dr. David Alan Gilbert wrote:
> [...]
> > > > I think it really does have to cope with migration to a new
> > > > version of host.
> > > 
> > > Well, you're thinking of OVMF as belonging to the host because of
> > > the way it is supplied, but think about the way it works in
> > > practice now, forgetting about confidential computing: OVMF is RAM
> > > resident in ordinary guests, so when you migrate them, the whole of
> > > OVMF (or at least what's left at runtime) goes with the migration,
> > > thus it's not possible to change the guest OVMF by migration.  The
> > > above is really just an extension of that principle, the only
> > > difference for confidential computing being you have to have an
> > > image of the current OVMF ROM in the target to seed migration.
> > > 
> > > Technically, the problem is we can't overwrite running code and
> > > once the guest is re-sited to the target, the OVMF there has to
> > > match exactly what was on the source for the RT to still
> > > function.   Once the migration has run, the OVMF on the target must
> > > be identical to what was on the source (including internally
> > > allocated OVMF memory), and if we can't copy the MH code, we have
> > > to rely on the target image providing this identical code and we
> > > copy the rest.
> > 
> > I'm OK with the OVMF now being part of the guest image, and having to
> > exist on both; it's a bit delicate though unless we have a way to
> > check it (is there an attest of the destination happening here?)
> 
> There will be in the final version.  The attestations of the source and
> target, being the hash of the OVMF (with the registers in the -ES
> case), should be the same (modulo any firmware updates to the PSP,
> whose firmware version is also hashed) to guarantee the OVMF is the
> same on both sides.  We'll definitely take an action to get QEMU to
> verify this ... made a lot easier now we have signed attestations ...

Hmm; I'm not sure you're allowed to have QEMU verify that - we don't
trust it; you need to have either the firmware say it's OK to migrate
to the destination (using the existing PSP mechanism) or get the source
MH to verify a quote from the destination.

[Somewhere along the line, if you're not using the PSP, I think you
also need to check the guest policy to confirm it is allowed to
migrate].

Dave

> James
> 
>
Tobin Feldman-Fitzthum Aug. 23, 2021, 4:28 p.m. UTC | #45
On 8/23/21 8:26 AM, Dr. David Alan Gilbert wrote:

> * James Bottomley (jejb@linux.ibm.com) wrote:
>
>>> (is there an attest of the destination happening here?)
>> There will be in the final version.  The attestations of the source and
>> target, being the hash of the OVMF (with the registers in the -ES
>> case), should be the same (modulo any firmware updates to the PSP,
>> whose firmware version is also hashed) to guarantee the OVMF is the
>> same on both sides.  We'll definitely take an action to get QEMU to
>> verify this ... made a lot easier now we have signed attestations ...
> Hmm; I'm not sure you're allowed to have QEMU verify that - we don't
> trust it; you need to have either the firmware say it's OK to migrate
> to the destination (using the existing PSP mechanism) or get the source
> MH to verify a quote from the destination.

I think the check in QEMU would only be a convenience. The launch 
measurement of the target (verified by the guest owner) is what 
guarantees that the firmware, as well as the policy, of the target is 
what is expected. In PSP-assisted migration the source verifies the 
target, but our plan is to have the guest owner verify both the source 
and the target. The target will only be provisioned with the transport 
key if the measurement checks out. We will have some more details about 
this key agreement scheme soon.

> [Somewhere along the line, if you're not using the PSP, I think you also
> need to check the guest policy to check it is allowed to migrate].

Sources that aren't allowed to migrate won't be provisioned with the
transport key to encrypt pages. A non-migratable guest could also be
booted with OvmfPkg firmware, which does not contain the migration
handler.

-Tobin

> Dave
>
>> James
>>
>>