diff mbox series

[v2,2/4] mm: x86: Invoke hypercall when page encryption status is changed

Message ID ff68a73e0cdaf89e56add5c8b6e110df881fede1.1619193043.git.ashish.kalra@amd.com (mailing list archive)
State New, archived
Headers show
Series Add guest support for SEV live migration. | expand

Commit Message

Ashish Kalra April 23, 2021, 3:58 p.m. UTC
From: Brijesh Singh <brijesh.singh@amd.com>

Invoke a hypercall when a memory region is changed from encrypted ->
decrypted and vice versa. Hypervisor needs to know the page encryption
status during the guest migration.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Steve Rutherford <srutherford@google.com>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/paravirt.h       |  6 +++
 arch/x86/include/asm/paravirt_types.h |  2 +
 arch/x86/include/asm/set_memory.h     |  2 +
 arch/x86/kernel/paravirt.c            |  1 +
 arch/x86/mm/mem_encrypt.c             | 66 +++++++++++++++++++++++----
 arch/x86/mm/pat/set_memory.c          |  7 +++
 6 files changed, 75 insertions(+), 9 deletions(-)

Comments

Borislav Petkov May 12, 2021, 1:15 p.m. UTC | #1
On Fri, Apr 23, 2021 at 03:58:43PM +0000, Ashish Kalra wrote:
> +static inline void notify_page_enc_status_changed(unsigned long pfn,
> +						  int npages, bool enc)
> +{
> +	PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc);
> +}

Now the question is whether something like that is needed for TDX, and,
if so, could it be shared by both.

Sean?

> +void notify_addr_enc_status_changed(unsigned long vaddr, int npages,
> +				    bool enc)

Let that line stick out.

> +{
> +#ifdef CONFIG_PARAVIRT
> +	unsigned long sz = npages << PAGE_SHIFT;
> +	unsigned long vaddr_end = vaddr + sz;
> +
> +	while (vaddr < vaddr_end) {
> +		int psize, pmask, level;
> +		unsigned long pfn;
> +		pte_t *kpte;
> +
> +		kpte = lookup_address(vaddr, &level);
> +		if (!kpte || pte_none(*kpte))
> +			return;

What does this mean exactly? On the first failure to lookup the address,
you return? Why not continue so that you can notify about the remaining
pages in [vaddr - vaddr_end)?

Also, what does it mean for the current range if the lookup fails?
Innocuous situation or do you need to signal it with a WARN or so?

> +
> +		pfn = pg_level_to_pfn(level, kpte, NULL);
> +		if (!pfn)
> +			continue;

Same here: if it hits the default case, wouldn't it make sense to
WARN_ONCE or so to catch potential misuse? Or better yet, the WARN_ONCE
should be in pg_level_to_pfn().

> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 16f878c26667..45e65517405a 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -2012,6 +2012,13 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
>  	 */
>  	cpa_flush(&cpa, 0);
>  
> +	/*
> +	 * Notify hypervisor that a given memory range is mapped encrypted
> +	 * or decrypted. The hypervisor will use this information during the
> +	 * VM migration.
> +	 */
> +	notify_addr_enc_status_changed(addr, numpages, enc);

If you notify about a range then that function should be called

	notify_range_enc_status_changed

or so.
Sean Christopherson May 12, 2021, 3:51 p.m. UTC | #2
On Wed, May 12, 2021, Borislav Petkov wrote:
> On Fri, Apr 23, 2021 at 03:58:43PM +0000, Ashish Kalra wrote:
> > +static inline void notify_page_enc_status_changed(unsigned long pfn,
> > +						  int npages, bool enc)
> > +{
> > +	PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc);
> > +}
> 
> Now the question is whether something like that is needed for TDX, and,
> if so, could it be shared by both.

Yes, TDX needs this same hook, but "can't" reuse the hypercall verbatime.  Ditto
for SEV-SNP.  I wanted to squish everything into a single common hypercall, but
that didn't pan out.

The problem is that both TDX and SNP define their own versions of this so that
any guest kernel that complies with the TDX|SNP specification will run cleanly
on a hypervisor that also complies with the spec.  This KVM-specific hook doesn't
meet those requires because non-Linux guest support will be sketchy at best, and
non-KVM hypervisor support will be non-existent.

The best we can do, short of refusing to support TDX or SNP, is to make this
KVM-specific hypercall compatible with TDX and SNP so that the bulk of the
control logic is identical.  The mechanics of actually invoking the hypercall
will differ, but if done right, everything else should be reusable without
modification.

I had an in-depth analysis of this, but it was all off-list.  Pasted below. 

  TDX uses GPRs to communicate with the host, so it can tunnel "legacy" hypercalls
  from time zero.  SNP could technically do the same (with a revised GHCB spec),
  but it'd be butt ugly.  And of course trying to go that route for either TDX or
  SNP would run into the problem of having to coordinate the ABI for the "legacy"
  hypercall across all guests and hosts.  So yeah, trying to remove any of the
  three (KVM vs. SNP vs. TDX) interfaces is sadly just wishful thinking.

  That being said, I do think we can reuse the KVM specific hypercall for TDX and
  SNP.  Both will still need a {TDX,SNP}-specific GCH{I,B} protocol so that cross-
  vendor compatibility is guaranteed, but that shouldn't preclude a guest that is
  KVM enlightened from switching to the KVM specific hypercall once it can do so.
  More thoughts later on.

  > I guess a common structure could be used along the lines of what is in the
  > GHCB spec today, but that seems like overkill for SEV/SEV-ES, which will
  > only ever really do a single page range at a time (through
  > set_memory_encrypted() and set_memory_decrypted()). The reason for the
  > expanded form for SEV-SNP is that the OS can (proactively) adjust multiple
  > page ranges in advance. Will TDX need to do something similar?

  Yes, TDX needs the exact same thing.  All three (SEV, SNP, and TDX) have more or
  less the exact same hook in the guest (Linux, obviously) kernel.

  > If so, the only real common piece in KVM is a function to track what pages
  > are shared vs private, which would only require a simple interface.

  It's not just KVM, it's also the relevant code in the guest kernel(s) and other
  hypervisors.  And the MMU side of KVM will likely be able to share code, e.g. to
  act on the page size hint.

  > So for SEV/SEV-ES, a simpler hypercall interface to specify a single page
  > range is really all that is needed, but it must be common across
  > hypervisors. I think that was one Sean's points originally, we don't want
  > one hypercall for KVM, one for Hyper-V, one for VMware, one for Xen, etc.

  For the KVM defined interface (required for SEV/SEV-ES), I think it makes sense
  to make it a superset of the SNP and TDX protocols so that it _can_ be used in
  lieu of the SNP/TDX specific protocol.  I don't know for sure whether or not
  that will actually yield better code and/or performance, but it costs us almost
  nothing and at least gives us the option of further optimizing the Linux+KVM
  combination.

  It probably shouldn't be a strict superset, as in practice I don't think SNP
  approach of having individual entries when batching multiple pages will yield
  the best performance.  E.g. the vast majority (maybe all?) of conversions for a
  Linux guest will be physically contiguous and will have the same preferred page
  size, at which point there will be less overhead if the guest specifies a
  massive range as opposed to having to santize and fill a large buffer.

  TL;DR: I think the KVM hypercall should be something like this, so that it can
  be used for SNP and TDX, and possibly for other purposes, e.g. for paravirt
  performance enhancements or something.

    8. KVM_HC_MAP_GPA_RANGE
    -----------------------
    :Architecture: x86
    :Status: active
    :Purpose: Request KVM to map a GPA range with the specified attributes.

    a0: the guest physical address of the start page
    a1: the number of (4kb) pages (must be contiguous in GPA space)
    a2: attributes

  where 'attributes' could be something like:

    bits  3:0 - preferred page size encoding 0 = 4kb, 1 = 2mb, 2 = 1gb, etc...
    bit     4 - plaintext = 0, encrypted = 1
    bits 63:5 - reserved (must be zero)
Borislav Petkov May 12, 2021, 4:23 p.m. UTC | #3
On Wed, May 12, 2021 at 03:51:10PM +0000, Sean Christopherson wrote:
>   TL;DR: I think the KVM hypercall should be something like this, so that it can
>   be used for SNP and TDX, and possibly for other purposes, e.g. for paravirt
>   performance enhancements or something.

Ok, good, I was only making sure this is on people's radar but it
actually is more than that. I'll let Tom and Jörg comment on the meat
of the thing - as always, thanks for the detailed explanation.

From my !virt guy POV, I like the aspect of sharing stuff as much as
possible and it all makes sense to me but what the hell do I know...

>     8. KVM_HC_MAP_GPA_RANGE
>     -----------------------
>     :Architecture: x86
>     :Status: active
>     :Purpose: Request KVM to map a GPA range with the specified attributes.
> 
>     a0: the guest physical address of the start page
>     a1: the number of (4kb) pages (must be contiguous in GPA space)
>     a2: attributes
> 
>   where 'attributes' could be something like:
> 
>     bits  3:0 - preferred page size encoding 0 = 4kb, 1 = 2mb, 2 = 1gb, etc...
>     bit     4 - plaintext = 0, encrypted = 1
>     bits 63:5 - reserved (must be zero)

Yah, nice and simple. I like.

Thx.
Ashish Kalra May 13, 2021, 4:34 a.m. UTC | #4
Hello Boris,

On Wed, May 12, 2021 at 03:15:37PM +0200, Borislav Petkov wrote:
> On Fri, Apr 23, 2021 at 03:58:43PM +0000, Ashish Kalra wrote:
> > +static inline void notify_page_enc_status_changed(unsigned long pfn,
> > +						  int npages, bool enc)
> > +{
> > +	PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc);
> > +}
> 
> Now the question is whether something like that is needed for TDX, and,
> if so, could it be shared by both.
> 
> Sean?
> 
> > +void notify_addr_enc_status_changed(unsigned long vaddr, int npages,
> > +				    bool enc)
> 
> Let that line stick out.
> 
> > +{
> > +#ifdef CONFIG_PARAVIRT
> > +	unsigned long sz = npages << PAGE_SHIFT;
> > +	unsigned long vaddr_end = vaddr + sz;
> > +
> > +	while (vaddr < vaddr_end) {
> > +		int psize, pmask, level;
> > +		unsigned long pfn;
> > +		pte_t *kpte;
> > +
> > +		kpte = lookup_address(vaddr, &level);
> > +		if (!kpte || pte_none(*kpte))
> > +			return;
> 
> What does this mean exactly? On the first failure to lookup the address,
> you return? Why not continue so that you can notify about the remaining
> pages in [vaddr - vaddr_end)?

What's the use of notification of a partial page list, even a single
incorrect guest page encryption status can crash the guest/migrated
guest.

> Also, what does it mean for the current range if the lookup fails?
> Innocuous situation or do you need to signal it with a WARN or so?
> 

Yes, it makes sense to signal it with a WARN or so.

> > +
> > +		pfn = pg_level_to_pfn(level, kpte, NULL);
> > +		if (!pfn)
> > +			continue;
> 
> Same here: if it hits the default case, wouldn't it make sense to
> WARN_ONCE or so to catch potential misuse? Or better yet, the WARN_ONCE
> should be in pg_level_to_pfn().

Yes, it makes sense to add a WARN_ONCE() in pg_level_to_pfn().
> 
> > diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> > index 16f878c26667..45e65517405a 100644
> > --- a/arch/x86/mm/pat/set_memory.c
> > +++ b/arch/x86/mm/pat/set_memory.c
> > @@ -2012,6 +2012,13 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> >  	 */
> >  	cpa_flush(&cpa, 0);
> >  
> > +	/*
> > +	 * Notify hypervisor that a given memory range is mapped encrypted
> > +	 * or decrypted. The hypervisor will use this information during the
> > +	 * VM migration.
> > +	 */
> > +	notify_addr_enc_status_changed(addr, numpages, enc);
> 
> If you notify about a range then that function should be called
> 
> 	notify_range_enc_status_changed
> 

Ok. 

Thanks,
Ashish

> or so.
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.kernel.org%2Ftglx%2Fnotes-about-netiquette&amp;data=04%7C01%7CAshish.Kalra%40amd.com%7Cb880e2dae4d24f208c8b08d915480b4a%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637564221487050648%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=q%2FOAt%2FQqv0t%2BXDhjvPQAEYj67XQIUWbis0MXGMu4EZY%3D&amp;reserved=0
Ashish Kalra May 13, 2021, 6:57 a.m. UTC | #5
Hello Sean,

On Wed, May 12, 2021 at 03:51:10PM +0000, Sean Christopherson wrote:
> On Wed, May 12, 2021, Borislav Petkov wrote:
> > On Fri, Apr 23, 2021 at 03:58:43PM +0000, Ashish Kalra wrote:
> > > +static inline void notify_page_enc_status_changed(unsigned long pfn,
> > > +						  int npages, bool enc)
> > > +{
> > > +	PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc);
> > > +}
> > 
> > Now the question is whether something like that is needed for TDX, and,
> > if so, could it be shared by both.
> 
> Yes, TDX needs this same hook, but "can't" reuse the hypercall verbatime.  Ditto
> for SEV-SNP.  I wanted to squish everything into a single common hypercall, but
> that didn't pan out.
> 
> The problem is that both TDX and SNP define their own versions of this so that
> any guest kernel that complies with the TDX|SNP specification will run cleanly
> on a hypervisor that also complies with the spec.  This KVM-specific hook doesn't
> meet those requires because non-Linux guest support will be sketchy at best, and
> non-KVM hypervisor support will be non-existent.
> 
> The best we can do, short of refusing to support TDX or SNP, is to make this
> KVM-specific hypercall compatible with TDX and SNP so that the bulk of the
> control logic is identical.  The mechanics of actually invoking the hypercall
> will differ, but if done right, everything else should be reusable without
> modification.
> 
> I had an in-depth analysis of this, but it was all off-list.  Pasted below. 
> 
>   TDX uses GPRs to communicate with the host, so it can tunnel "legacy" hypercalls
>   from time zero.  SNP could technically do the same (with a revised GHCB spec),
>   but it'd be butt ugly.  And of course trying to go that route for either TDX or
>   SNP would run into the problem of having to coordinate the ABI for the "legacy"
>   hypercall across all guests and hosts.  So yeah, trying to remove any of the
>   three (KVM vs. SNP vs. TDX) interfaces is sadly just wishful thinking.
> 
>   That being said, I do think we can reuse the KVM specific hypercall for TDX and
>   SNP.  Both will still need a {TDX,SNP}-specific GCH{I,B} protocol so that cross-
>   vendor compatibility is guaranteed, but that shouldn't preclude a guest that is
>   KVM enlightened from switching to the KVM specific hypercall once it can do so.
>   More thoughts later on.
> 
>   > I guess a common structure could be used along the lines of what is in the
>   > GHCB spec today, but that seems like overkill for SEV/SEV-ES, which will
>   > only ever really do a single page range at a time (through
>   > set_memory_encrypted() and set_memory_decrypted()). The reason for the
>   > expanded form for SEV-SNP is that the OS can (proactively) adjust multiple
>   > page ranges in advance. Will TDX need to do something similar?
> 
>   Yes, TDX needs the exact same thing.  All three (SEV, SNP, and TDX) have more or
>   less the exact same hook in the guest (Linux, obviously) kernel.
> 
>   > If so, the only real common piece in KVM is a function to track what pages
>   > are shared vs private, which would only require a simple interface.
> 
>   It's not just KVM, it's also the relevant code in the guest kernel(s) and other
>   hypervisors.  And the MMU side of KVM will likely be able to share code, e.g. to
>   act on the page size hint.
> 
>   > So for SEV/SEV-ES, a simpler hypercall interface to specify a single page
>   > range is really all that is needed, but it must be common across
>   > hypervisors. I think that was one Sean's points originally, we don't want
>   > one hypercall for KVM, one for Hyper-V, one for VMware, one for Xen, etc.
> 
>   For the KVM defined interface (required for SEV/SEV-ES), I think it makes sense
>   to make it a superset of the SNP and TDX protocols so that it _can_ be used in
>   lieu of the SNP/TDX specific protocol.  I don't know for sure whether or not
>   that will actually yield better code and/or performance, but it costs us almost
>   nothing and at least gives us the option of further optimizing the Linux+KVM
>   combination.
> 
>   It probably shouldn't be a strict superset, as in practice I don't think SNP
>   approach of having individual entries when batching multiple pages will yield
>   the best performance.  E.g. the vast majority (maybe all?) of conversions for a
>   Linux guest will be physically contiguous and will have the same preferred page
>   size, at which point there will be less overhead if the guest specifies a
>   massive range as opposed to having to santize and fill a large buffer.
> 
>   TL;DR: I think the KVM hypercall should be something like this, so that it can
>   be used for SNP and TDX, and possibly for other purposes, e.g. for paravirt
>   performance enhancements or something.
> 
>     8. KVM_HC_MAP_GPA_RANGE
>     -----------------------
>     :Architecture: x86
>     :Status: active
>     :Purpose: Request KVM to map a GPA range with the specified attributes.
> 
>     a0: the guest physical address of the start page
>     a1: the number of (4kb) pages (must be contiguous in GPA space)
>     a2: attributes
> 
>   where 'attributes' could be something like:
> 
>     bits  3:0 - preferred page size encoding 0 = 4kb, 1 = 2mb, 2 = 1gb, etc...
>     bit     4 - plaintext = 0, encrypted = 1
>     bits 63:5 - reserved (must be zero)
> 

Ok. Will modify page encryption status hypercall to be compatible with
the above defined interface.

Thanks,
Ashish
Paolo Bonzini May 13, 2021, 8:40 a.m. UTC | #6
On 13/05/21 08:57, Ashish Kalra wrote:
>>      8. KVM_HC_MAP_GPA_RANGE
>>      -----------------------
>>      :Architecture: x86
>>      :Status: active
>>      :Purpose: Request KVM to map a GPA range with the specified attributes.
>>
>>      a0: the guest physical address of the start page
>>      a1: the number of (4kb) pages (must be contiguous in GPA space)
>>      a2: attributes
>>
>>    where 'attributes' could be something like:
>>
>>      bits  3:0 - preferred page size encoding 0 = 4kb, 1 = 2mb, 2 = 1gb, etc...
>>      bit     4 - plaintext = 0, encrypted = 1
>>      bits 63:5 - reserved (must be zero)
>>
> 
> Ok. Will modify page encryption status hypercall to be compatible with
> the above defined interface.

Great, this is the current state of the host-side patch (untested):

 From df571861e1d47d81a578b4950c704d01a0ed915e Mon Sep 17 00:00:00 2001
From: Ashish Kalra <ashish.kalra@amd.com>
Date: Thu, 15 Apr 2021 15:57:02 +0000
Subject: [PATCH] KVM: X86: Introduce KVM_HC_PAGE_ENC_STATUS hypercall

This hypercall is used by the SEV guest to notify a change in the page
encryption status to the hypervisor. The hypercall should be invoked
only when the encryption attribute is changed from encrypted -> decrypted
and vice versa. By default all guest pages are considered encrypted.

The hypercall exits to userspace to manage the guest shared regions and
integrate with the userspace VMM's migration code.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 7fcb2fd38f42..0d2abcad0565 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6891,3 +6891,22 @@ This capability is always enabled.
  This capability indicates that the KVM virtual PTP service is
  supported in the host. A VMM can check whether the service is
  available to the guest on migration.
+
+8.33 KVM_CAP_EXIT_HYPERCALL
+---------------------------
+
+:Capability: KVM_CAP_EXIT_HYPERCALL
+:Architectures: x86
+:Type: vm
+
+This capability, if enabled, will cause KVM to exit to userspace
+with KVM_EXIT_HYPERCALL exit reason to process some hypercalls.
+
+Calling KVM_CHECK_EXTENSION for this capability will return a bitmask
+of hypercalls that can be configured to exit to userspace.
+Right now, the only such hypercall is KVM_HC_PAGE_ENC_STATUS.
+
+The argument to KVM_ENABLE_CAP is also a bitmask, and must be a subset
+of the result of KVM_CHECK_EXTENSION.  KVM will forward to userspace
+the hypercalls whose corresponding bit is in the argument, and return
+ENOSYS for the others.
diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
index cf62162d4be2..1e0013d3c972 100644
--- a/Documentation/virt/kvm/cpuid.rst
+++ b/Documentation/virt/kvm/cpuid.rst
@@ -96,6 +96,14 @@ KVM_FEATURE_MSI_EXT_DEST_ID        15          guest checks this feature bit
                                                 before using extended destination
                                                 ID bits in MSI address bits 11-5.
  
+KVM_FEATURE_HC_PAGE_ENC_STATUS     16          guest checks this feature bit before
+                                               using the page encryption state
+                                               hypercall to notify the page state
+                                               change
+
+KVM_FEATURE_MIGRATION_CONTROL      17          guest checks this feature bit before
+                                               using MSR_KVM_MIGRATION_CONTROL
+
  KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24          host will warn if no guest-side
                                                 per-cpu warps are expected in
                                                 kvmclock
diff --git a/Documentation/virt/kvm/hypercalls.rst b/Documentation/virt/kvm/hypercalls.rst
index ed4fddd364ea..117ff3b27d3c 100644
--- a/Documentation/virt/kvm/hypercalls.rst
+++ b/Documentation/virt/kvm/hypercalls.rst
@@ -169,3 +169,24 @@ a0: destination APIC ID
  
  :Usage example: When sending a call-function IPI-many to vCPUs, yield if
  	        any of the IPI target vCPUs was preempted.
+
+
+8. KVM_HC_PAGE_ENC_STATUS
+-------------------------
+:Architecture: x86
+:Status: active
+:Purpose: Notify the encryption status changes in guest page table (SEV guest)
+
+a0: the guest physical address of the start page
+a1: the number of pages
+a2: page encryption status
+
+   Where:
+	* 1: Page is encrypted
+	* 0: Page is decrypted
+
+**Implementation note**: this hypercall is implemented in userspace via
+the KVM_CAP_EXIT_HYPERCALL capability.  Userspace must enable that capability
+before advertising KVM_FEATURE_HC_PAGE_ENC_STATUS in the guest CPUID.  In
+addition, if the guest supports KVM_FEATURE_MIGRATION_CONTROL, userspace
+must also set up an MSR filter to process writes to MSR_KVM_MIGRATION_CONTROL.
diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
index e37a14c323d2..977936176f36 100644
--- a/Documentation/virt/kvm/msr.rst
+++ b/Documentation/virt/kvm/msr.rst
@@ -376,3 +376,16 @@ data:
  	write '1' to bit 0 of the MSR, this causes the host to re-scan its queue
  	and check if there are more notifications pending. The MSR is available
  	if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
+
+MSR_KVM_MIGRATION_CONTROL:
+        0x4b564d08
+
+data:
+        This MSR is available if KVM_FEATURE_MIGRATION_CONTROL is present in
+        CPUID.  Bit 0 represents whether live migration of the guest is allowed.
+
+        When a guest is started, bit 0 will be 0 if the guest has encrypted
+        memory and 1 if the guest does not have encrypted memory.  If the
+        guest is communicating page encryption status to the host using the
+        ``KVM_HC_PAGE_ENC_STATUS`` hypercall, it can set bit 0 in this MSR to
+        allow live migration of the guest.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 55efbacfc244..5b9bc8b3db20 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1067,6 +1067,8 @@ struct kvm_arch {
  	u32 user_space_msr_mask;
  	struct kvm_x86_msr_filter __rcu *msr_filter;
  
+	u32 hypercall_exit_enabled;
+
  	/* Guest can access the SGX PROVISIONKEY. */
  	bool sgx_provisioning_allowed;
  
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 950afebfba88..cff18b8b6dec 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -33,6 +33,8 @@
  #define KVM_FEATURE_PV_SCHED_YIELD	13
  #define KVM_FEATURE_ASYNC_PF_INT	14
  #define KVM_FEATURE_MSI_EXT_DEST_ID	15
+#define KVM_FEATURE_HC_PAGE_ENC_STATUS	16
+#define KVM_FEATURE_MIGRATION_CONTROL	17
  
  #define KVM_HINTS_REALTIME      0
  
@@ -54,6 +56,7 @@
  #define MSR_KVM_POLL_CONTROL	0x4b564d05
  #define MSR_KVM_ASYNC_PF_INT	0x4b564d06
  #define MSR_KVM_ASYNC_PF_ACK	0x4b564d07
+#define MSR_KVM_MIGRATION_CONTROL	0x4b564d08
  
  struct kvm_steal_time {
  	__u64 steal;
@@ -90,6 +93,8 @@ struct kvm_clock_pairing {
  /* MSR_KVM_ASYNC_PF_INT */
  #define KVM_ASYNC_PF_VEC_MASK			GENMASK(7, 0)
  
+/* MSR_KVM_MIGRATION_CONTROL */
+#define KVM_MIGRATION_READY		(1 << 0)
  
  /* Operations for KVM_HC_MMU_OP */
  #define KVM_MMU_OP_WRITE_PTE            1
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5bd550eaf683..eab7d50eb4e2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -102,6 +102,8 @@ static u64 __read_mostly efer_reserved_bits = ~((u64)EFER_SCE);
  
  static u64 __read_mostly cr4_reserved_bits = CR4_RESERVED_BITS;
  
+#define KVM_EXIT_HYPERCALL_VALID_MASK (1 << KVM_HC_PAGE_ENC_STATUS)
+
  #define KVM_X2APIC_API_VALID_FLAGS (KVM_X2APIC_API_USE_32BIT_IDS | \
                                      KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK)
  
@@ -3894,6 +3896,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
  	case KVM_CAP_VM_COPY_ENC_CONTEXT_FROM:
  		r = 1;
  		break;
+	case KVM_CAP_EXIT_HYPERCALL:
+		r = KVM_EXIT_HYPERCALL_VALID_MASK;
+		break;
  	case KVM_CAP_SET_GUEST_DEBUG2:
  		return KVM_GUESTDBG_VALID_MASK;
  #ifdef CONFIG_KVM_XEN
@@ -5494,6 +5499,14 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
  		break;
  	}
  #endif
+	case KVM_CAP_EXIT_HYPERCALL:
+		if (cap->args[0] & ~KVM_EXIT_HYPERCALL_VALID_MASK) {
+			r = -EINVAL;
+			break;
+		}
+		kvm->arch.hypercall_exit_enabled = cap->args[0];
+		r = 0;
+		break;
  	case KVM_CAP_VM_COPY_ENC_CONTEXT_FROM:
  		r = -EINVAL;
  		if (kvm_x86_ops.vm_copy_enc_context_from)
@@ -8384,6 +8397,16 @@ static void kvm_sched_yield(struct kvm_vcpu *vcpu, unsigned long dest_id)
  	return;
  }
  
+static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
+{
+	u64 ret = vcpu->run->hypercall.ret;
+	if (!is_64_bit_mode(vcpu))
+		ret = (u32)ret;
+	kvm_rax_write(vcpu, ret);
+	++vcpu->stat.hypercalls;
+	return kvm_skip_emulated_instruction(vcpu);
+}
+
  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
  {
  	unsigned long nr, a0, a1, a2, a3, ret;
@@ -8449,6 +8472,28 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
  		kvm_sched_yield(vcpu, a0);
  		ret = 0;
  		break;
+	case KVM_HC_PAGE_ENC_STATUS: {
+		u64 gpa = a0, npages = a1, enc = a2;
+
+		ret = -KVM_ENOSYS;
+		if (!(vcpu->kvm->arch.hypercall_exit_enabled & (1 << KVM_HC_PAGE_ENC_STATUS)))
+			break;
+
+		if (!PAGE_ALIGNED(gpa) || !npages ||
+		    gpa_to_gfn(gpa) + npages <= gpa_to_gfn(gpa)) {
+			ret = -KVM_EINVAL;
+			break;
+		}
+
+		vcpu->run->exit_reason        = KVM_EXIT_HYPERCALL;
+		vcpu->run->hypercall.nr       = KVM_HC_PAGE_ENC_STATUS;
+		vcpu->run->hypercall.args[0]  = gpa;
+		vcpu->run->hypercall.args[1]  = npages;
+		vcpu->run->hypercall.args[2]  = enc;
+		vcpu->run->hypercall.longmode = op_64_bit;
+		vcpu->arch.complete_userspace_io = complete_hypercall_exit;
+		return 0;
+	}
  	default:
  		ret = -KVM_ENOSYS;
  		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 3fd9a7e9d90c..1fb4fd863324 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1082,6 +1082,7 @@ struct kvm_ppc_resize_hpt {
  #define KVM_CAP_SGX_ATTRIBUTE 196
  #define KVM_CAP_VM_COPY_ENC_CONTEXT_FROM 197
  #define KVM_CAP_PTP_KVM 198
+#define KVM_CAP_EXIT_HYPERCALL 199
  
  #ifdef KVM_CAP_IRQ_ROUTING
  
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 8b86609849b9..847b83b75dc8 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -29,6 +29,7 @@
  #define KVM_HC_CLOCK_PAIRING		9
  #define KVM_HC_SEND_IPI		10
  #define KVM_HC_SCHED_YIELD		11
+#define KVM_HC_PAGE_ENC_STATUS		12
  
  /*
   * hypercalls use architecture specific
Tom Lendacky May 13, 2021, 1:49 p.m. UTC | #7
On 5/12/21 10:51 AM, Sean Christopherson wrote:
> On Wed, May 12, 2021, Borislav Petkov wrote:
>> On Fri, Apr 23, 2021 at 03:58:43PM +0000, Ashish Kalra wrote:
>>> +static inline void notify_page_enc_status_changed(unsigned long pfn,
>>> +						  int npages, bool enc)
>>> +{
>>> +	PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc);
>>> +}
>>
>> Now the question is whether something like that is needed for TDX, and,
>> if so, could it be shared by both.
> 
> Yes, TDX needs this same hook, but "can't" reuse the hypercall verbatime.  Ditto
> for SEV-SNP.  I wanted to squish everything into a single common hypercall, but
> that didn't pan out.
> 
> The problem is that both TDX and SNP define their own versions of this so that
> any guest kernel that complies with the TDX|SNP specification will run cleanly
> on a hypervisor that also complies with the spec.  This KVM-specific hook doesn't
> meet those requires because non-Linux guest support will be sketchy at best, and
> non-KVM hypervisor support will be non-existent.
> 
> The best we can do, short of refusing to support TDX or SNP, is to make this
> KVM-specific hypercall compatible with TDX and SNP so that the bulk of the
> control logic is identical.  The mechanics of actually invoking the hypercall
> will differ, but if done right, everything else should be reusable without
> modification.
> 
> I had an in-depth analysis of this, but it was all off-list.  Pasted below. 
> 
>   TDX uses GPRs to communicate with the host, so it can tunnel "legacy" hypercalls
>   from time zero.  SNP could technically do the same (with a revised GHCB spec),
>   but it'd be butt ugly.  And of course trying to go that route for either TDX or
>   SNP would run into the problem of having to coordinate the ABI for the "legacy"
>   hypercall across all guests and hosts.  So yeah, trying to remove any of the
>   three (KVM vs. SNP vs. TDX) interfaces is sadly just wishful thinking.
> 
>   That being said, I do think we can reuse the KVM specific hypercall for TDX and
>   SNP.  Both will still need a {TDX,SNP}-specific GCH{I,B} protocol so that cross-
>   vendor compatibility is guaranteed, but that shouldn't preclude a guest that is
>   KVM enlightened from switching to the KVM specific hypercall once it can do so.
>   More thoughts later on.
> 
>   > I guess a common structure could be used along the lines of what is in the
>   > GHCB spec today, but that seems like overkill for SEV/SEV-ES, which will
>   > only ever really do a single page range at a time (through
>   > set_memory_encrypted() and set_memory_decrypted()). The reason for the
>   > expanded form for SEV-SNP is that the OS can (proactively) adjust multiple
>   > page ranges in advance. Will TDX need to do something similar?
> 
>   Yes, TDX needs the exact same thing.  All three (SEV, SNP, and TDX) have more or
>   less the exact same hook in the guest (Linux, obviously) kernel.
> 
>   > If so, the only real common piece in KVM is a function to track what pages
>   > are shared vs private, which would only require a simple interface.
> 
>   It's not just KVM, it's also the relevant code in the guest kernel(s) and other
>   hypervisors.  And the MMU side of KVM will likely be able to share code, e.g. to
>   act on the page size hint.
> 
>   > So for SEV/SEV-ES, a simpler hypercall interface to specify a single page
>   > range is really all that is needed, but it must be common across
>   > hypervisors. I think that was one Sean's points originally, we don't want
>   > one hypercall for KVM, one for Hyper-V, one for VMware, one for Xen, etc.
> 
>   For the KVM defined interface (required for SEV/SEV-ES), I think it makes sense
>   to make it a superset of the SNP and TDX protocols so that it _can_ be used in
>   lieu of the SNP/TDX specific protocol.  I don't know for sure whether or not
>   that will actually yield better code and/or performance, but it costs us almost
>   nothing and at least gives us the option of further optimizing the Linux+KVM
>   combination.

Right, for SEV-SNP, as long as we know that the KVM interface is
available, it could be used. But we would have to fall back to the GHCB
specification if it could not be determined.

> 
>   It probably shouldn't be a strict superset, as in practice I don't think SNP
>   approach of having individual entries when batching multiple pages will yield
>   the best performance.  E.g. the vast majority (maybe all?) of conversions for a
>   Linux guest will be physically contiguous and will have the same preferred page
>   size, at which point there will be less overhead if the guest specifies a
>   massive range as opposed to having to santize and fill a large buffer.

Originally, the plan was to use ranges, but based on feedback during the
GHCB spec review it was updated to its current definition.

A concern from other hypervisor vendors was to be able to return back to
the guest in the middle of the hypercall, e.g. to deliver an interrupt or
such, and then allow the guest to resume the hypercall where it left off.
Not sure if you would want to build that into this hypercall, since
depending on the range, the operation could take a while.

Thanks,
Tom

> 
>   TL;DR: I think the KVM hypercall should be something like this, so that it can
>   be used for SNP and TDX, and possibly for other purposes, e.g. for paravirt
>   performance enhancements or something.
> 
>     8. KVM_HC_MAP_GPA_RANGE
>     -----------------------
>     :Architecture: x86
>     :Status: active
>     :Purpose: Request KVM to map a GPA range with the specified attributes.
> 
>     a0: the guest physical address of the start page
>     a1: the number of (4kb) pages (must be contiguous in GPA space)
>     a2: attributes
> 
>   where 'attributes' could be something like:
> 
>     bits  3:0 - preferred page size encoding 0 = 4kb, 1 = 2mb, 2 = 1gb, etc...
>     bit     4 - plaintext = 0, encrypted = 1
>     bits 63:5 - reserved (must be zero)
>
Borislav Petkov May 14, 2021, 7:33 a.m. UTC | #8
On Thu, May 13, 2021 at 04:34:41AM +0000, Ashish Kalra wrote:
> What's the use of notification of a partial page list, even a single
> incorrect guest page encryption status can crash the guest/migrated
> guest.

Ok, so explain to me how this looks from the user standpoint: she starts
migrating the guest, it fails to lookup an address, there's nothing
saying where it failed but the guest crashed.

Do you think this is user-friendly?
Paolo Bonzini May 14, 2021, 8:03 a.m. UTC | #9
On 14/05/21 09:33, Borislav Petkov wrote:
> Ok, so explain to me how this looks from the user standpoint: she starts
> migrating the guest, it fails to lookup an address, there's nothing
> saying where it failed but the guest crashed.
> 
> Do you think this is user-friendly?

Ok, so explain to me how this looks from the submitter standpoint: he 
reads your review of his patch, he acknowledges your point with "Yes, it 
makes sense to signal it with a WARN or so", and still is treated as shit.

Do you think this is friendly?

Paolo
Ashish Kalra May 14, 2021, 9:05 a.m. UTC | #10
Hello Boris, Paolo,

On Fri, May 14, 2021 at 10:03:18AM +0200, Paolo Bonzini wrote:
> On 14/05/21 09:33, Borislav Petkov wrote:
> > Ok, so explain to me how this looks from the user standpoint: she starts
> > migrating the guest, it fails to lookup an address, there's nothing
> > saying where it failed but the guest crashed.
> > 
> > Do you think this is user-friendly?
> 
> Ok, so explain to me how this looks from the submitter standpoint: he reads
> your review of his patch, he acknowledges your point with "Yes, it makes
> sense to signal it with a WARN or so", and still is treated as shit.
> 
> Do you think this is friendly?
> 
> 

I absolutely agree with both of your point of view. But what's the
alternative ?

Ideally we should fail/stop migration even if a single guest page
encryption status cannot be notified and that should be the way to
proceed in this case, the guest kernel should notify the source
userspace VMM to block/stop migration in this case.

From a practical side, i do see Qemu's migrate_add_blocker() interface
but that looks to be a static interface and also i don't think it will
force stop an ongoing migration, is there an existing mechanism
to inform userspace VMM from kernel about blocking/stopping migration ?

Thanks,
Ashish
Borislav Petkov May 14, 2021, 9:24 a.m. UTC | #11
On Fri, May 14, 2021 at 10:03:18AM +0200, Paolo Bonzini wrote:
> Ok, so explain to me how this looks from the submitter standpoint: he reads
> your review of his patch, he acknowledges your point with "Yes, it makes
> sense to signal it with a WARN or so", and still is treated as shit.

How is me asking about the user experience of it all, treating him like
shit?!

How should I have asked this so that it is not making you think I'm
treating him like shit?

Because treating someone like shit is not in my goals.
Ashish Kalra May 14, 2021, 9:33 a.m. UTC | #12
On Fri, May 14, 2021 at 11:24:03AM +0200, Borislav Petkov wrote:
> On Fri, May 14, 2021 at 10:03:18AM +0200, Paolo Bonzini wrote:
> > Ok, so explain to me how this looks from the submitter standpoint: he reads
> > your review of his patch, he acknowledges your point with "Yes, it makes
> > sense to signal it with a WARN or so", and still is treated as shit.
> 
> How is me asking about the user experience of it all, treating him like
> shit?!
> 
> How should I have asked this so that it is not making you think I'm
> treating him like shit?
> 
> Because treating someone like shit is not in my goals.
> 

As i mentioned in my previous reply, this has to be treated as a fatal
error from the user point of view, and the kernel needs to inform the
userspace VMM to block/stop migration as response to this failure.

Thanks,
Ashish
Borislav Petkov May 14, 2021, 9:34 a.m. UTC | #13
On Fri, May 14, 2021 at 09:05:23AM +0000, Ashish Kalra wrote:
> Ideally we should fail/stop migration even if a single guest page
> encryption status cannot be notified and that should be the way to
> proceed in this case, the guest kernel should notify the source
> userspace VMM to block/stop migration in this case.

Yap, and what I'm trying to point out here is that if the low level
machinery fails for whatever reason and it cannot recover, we should
propagate that error up the chain so that the user is aware as to why it
failed.

WARN is a good first start but in some configurations those splats don't
even get shown as people don't look at dmesg, etc.

And I think it is very important to propagate those errors properly
because there's a *lot* of moving parts involved in a guest migration
and you have encrypted memory which makes debugging this probably even
harder, etc, etc.

I hope this makes more sense.

> From a practical side, i do see Qemu's migrate_add_blocker() interface
> but that looks to be a static interface and also i don't think it will
> force stop an ongoing migration, is there an existing mechanism
> to inform userspace VMM from kernel about blocking/stopping migration ?

Hmm, so __set_memory_enc_dec() which calls
notify_addr_enc_status_changed() is called by the guest, right, when it
starts migrating.

Can an error value from it be propagated up the callchain so it can be
turned into an error messsage for the guest owner to see?

(I might be way off base here as I have no clue how the whole migration
 machinery is kicked into gear...)

Thx.
Paolo Bonzini May 14, 2021, 9:57 a.m. UTC | #14
On 14/05/21 11:05, Ashish Kalra wrote:
> I absolutely agree with both of your point of view. But what's the
> alternative ?
> 
> Ideally we should fail/stop migration even if a single guest page
> encryption status cannot be notified and that should be the way to
> proceed in this case, the guest kernel should notify the source
> userspace VMM to block/stop migration in this case.
> 
>  From a practical side, i do see Qemu's migrate_add_blocker() interface
> but that looks to be a static interface and also i don't think it will
> force stop an ongoing migration, is there an existing mechanism
> to inform userspace VMM from kernel about blocking/stopping migration ?

On the Linux side, all you need to do is WARN and write 0 to the 
MIGRATION_CONTROL MSR.

QEMU can check the MSR value when migrating the CPU registers at the 
end, and fail migration if the MSR value is 0.

Paolo
Ashish Kalra May 14, 2021, 10:05 a.m. UTC | #15
On Fri, May 14, 2021 at 11:34:32AM +0200, Borislav Petkov wrote:
> On Fri, May 14, 2021 at 09:05:23AM +0000, Ashish Kalra wrote:
> > Ideally we should fail/stop migration even if a single guest page
> > encryption status cannot be notified and that should be the way to
> > proceed in this case, the guest kernel should notify the source
> > userspace VMM to block/stop migration in this case.
> 
> Yap, and what I'm trying to point out here is that if the low level
> machinery fails for whatever reason and it cannot recover, we should
> propagate that error up the chain so that the user is aware as to why it
> failed.
> 

I totally agree.

> WARN is a good first start but in some configurations those splats don't
> even get shown as people don't look at dmesg, etc.
> 
> And I think it is very important to propagate those errors properly
> because there's a *lot* of moving parts involved in a guest migration
> and you have encrypted memory which makes debugging this probably even
> harder, etc, etc.
> 
> I hope this makes more sense.
> 
> > From a practical side, i do see Qemu's migrate_add_blocker() interface
> > but that looks to be a static interface and also i don't think it will
> > force stop an ongoing migration, is there an existing mechanism
> > to inform userspace VMM from kernel about blocking/stopping migration ?
> 
> Hmm, so __set_memory_enc_dec() which calls
> notify_addr_enc_status_changed() is called by the guest, right, when it
> starts migrating.
> 

No, actually notify_addr_enc_status_changed() is called whenever a range
of memory is marked as encrypted or decrypted, so it has nothing to do
with migration as such. 

This is basically modifying the encryption attributes on the page tables
and correspondingly also making the hypercall to inform the hypervisor about
page status encryption changes. The hypervisor will use this information
during an ongoing or future migration, so this information is maintained
even though migration might never be initiated here.

> Can an error value from it be propagated up the callchain so it can be
> turned into an error messsage for the guest owner to see?
> 

The error value cannot be propogated up the callchain directly
here, but one possibility is to leverage the hypercall and use Sean's
proposed hypercall interface to notify the host/hypervisor to block/stop
any future/ongoing migration.

Or as from Paolo's response, writing 0 to MIGRATION_CONTROL MSR seems
more ideal.

Thanks,
Ashish
Borislav Petkov May 14, 2021, 10:38 a.m. UTC | #16
On Fri, May 14, 2021 at 10:05:19AM +0000, Ashish Kalra wrote:
> No, actually notify_addr_enc_status_changed() is called whenever a range
> of memory is marked as encrypted or decrypted, so it has nothing to do
> with migration as such. 
> 
> This is basically modifying the encryption attributes on the page tables
> and correspondingly also making the hypercall to inform the hypervisor about
> page status encryption changes. The hypervisor will use this information
> during an ongoing or future migration, so this information is maintained
> even though migration might never be initiated here.

Doh, ofcourse. This doesn't make it easier.

> The error value cannot be propogated up the callchain directly
> here,

Yeah, my thinking was way wrong here - sorry about that.

> but one possibility is to leverage the hypercall and use Sean's
> proposed hypercall interface to notify the host/hypervisor to block/stop
> any future/ongoing migration.
>
> Or as from Paolo's response, writing 0 to MIGRATION_CONTROL MSR seems
> more ideal.

Ok.

So to sum up: notify_addr_enc_status_changed() should warn but not
because of migration but because regardless, we should tell the users
when page enc attributes updating fails as that is potentially hinting
at a bigger problem so we better make sufficient noise here.

Thx.
Steve Rutherford May 18, 2021, 2:01 a.m. UTC | #17
On Fri, May 14, 2021 at 3:05 AM Ashish Kalra <ashish.kalra@amd.com> wrote:
>
> On Fri, May 14, 2021 at 11:34:32AM +0200, Borislav Petkov wrote:
> > On Fri, May 14, 2021 at 09:05:23AM +0000, Ashish Kalra wrote:
> > > Ideally we should fail/stop migration even if a single guest page
> > > encryption status cannot be notified and that should be the way to
> > > proceed in this case, the guest kernel should notify the source
> > > userspace VMM to block/stop migration in this case.
> >
> > Yap, and what I'm trying to point out here is that if the low level
> > machinery fails for whatever reason and it cannot recover, we should
> > propagate that error up the chain so that the user is aware as to why it
> > failed.
> >
>
> I totally agree.
>
> > WARN is a good first start but in some configurations those splats don't
> > even get shown as people don't look at dmesg, etc.
> >
> > And I think it is very important to propagate those errors properly
> > because there's a *lot* of moving parts involved in a guest migration
> > and you have encrypted memory which makes debugging this probably even
> > harder, etc, etc.
> >
> > I hope this makes more sense.
> >
> > > From a practical side, i do see Qemu's migrate_add_blocker() interface
> > > but that looks to be a static interface and also i don't think it will
> > > force stop an ongoing migration, is there an existing mechanism
> > > to inform userspace VMM from kernel about blocking/stopping migration ?
> >
> > Hmm, so __set_memory_enc_dec() which calls
> > notify_addr_enc_status_changed() is called by the guest, right, when it
> > starts migrating.
> >
>
> No, actually notify_addr_enc_status_changed() is called whenever a range
> of memory is marked as encrypted or decrypted, so it has nothing to do
> with migration as such.
>
> This is basically modifying the encryption attributes on the page tables
> and correspondingly also making the hypercall to inform the hypervisor about
> page status encryption changes. The hypervisor will use this information
> during an ongoing or future migration, so this information is maintained
> even though migration might never be initiated here.
>
> > Can an error value from it be propagated up the callchain so it can be
> > turned into an error messsage for the guest owner to see?
> >
>
> The error value cannot be propogated up the callchain directly
> here, but one possibility is to leverage the hypercall and use Sean's
> proposed hypercall interface to notify the host/hypervisor to block/stop
> any future/ongoing migration.
>
> Or as from Paolo's response, writing 0 to MIGRATION_CONTROL MSR seems
> more ideal.
>
> Thanks,
> Ashish

How realistic is this type of failure? If you've gotten this deep, it
seems like something has gone very wrong if the memory you are about
to mark as shared (or encrypted) doesn't exist and isn't mapped. In
particular, is the kernel going to page fault when it tries to
reinitialize the page it's currently changing the c-bit of? From what
I recall, most paths that do either set_memory_encrypted or
set_memory_decrypted memset the region being toggled. Note: dma_pool
doesn't immediately memset, but the VA it's providing to set_decrypted
is freshly fetched from a recently allocated region (something has to
have gone pretty wrong if this is invalid if I'm not mistaken. No one
would think twice if you wrote to that freshly allocated page).

The reason I mention this is that SEV migration is going to be the
least of your concerns if you are already on a one-way train towards a
Kernel oops. I'm not certain I would go so far as to say this should
BUG() instead (I think the page fault on access might be easier to
debug a BUG here), but I'm pretty skeptical that the kernel is going
to do too well if it doesn't know if its kernel VAs are valid.

If, despite the above, we expect to infrequently-but-not-never disable
migration with no intention of reenabling it, we should signal it
differently than we currently signal migration enablement. Currently,
if you toggle migration from on to off there is an implication that
you are about to reboot, and you are only ephemerally unable to
migrate. Having permanent disablement be indistinguishable from a
really long reboot is a recipe for a really sad long timeout in
userspace.
Ashish Kalra May 19, 2021, 12:06 p.m. UTC | #18
On Mon, May 17, 2021 at 07:01:09PM -0700, Steve Rutherford wrote:
> On Fri, May 14, 2021 at 3:05 AM Ashish Kalra <ashish.kalra@amd.com> wrote:
> >
> > On Fri, May 14, 2021 at 11:34:32AM +0200, Borislav Petkov wrote:
> > > On Fri, May 14, 2021 at 09:05:23AM +0000, Ashish Kalra wrote:
> > > > Ideally we should fail/stop migration even if a single guest page
> > > > encryption status cannot be notified and that should be the way to
> > > > proceed in this case, the guest kernel should notify the source
> > > > userspace VMM to block/stop migration in this case.
> > >
> > > Yap, and what I'm trying to point out here is that if the low level
> > > machinery fails for whatever reason and it cannot recover, we should
> > > propagate that error up the chain so that the user is aware as to why it
> > > failed.
> > >
> >
> > I totally agree.
> >
> > > WARN is a good first start but in some configurations those splats don't
> > > even get shown as people don't look at dmesg, etc.
> > >
> > > And I think it is very important to propagate those errors properly
> > > because there's a *lot* of moving parts involved in a guest migration
> > > and you have encrypted memory which makes debugging this probably even
> > > harder, etc, etc.
> > >
> > > I hope this makes more sense.
> > >
> > > > From a practical side, i do see Qemu's migrate_add_blocker() interface
> > > > but that looks to be a static interface and also i don't think it will
> > > > force stop an ongoing migration, is there an existing mechanism
> > > > to inform userspace VMM from kernel about blocking/stopping migration ?
> > >
> > > Hmm, so __set_memory_enc_dec() which calls
> > > notify_addr_enc_status_changed() is called by the guest, right, when it
> > > starts migrating.
> > >
> >
> > No, actually notify_addr_enc_status_changed() is called whenever a range
> > of memory is marked as encrypted or decrypted, so it has nothing to do
> > with migration as such.
> >
> > This is basically modifying the encryption attributes on the page tables
> > and correspondingly also making the hypercall to inform the hypervisor about
> > page status encryption changes. The hypervisor will use this information
> > during an ongoing or future migration, so this information is maintained
> > even though migration might never be initiated here.
> >
> > > Can an error value from it be propagated up the callchain so it can be
> > > turned into an error messsage for the guest owner to see?
> > >
> >
> > The error value cannot be propogated up the callchain directly
> > here, but one possibility is to leverage the hypercall and use Sean's
> > proposed hypercall interface to notify the host/hypervisor to block/stop
> > any future/ongoing migration.
> >
> > Or as from Paolo's response, writing 0 to MIGRATION_CONTROL MSR seems
> > more ideal.
> >
> > Thanks,
> > Ashish
> 
> How realistic is this type of failure? If you've gotten this deep, it
> seems like something has gone very wrong if the memory you are about
> to mark as shared (or encrypted) doesn't exist and isn't mapped. In
> particular, is the kernel going to page fault when it tries to
> reinitialize the page it's currently changing the c-bit of? From what
> I recall, most paths that do either set_memory_encrypted or
> set_memory_decrypted memset the region being toggled. Note: dma_pool
> doesn't immediately memset, but the VA it's providing to set_decrypted
> is freshly fetched from a recently allocated region (something has to
> have gone pretty wrong if this is invalid if I'm not mistaken. No one
> would think twice if you wrote to that freshly allocated page).
> 
> The reason I mention this is that SEV migration is going to be the
> least of your concerns if you are already on a one-way train towards a
> Kernel oops. I'm not certain I would go so far as to say this should
> BUG() instead (I think the page fault on access might be easier to
> debug a BUG here), but I'm pretty skeptical that the kernel is going
> to do too well if it doesn't know if its kernel VAs are valid.
> 
> If, despite the above, we expect to infrequently-but-not-never disable
> migration with no intention of reenabling it, we should signal it
> differently than we currently signal migration enablement. Currently,
> if you toggle migration from on to off there is an implication that
> you are about to reboot, and you are only ephemerally unable to
> migrate. Having permanent disablement be indistinguishable from a
> really long reboot is a recipe for a really sad long timeout in
> userspace.

Also looking at set_memory_encrypted and set_memory_decrypted usage
patterns, the persistent decrypted regions like the dma pool,
bss_decrypted, etc., will be setup at boot time. Later the unused
bss_decrypted section will be freed and set_memory_encrypted called on
the freed memory at end of kernel boot.

Most of the runtime set_memory_encrypted and set_memory_decrypted calls
will be on dynamically allocated dma buffers via dma_alloc_coherent()
and dma_free_coherent().

For example, A dma_alloc_coherent() request will allocate pages and then call
set_memory_decrypted() against them. When dma_free_coherent() is called,
set_memory_encrypted() is called against the pages about to be freed 
before they are actually freed.

Now these buffers have very short life and only used for immediate I/O
and then freed, so they may not be of major concern for SEV
migration ?

So disabling migration for failure of address lookup or mapping failures
on such pages will really be an overkill. 

Might be in favor of Steve's thoughts above of doing a BUG() here
instead.

Thanks,
Ashish
Paolo Bonzini May 19, 2021, 1:44 p.m. UTC | #19
On 19/05/21 14:06, Ashish Kalra wrote:
> Now these buffers have very short life and only used for immediate I/O
> and then freed, so they may not be of major concern for SEV
> migration ?

Well, they are a concern because they do break migration.  But it may be 
indeed good enough to just have a WARN ("bad things may happen and you 
get to keep both pieces") and not disable future migration.

A BUG must always be avoided unless you're sure that something *worse* 
will happen in the future, e.g. a BUG is acceptable if you have detected 
a use-after-free or a dangling pointer.  This is not the case.

Paolo

> So disabling migration for failure of address lookup or mapping failures
> on such pages will really be an overkill.
> Might be in favor of Steve's thoughts above of doing a BUG() here
> instead.
Andy Lutomirski May 19, 2021, 11:29 p.m. UTC | #20
On Wed, May 12, 2021, at 6:15 AM, Borislav Petkov wrote:
> On Fri, Apr 23, 2021 at 03:58:43PM +0000, Ashish Kalra wrote:
> > +static inline void notify_page_enc_status_changed(unsigned long pfn,
> > +						  int npages, bool enc)
> > +{
> > +	PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc);
> > +}
> 
> Now the question is whether something like that is needed for TDX, and,
> if so, could it be shared by both.

The TDX MapGPA call can fail, and presumably it will fail if the page is not sufficiently quiescent from the host's perspective.  It seems like a mistake to me to have a KVM-specific hypercall for this that cannot cleanly fail.
Sean Christopherson May 19, 2021, 11:44 p.m. UTC | #21
On Wed, May 19, 2021, Andy Lutomirski wrote:
> On Wed, May 12, 2021, at 6:15 AM, Borislav Petkov wrote:
> > On Fri, Apr 23, 2021 at 03:58:43PM +0000, Ashish Kalra wrote:
> > > +static inline void notify_page_enc_status_changed(unsigned long pfn,
> > > +						  int npages, bool enc)
> > > +{
> > > +	PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc);
> > > +}
> > 
> > Now the question is whether something like that is needed for TDX, and,
> > if so, could it be shared by both.
> 
> The TDX MapGPA call can fail, and presumably it will fail if the page is not
> sufficiently quiescent from the host's perspective.

Barring a guest bug, e.g. requesting a completely non-existent page, MapGPA
shouldn't fail.  The example in the the GHCI:

  Invalid operand – for example, the GPA may be already mapped as a shared page.

makes no sense to me.  An already-mapped page would be an -EBUSY style error,
not an invalid operand, and IIRC, I explicitly lobbied against allowing the VMM
to return "try again" precisely because it's impossible for the guest to handle
in a sane manner.  If the physical page is in a state that requires stalling the
vCPU, then the VMM is supposed to do exactly that, not punt the problem to the
guest.

Maybe we should get stronger language into the GHCI?

> It seems like a mistake to me to have a KVM-specific hypercall for this that
> cannot cleanly fail.
diff mbox series

Patch

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 4abf110e2243..c0ef13e97f5a 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -84,6 +84,12 @@  static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
 	PVOP_VCALL1(mmu.exit_mmap, mm);
 }
 
+static inline void notify_page_enc_status_changed(unsigned long pfn,
+						  int npages, bool enc)
+{
+	PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc);
+}
+
 #ifdef CONFIG_PARAVIRT_XXL
 static inline void load_sp0(unsigned long sp0)
 {
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index de87087d3bde..7245d08f5500 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -195,6 +195,8 @@  struct pv_mmu_ops {
 
 	/* Hook for intercepting the destruction of an mm_struct. */
 	void (*exit_mmap)(struct mm_struct *mm);
+	void (*notify_page_enc_status_changed)(unsigned long pfn, int npages,
+					       bool enc);
 
 #ifdef CONFIG_PARAVIRT_XXL
 	struct paravirt_callee_save read_cr2;
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 4352f08bfbb5..ed9cfe062634 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -83,6 +83,8 @@  int set_pages_rw(struct page *page, int numpages);
 int set_direct_map_invalid_noflush(struct page *page);
 int set_direct_map_default_noflush(struct page *page);
 bool kernel_page_present(struct page *page);
+void notify_addr_enc_status_changed(unsigned long vaddr, int npages,
+				    bool enc);
 
 extern int kernel_set_to_readonly;
 
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index c60222ab8ab9..192230247ad7 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -335,6 +335,7 @@  struct paravirt_patch_template pv_ops = {
 			(void (*)(struct mmu_gather *, void *))tlb_remove_page,
 
 	.mmu.exit_mmap		= paravirt_nop,
+	.mmu.notify_page_enc_status_changed	= paravirt_nop,
 
 #ifdef CONFIG_PARAVIRT_XXL
 	.mmu.read_cr2		= __PV_IS_CALLEE_SAVE(native_read_cr2),
diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 4b01f7dbaf30..e4b94099645b 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -229,29 +229,74 @@  void __init sev_setup_arch(void)
 	swiotlb_adjust_size(size);
 }
 
-static void __init __set_clr_pte_enc(pte_t *kpte, int level, bool enc)
+static unsigned long pg_level_to_pfn(int level, pte_t *kpte, pgprot_t *ret_prot)
 {
-	pgprot_t old_prot, new_prot;
-	unsigned long pfn, pa, size;
-	pte_t new_pte;
+	unsigned long pfn = 0;
+	pgprot_t prot;
 
 	switch (level) {
 	case PG_LEVEL_4K:
 		pfn = pte_pfn(*kpte);
-		old_prot = pte_pgprot(*kpte);
+		prot = pte_pgprot(*kpte);
 		break;
 	case PG_LEVEL_2M:
 		pfn = pmd_pfn(*(pmd_t *)kpte);
-		old_prot = pmd_pgprot(*(pmd_t *)kpte);
+		prot = pmd_pgprot(*(pmd_t *)kpte);
 		break;
 	case PG_LEVEL_1G:
 		pfn = pud_pfn(*(pud_t *)kpte);
-		old_prot = pud_pgprot(*(pud_t *)kpte);
+		prot = pud_pgprot(*(pud_t *)kpte);
 		break;
 	default:
-		return;
+		return 0;
 	}
 
+	if (ret_prot)
+		*ret_prot = prot;
+
+	return pfn;
+}
+
+void notify_addr_enc_status_changed(unsigned long vaddr, int npages,
+				    bool enc)
+{
+#ifdef CONFIG_PARAVIRT
+	unsigned long sz = npages << PAGE_SHIFT;
+	unsigned long vaddr_end = vaddr + sz;
+
+	while (vaddr < vaddr_end) {
+		int psize, pmask, level;
+		unsigned long pfn;
+		pte_t *kpte;
+
+		kpte = lookup_address(vaddr, &level);
+		if (!kpte || pte_none(*kpte))
+			return;
+
+		pfn = pg_level_to_pfn(level, kpte, NULL);
+		if (!pfn)
+			continue;
+
+		psize = page_level_size(level);
+		pmask = page_level_mask(level);
+
+		notify_page_enc_status_changed(pfn, psize >> PAGE_SHIFT, enc);
+
+		vaddr = (vaddr & pmask) + psize;
+	}
+#endif
+}
+
+static void __init __set_clr_pte_enc(pte_t *kpte, int level, bool enc)
+{
+	pgprot_t old_prot, new_prot;
+	unsigned long pfn, pa, size;
+	pte_t new_pte;
+
+	pfn = pg_level_to_pfn(level, kpte, &old_prot);
+	if (!pfn)
+		return;
+
 	new_prot = old_prot;
 	if (enc)
 		pgprot_val(new_prot) |= _PAGE_ENC;
@@ -286,12 +331,13 @@  static void __init __set_clr_pte_enc(pte_t *kpte, int level, bool enc)
 static int __init early_set_memory_enc_dec(unsigned long vaddr,
 					   unsigned long size, bool enc)
 {
-	unsigned long vaddr_end, vaddr_next;
+	unsigned long vaddr_end, vaddr_next, start;
 	unsigned long psize, pmask;
 	int split_page_size_mask;
 	int level, ret;
 	pte_t *kpte;
 
+	start = vaddr;
 	vaddr_next = vaddr;
 	vaddr_end = vaddr + size;
 
@@ -346,6 +392,8 @@  static int __init early_set_memory_enc_dec(unsigned long vaddr,
 
 	ret = 0;
 
+	notify_addr_enc_status_changed(start, PAGE_ALIGN(size) >> PAGE_SHIFT,
+					enc);
 out:
 	__flush_tlb_all();
 	return ret;
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..45e65517405a 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2012,6 +2012,13 @@  static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 	 */
 	cpa_flush(&cpa, 0);
 
+	/*
+	 * Notify hypervisor that a given memory range is mapped encrypted
+	 * or decrypted. The hypervisor will use this information during the
+	 * VM migration.
+	 */
+	notify_addr_enc_status_changed(addr, numpages, enc);
+
 	return ret;
 }