diff mbox series

[v3,3/3] x86/hyperv: Make encrypted/decrypted changes safe for load_unaligned_zeropad()

Message ID 20240105183025.225972-4-mhklinux@outlook.com (mailing list archive)
State New
Headers show
Series x86/hyperv: Mark CoCo VM pages not present when changing encrypted state | expand

Commit Message

Michael Kelley Jan. 5, 2024, 6:30 p.m. UTC
From: Michael Kelley <mhklinux@outlook.com>

In a CoCo VM, when transitioning memory from encrypted to decrypted, or
vice versa, the caller of set_memory_encrypted() or set_memory_decrypted()
is responsible for ensuring the memory isn't in use and isn't referenced
while the transition is in progress.  The transition has multiple steps,
and the memory is in an inconsistent state until all steps are complete.
A reference while the state is inconsistent could result in an exception
that can't be cleanly fixed up.

However, the kernel load_unaligned_zeropad() mechanism could cause a stray
reference that can't be prevented by the caller of set_memory_encrypted()
or set_memory_decrypted(), so there's specific code to handle this case.
But a CoCo VM running on Hyper-V may be configured to run with a paravisor,
with the #VC or #VE exception routed to the paravisor. There's no
architectural way to forward the exceptions back to the guest kernel, and
in such a case, the load_unaligned_zeropad() specific code doesn't work.

To avoid this problem, mark pages as "not present" while a transition
is in progress. If load_unaligned_zeropad() causes a stray reference, a
normal page fault is generated instead of #VC or #VE, and the
page-fault-based fixup handlers for load_unaligned_zeropad() resolve the
reference. When the encrypted/decrypted transition is complete, mark the
pages as "present" again.

Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/hyperv/ivm.c | 49 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 46 insertions(+), 3 deletions(-)

Comments

Kuppuswamy Sathyanarayanan Jan. 8, 2024, 6:37 p.m. UTC | #1
On 1/5/2024 10:30 AM, mhkelley58@gmail.com wrote:
> From: Michael Kelley <mhklinux@outlook.com>
> 
> In a CoCo VM, when transitioning memory from encrypted to decrypted, or
> vice versa, the caller of set_memory_encrypted() or set_memory_decrypted()
> is responsible for ensuring the memory isn't in use and isn't referenced
> while the transition is in progress.  The transition has multiple steps,
> and the memory is in an inconsistent state until all steps are complete.
> A reference while the state is inconsistent could result in an exception
> that can't be cleanly fixed up.
> 
> However, the kernel load_unaligned_zeropad() mechanism could cause a stray
> reference that can't be prevented by the caller of set_memory_encrypted()
> or set_memory_decrypted(), so there's specific code to handle this case.
> But a CoCo VM running on Hyper-V may be configured to run with a paravisor,
> with the #VC or #VE exception routed to the paravisor. There's no
> architectural way to forward the exceptions back to the guest kernel, and
> in such a case, the load_unaligned_zeropad() specific code doesn't work.
> 
> To avoid this problem, mark pages as "not present" while a transition
> is in progress. If load_unaligned_zeropad() causes a stray reference, a
> normal page fault is generated instead of #VC or #VE, and the
> page-fault-based fixup handlers for load_unaligned_zeropad() resolve the
> reference. When the encrypted/decrypted transition is complete, mark the
> pages as "present" again.

Change looks good to me. But I am wondering why are adding it part of prepare
and finish callbacks instead of directly in set_memory_encrypted() function.

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>


> 
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>
> ---
>  arch/x86/hyperv/ivm.c | 49 ++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 46 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index 8ba18635e338..5ad39256a5d2 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -15,6 +15,7 @@
>  #include <asm/io.h>
>  #include <asm/coco.h>
>  #include <asm/mem_encrypt.h>
> +#include <asm/set_memory.h>
>  #include <asm/mshyperv.h>
>  #include <asm/hypervisor.h>
>  #include <asm/mtrr.h>
> @@ -502,6 +503,31 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
>  		return -EFAULT;
>  }
>  
> +/*
> + * When transitioning memory between encrypted and decrypted, the caller
> + * of set_memory_encrypted() or set_memory_decrypted() is responsible for
> + * ensuring that the memory isn't in use and isn't referenced while the
> + * transition is in progress.  The transition has multiple steps, and the
> + * memory is in an inconsistent state until all steps are complete. A
> + * reference while the state is inconsistent could result in an exception
> + * that can't be cleanly fixed up.
> + *
> + * But the Linux kernel load_unaligned_zeropad() mechanism could cause a
> + * stray reference that can't be prevented by the caller, so Linux has
> + * specific code to handle this case. But when the #VC and #VE exceptions
> + * routed to a paravisor, the specific code doesn't work. To avoid this
> + * problem, mark the pages as "not present" while the transition is in
> + * progress. If load_unaligned_zeropad() causes a stray reference, a normal
> + * page fault is generated instead of #VC or #VE, and the page-fault-based
> + * handlers for load_unaligned_zeropad() resolve the reference.  When the
> + * transition is complete, hv_vtom_set_host_visibility() marks the pages
> + * as "present" again.
> + */
> +static bool hv_vtom_clear_present(unsigned long kbuffer, int pagecount, bool enc)
> +{
> +	return !set_memory_np(kbuffer, pagecount);
> +}
> +
>  /*
>   * hv_vtom_set_host_visibility - Set specified memory visible to host.
>   *
> @@ -521,7 +547,7 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
>  
>  	pfn_array = kmalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
>  	if (!pfn_array)
> -		return false;
> +		goto err_set_memory_p;
>  
>  	for (i = 0, pfn = 0; i < pagecount; i++) {
>  		/*
> @@ -545,14 +571,30 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
>  		}
>  	}
>  
> - err_free_pfn_array:
> +err_free_pfn_array:
>  	kfree(pfn_array);
> +
> +err_set_memory_p:
> +	/*
> +	 * Set the PTE PRESENT bits again to revert what hv_vtom_clear_present()
> +	 * did. Do this even if there is an error earlier in this function in
> +	 * order to avoid leaving the memory range in a "broken" state. Setting
> +	 * the PRESENT bits shouldn't fail, but return an error if it does.
> +	 */
> +	if (set_memory_p(kbuffer, pagecount))
> +		result = false;
> +
>  	return result;
>  }
>  
>  static bool hv_vtom_tlb_flush_required(bool private)
>  {
> -	return true;
> +	/*
> +	 * Since hv_vtom_clear_present() marks the PTEs as "not present"
> +	 * and flushes the TLB, they can't be in the TLB. That makes the
> +	 * flush controlled by this function redundant, so return "false".
> +	 */
> +	return false;
>  }
>  
>  static bool hv_vtom_cache_flush_required(void)
> @@ -615,6 +657,7 @@ void __init hv_vtom_init(void)
>  	x86_platform.hyper.is_private_mmio = hv_is_private_mmio;
>  	x86_platform.guest.enc_cache_flush_required = hv_vtom_cache_flush_required;
>  	x86_platform.guest.enc_tlb_flush_required = hv_vtom_tlb_flush_required;
> +	x86_platform.guest.enc_status_change_prepare = hv_vtom_clear_present;
>  	x86_platform.guest.enc_status_change_finish = hv_vtom_set_host_visibility;
>  
>  	/* Set WB as the default cache mode. */
Michael Kelley Jan. 8, 2024, 7:13 p.m. UTC | #2
From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Sent: Monday, January 8, 2024 10:37 AM
> 
> On 1/5/2024 10:30 AM, mhkelley58@gmail.com wrote:
> > From: Michael Kelley <mhklinux@outlook.com>
> >
> > In a CoCo VM, when transitioning memory from encrypted to decrypted, or
> > vice versa, the caller of set_memory_encrypted() or set_memory_decrypted()
> > is responsible for ensuring the memory isn't in use and isn't referenced
> > while the transition is in progress.  The transition has multiple steps,
> > and the memory is in an inconsistent state until all steps are complete.
> > A reference while the state is inconsistent could result in an exception
> > that can't be cleanly fixed up.
> >
> > However, the kernel load_unaligned_zeropad() mechanism could cause a stray
> > reference that can't be prevented by the caller of set_memory_encrypted()
> > or set_memory_decrypted(), so there's specific code to handle this case.
> > But a CoCo VM running on Hyper-V may be configured to run with a paravisor,
> > with the #VC or #VE exception routed to the paravisor. There's no
> > architectural way to forward the exceptions back to the guest kernel, and
> > in such a case, the load_unaligned_zeropad() specific code doesn't work.
> >
> > To avoid this problem, mark pages as "not present" while a transition
> > is in progress. If load_unaligned_zeropad() causes a stray reference, a
> > normal page fault is generated instead of #VC or #VE, and the
> > page-fault-based fixup handlers for load_unaligned_zeropad() resolve the
> > reference. When the encrypted/decrypted transition is complete, mark the
> > pages as "present" again.
> 
> Change looks good to me. But I am wondering why are adding it part of
> prepare and finish callbacks instead of directly in set_memory_encrypted() function.
> 

The prepare/finish callbacks are different for TDX, SEV-SNP, and
Hyper-V CoCo guests running with a paravisor -- so there are three sets
of callbacks.  As described in the cover letter, I've given up on using this
scheme for the TDX and SEV-SNP cases, because of the difficulty with
the SEV-SNP callbacks needing a valid virtual address (whereas TDX and
Hyper-V paravisor need only a physical address).  So it seems like the
callbacks specific to the Hyper-V paravisor are the natural place for the
code.  That leaves the TDX and SEV-SNP code paths unchanged, which
was my intent.

Or maybe I'm not understanding your comment?  If that's the case,
please elaborate.

Michael

> Reviewed-by: Kuppuswamy Sathyanarayanan
> <sathyanarayanan.kuppuswamy@linux.intel.com>
>
Kuppuswamy Sathyanarayanan Jan. 8, 2024, 7:24 p.m. UTC | #3
On 1/8/2024 11:13 AM, Michael Kelley wrote:
> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Sent: Monday, January 8, 2024 10:37 AM
>>
>> On 1/5/2024 10:30 AM, mhkelley58@gmail.com wrote:
>>> From: Michael Kelley <mhklinux@outlook.com>
>>>
>>> In a CoCo VM, when transitioning memory from encrypted to decrypted, or
>>> vice versa, the caller of set_memory_encrypted() or set_memory_decrypted()
>>> is responsible for ensuring the memory isn't in use and isn't referenced
>>> while the transition is in progress.  The transition has multiple steps,
>>> and the memory is in an inconsistent state until all steps are complete.
>>> A reference while the state is inconsistent could result in an exception
>>> that can't be cleanly fixed up.
>>>
>>> However, the kernel load_unaligned_zeropad() mechanism could cause a stray
>>> reference that can't be prevented by the caller of set_memory_encrypted()
>>> or set_memory_decrypted(), so there's specific code to handle this case.
>>> But a CoCo VM running on Hyper-V may be configured to run with a paravisor,
>>> with the #VC or #VE exception routed to the paravisor. There's no
>>> architectural way to forward the exceptions back to the guest kernel, and
>>> in such a case, the load_unaligned_zeropad() specific code doesn't work.
>>>
>>> To avoid this problem, mark pages as "not present" while a transition
>>> is in progress. If load_unaligned_zeropad() causes a stray reference, a
>>> normal page fault is generated instead of #VC or #VE, and the
>>> page-fault-based fixup handlers for load_unaligned_zeropad() resolve the
>>> reference. When the encrypted/decrypted transition is complete, mark the
>>> pages as "present" again.
>>
>> Change looks good to me. But I am wondering why are adding it part of
>> prepare and finish callbacks instead of directly in set_memory_encrypted() function.
>>
> 
> The prepare/finish callbacks are different for TDX, SEV-SNP, and
> Hyper-V CoCo guests running with a paravisor -- so there are three sets
> of callbacks.  As described in the cover letter, I've given up on using this
> scheme for the TDX and SEV-SNP cases, because of the difficulty with
> the SEV-SNP callbacks needing a valid virtual address (whereas TDX and
> Hyper-V paravisor need only a physical address).  So it seems like the
> callbacks specific to the Hyper-V paravisor are the natural place for the
> code.  That leaves the TDX and SEV-SNP code paths unchanged, which
> was my intent.
> 

Got it. Thanks for clarifying it.

> Or maybe I'm not understanding your comment?  If that's the case,
> please elaborate.
> 
> Michael
> 
>> Reviewed-by: Kuppuswamy Sathyanarayanan
>> <sathyanarayanan.kuppuswamy@linux.intel.com>
>>
Edgecombe, Rick P Jan. 12, 2024, 12:26 a.m. UTC | #4
On Fri, 2024-01-05 at 10:30 -0800, mhkelley58@gmail.com wrote:
>   * hv_vtom_set_host_visibility - Set specified memory visible to
> host.
>   *
> @@ -521,7 +547,7 @@ static bool hv_vtom_set_host_visibility(unsigned
> long kbuffer, int pagecount, bo
>  
>         pfn_array = kmalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
>         if (!pfn_array)
> -               return false;
> +               goto err_set_memory_p;
>  

If kmalloc() fails here, and set_memory_p() succeeds below, then
hv_vtom_set_host_visibility() returns true, but skips all the
hv_mark_gpa_visibility() work. Shouldn't it return false?
Michael Kelley Jan. 12, 2024, 3:19 a.m. UTC | #5
From: Edgecombe, Rick P <rick.p.edgecombe@intel.com> Sent: Thursday, January 11, 2024 4:27 PM
> 
> On Fri, 2024-01-05 at 10:30 -0800, mhkelley58@gmail.com wrote:
> >   * hv_vtom_set_host_visibility - Set specified memory visible to host.
> >   *
> > @@ -521,7 +547,7 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
> >
> >         pfn_array = kmalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
> >         if (!pfn_array)
> > -               return false;
> > +               goto err_set_memory_p;
> >
> 
> If kmalloc() fails here, and set_memory_p() succeeds below, then
> hv_vtom_set_host_visibility() returns true, but skips all the
> hv_mark_gpa_visibility() work. Shouldn't it return false?

Ooops.  Yes.  If the kmalloc() fails, need to set "result" to false before
doing the goto.  Will fix in the next version.

Michael
diff mbox series

Patch

diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 8ba18635e338..5ad39256a5d2 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -15,6 +15,7 @@ 
 #include <asm/io.h>
 #include <asm/coco.h>
 #include <asm/mem_encrypt.h>
+#include <asm/set_memory.h>
 #include <asm/mshyperv.h>
 #include <asm/hypervisor.h>
 #include <asm/mtrr.h>
@@ -502,6 +503,31 @@  static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
 		return -EFAULT;
 }
 
+/*
+ * When transitioning memory between encrypted and decrypted, the caller
+ * of set_memory_encrypted() or set_memory_decrypted() is responsible for
+ * ensuring that the memory isn't in use and isn't referenced while the
+ * transition is in progress.  The transition has multiple steps, and the
+ * memory is in an inconsistent state until all steps are complete. A
+ * reference while the state is inconsistent could result in an exception
+ * that can't be cleanly fixed up.
+ *
+ * But the Linux kernel load_unaligned_zeropad() mechanism could cause a
+ * stray reference that can't be prevented by the caller, so Linux has
+ * specific code to handle this case. But when the #VC and #VE exceptions
+ * routed to a paravisor, the specific code doesn't work. To avoid this
+ * problem, mark the pages as "not present" while the transition is in
+ * progress. If load_unaligned_zeropad() causes a stray reference, a normal
+ * page fault is generated instead of #VC or #VE, and the page-fault-based
+ * handlers for load_unaligned_zeropad() resolve the reference.  When the
+ * transition is complete, hv_vtom_set_host_visibility() marks the pages
+ * as "present" again.
+ */
+static bool hv_vtom_clear_present(unsigned long kbuffer, int pagecount, bool enc)
+{
+	return !set_memory_np(kbuffer, pagecount);
+}
+
 /*
  * hv_vtom_set_host_visibility - Set specified memory visible to host.
  *
@@ -521,7 +547,7 @@  static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
 
 	pfn_array = kmalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
 	if (!pfn_array)
-		return false;
+		goto err_set_memory_p;
 
 	for (i = 0, pfn = 0; i < pagecount; i++) {
 		/*
@@ -545,14 +571,30 @@  static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
 		}
 	}
 
- err_free_pfn_array:
+err_free_pfn_array:
 	kfree(pfn_array);
+
+err_set_memory_p:
+	/*
+	 * Set the PTE PRESENT bits again to revert what hv_vtom_clear_present()
+	 * did. Do this even if there is an error earlier in this function in
+	 * order to avoid leaving the memory range in a "broken" state. Setting
+	 * the PRESENT bits shouldn't fail, but return an error if it does.
+	 */
+	if (set_memory_p(kbuffer, pagecount))
+		result = false;
+
 	return result;
 }
 
 static bool hv_vtom_tlb_flush_required(bool private)
 {
-	return true;
+	/*
+	 * Since hv_vtom_clear_present() marks the PTEs as "not present"
+	 * and flushes the TLB, they can't be in the TLB. That makes the
+	 * flush controlled by this function redundant, so return "false".
+	 */
+	return false;
 }
 
 static bool hv_vtom_cache_flush_required(void)
@@ -615,6 +657,7 @@  void __init hv_vtom_init(void)
 	x86_platform.hyper.is_private_mmio = hv_is_private_mmio;
 	x86_platform.guest.enc_cache_flush_required = hv_vtom_cache_flush_required;
 	x86_platform.guest.enc_tlb_flush_required = hv_vtom_tlb_flush_required;
+	x86_platform.guest.enc_status_change_prepare = hv_vtom_clear_present;
 	x86_platform.guest.enc_status_change_finish = hv_vtom_set_host_visibility;
 
 	/* Set WB as the default cache mode. */