Message ID | 20240105183025.225972-4-mhklinux@outlook.com (mailing list archive)
---|---
State | New
Series | x86/hyperv: Mark CoCo VM pages not present when changing encrypted state
On 1/5/2024 10:30 AM, mhkelley58@gmail.com wrote:
> From: Michael Kelley <mhklinux@outlook.com>
>
> In a CoCo VM, when transitioning memory from encrypted to decrypted, or
> vice versa, the caller of set_memory_encrypted() or set_memory_decrypted()
> is responsible for ensuring the memory isn't in use and isn't referenced
> while the transition is in progress. The transition has multiple steps,
> and the memory is in an inconsistent state until all steps are complete.
> A reference while the state is inconsistent could result in an exception
> that can't be cleanly fixed up.
>
> However, the kernel load_unaligned_zeropad() mechanism could cause a stray
> reference that can't be prevented by the caller of set_memory_encrypted()
> or set_memory_decrypted(), so there's specific code to handle this case.
> But a CoCo VM running on Hyper-V may be configured to run with a paravisor,
> with the #VC or #VE exception routed to the paravisor. There's no
> architectural way to forward the exceptions back to the guest kernel, and
> in such a case, the load_unaligned_zeropad() specific code doesn't work.
>
> To avoid this problem, mark pages as "not present" while a transition
> is in progress. If load_unaligned_zeropad() causes a stray reference, a
> normal page fault is generated instead of #VC or #VE, and the
> page-fault-based fixup handlers for load_unaligned_zeropad() resolve the
> reference. When the encrypted/decrypted transition is complete, mark the
> pages as "present" again.

The change looks good to me. But I am wondering why you are adding it as
part of the prepare and finish callbacks instead of directly in the
set_memory_encrypted() function.

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

>
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>
> ---
>  arch/x86/hyperv/ivm.c | 49 ++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 46 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index 8ba18635e338..5ad39256a5d2 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -15,6 +15,7 @@
>  #include <asm/io.h>
>  #include <asm/coco.h>
>  #include <asm/mem_encrypt.h>
> +#include <asm/set_memory.h>
>  #include <asm/mshyperv.h>
>  #include <asm/hypervisor.h>
>  #include <asm/mtrr.h>
> @@ -502,6 +503,31 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
>  	return -EFAULT;
>  }
>
> +/*
> + * When transitioning memory between encrypted and decrypted, the caller
> + * of set_memory_encrypted() or set_memory_decrypted() is responsible for
> + * ensuring that the memory isn't in use and isn't referenced while the
> + * transition is in progress. The transition has multiple steps, and the
> + * memory is in an inconsistent state until all steps are complete. A
> + * reference while the state is inconsistent could result in an exception
> + * that can't be cleanly fixed up.
> + *
> + * But the Linux kernel load_unaligned_zeropad() mechanism could cause a
> + * stray reference that can't be prevented by the caller, so Linux has
> + * specific code to handle this case. But when the #VC and #VE exceptions
> + * are routed to a paravisor, the specific code doesn't work. To avoid this
> + * problem, mark the pages as "not present" while the transition is in
> + * progress. If load_unaligned_zeropad() causes a stray reference, a normal
> + * page fault is generated instead of #VC or #VE, and the page-fault-based
> + * handlers for load_unaligned_zeropad() resolve the reference. When the
> + * transition is complete, hv_vtom_set_host_visibility() marks the pages
> + * as "present" again.
> + */
> +static bool hv_vtom_clear_present(unsigned long kbuffer, int pagecount, bool enc)
> +{
> +	return !set_memory_np(kbuffer, pagecount);
> +}
> +
>  /*
>   * hv_vtom_set_host_visibility - Set specified memory visible to host.
>   *
> @@ -521,7 +547,7 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
>
>  	pfn_array = kmalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
>  	if (!pfn_array)
> -		return false;
> +		goto err_set_memory_p;
>
>  	for (i = 0, pfn = 0; i < pagecount; i++) {
>  		/*
> @@ -545,14 +571,30 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
>  		}
>  	}
>
> - err_free_pfn_array:
> +err_free_pfn_array:
>  	kfree(pfn_array);
> +
> +err_set_memory_p:
> +	/*
> +	 * Set the PTE PRESENT bits again to revert what hv_vtom_clear_present()
> +	 * did. Do this even if there is an error earlier in this function in
> +	 * order to avoid leaving the memory range in a "broken" state. Setting
> +	 * the PRESENT bits shouldn't fail, but return an error if it does.
> +	 */
> +	if (set_memory_p(kbuffer, pagecount))
> +		result = false;
> +
>  	return result;
>  }
>
>  static bool hv_vtom_tlb_flush_required(bool private)
>  {
> -	return true;
> +	/*
> +	 * Since hv_vtom_clear_present() marks the PTEs as "not present"
> +	 * and flushes the TLB, they can't be in the TLB. That makes the
> +	 * flush controlled by this function redundant, so return "false".
> +	 */
> +	return false;
>  }
>
>  static bool hv_vtom_cache_flush_required(void)
> @@ -615,6 +657,7 @@ void __init hv_vtom_init(void)
>  	x86_platform.hyper.is_private_mmio = hv_is_private_mmio;
>  	x86_platform.guest.enc_cache_flush_required = hv_vtom_cache_flush_required;
>  	x86_platform.guest.enc_tlb_flush_required = hv_vtom_tlb_flush_required;
> +	x86_platform.guest.enc_status_change_prepare = hv_vtom_clear_present;
>  	x86_platform.guest.enc_status_change_finish = hv_vtom_set_host_visibility;
>
>  	/* Set WB as the default cache mode. */
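For readers following the thread: load_unaligned_zeropad() is the kernel's word-at-a-time primitive that deliberately reads a full word even when that read may spill past the end of a buffer into the next page, relying on an exception-table fixup to zero-pad the result if the extra bytes fault. A hypothetical userspace-style illustration of the access pattern (not the kernel implementation; the function name is invented for the example):

/*
 * Illustration only, assuming a little-endian machine such as x86.
 * A word-at-a-time scan reads 8 bytes per iteration, so a string that
 * ends 3 bytes before a page boundary still causes an 8-byte load that
 * touches the first 5 bytes of the *next* page. If that next page is
 * mid-transition between encrypted and decrypted, this is exactly the
 * stray reference the patch converts into an ordinary page fault.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static size_t word_at_a_time_strlen(const char *s)
{
	size_t len = 0;

	for (;;) {
		uint64_t word;

		/* May read past the terminating NUL, into the next page. */
		memcpy(&word, s + len, sizeof(word));
		for (int i = 0; i < 8; i++, len++)
			if (((word >> (i * 8)) & 0xff) == 0)
				return len;
	}
}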
From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Sent: Monday, January 8, 2024 10:37 AM
>
> On 1/5/2024 10:30 AM, mhkelley58@gmail.com wrote:
> > From: Michael Kelley <mhklinux@outlook.com>
> >
> > In a CoCo VM, when transitioning memory from encrypted to decrypted, or
> > vice versa, the caller of set_memory_encrypted() or set_memory_decrypted()
> > is responsible for ensuring the memory isn't in use and isn't referenced
> > while the transition is in progress. The transition has multiple steps,
> > and the memory is in an inconsistent state until all steps are complete.
> > A reference while the state is inconsistent could result in an exception
> > that can't be cleanly fixed up.
> >
> > However, the kernel load_unaligned_zeropad() mechanism could cause a stray
> > reference that can't be prevented by the caller of set_memory_encrypted()
> > or set_memory_decrypted(), so there's specific code to handle this case.
> > But a CoCo VM running on Hyper-V may be configured to run with a paravisor,
> > with the #VC or #VE exception routed to the paravisor. There's no
> > architectural way to forward the exceptions back to the guest kernel, and
> > in such a case, the load_unaligned_zeropad() specific code doesn't work.
> >
> > To avoid this problem, mark pages as "not present" while a transition
> > is in progress. If load_unaligned_zeropad() causes a stray reference, a
> > normal page fault is generated instead of #VC or #VE, and the
> > page-fault-based fixup handlers for load_unaligned_zeropad() resolve the
> > reference. When the encrypted/decrypted transition is complete, mark the
> > pages as "present" again.
>
> The change looks good to me. But I am wondering why you are adding it as
> part of the prepare and finish callbacks instead of directly in the
> set_memory_encrypted() function.
>

The prepare/finish callbacks are different for TDX, SEV-SNP, and
Hyper-V CoCo guests running with a paravisor -- so there are three sets
of callbacks. As described in the cover letter, I've given up on using this
scheme for the TDX and SEV-SNP cases, because of the difficulty with
the SEV-SNP callbacks needing a valid virtual address (whereas TDX and
Hyper-V paravisor need only a physical address). So it seems like the
callbacks specific to the Hyper-V paravisor are the natural place for the
code. That leaves the TDX and SEV-SNP code paths unchanged, which
was my intent.

Or maybe I'm not understanding your comment? If that's the case, please
elaborate.

Michael

> Reviewed-by: Kuppuswamy Sathyanarayanan
> <sathyanarayanan.kuppuswamy@linux.intel.com>
>
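To make the dispatch being discussed concrete, here is a condensed sketch of how set_memory_encrypted()/set_memory_decrypted() reach these hooks, loosely based on __set_memory_enc_pgtable() in arch/x86/mm/pat/set_memory.c around this kernel version (simplified and abridged, not the verbatim upstream code):

/* Condensed sketch -- locking, error unwinding, and details omitted. */
static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
{
	/*
	 * The Hyper-V paravisor installs hv_vtom_clear_present() here,
	 * which marks the pages "not present" and flushes the TLB.
	 */
	if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
		return -EIO;

	/* ... flush caches if required, update the PTE encryption
	 * attributes, flush the TLB if required ... */

	/*
	 * The Hyper-V paravisor installs hv_vtom_set_host_visibility()
	 * here, which updates host visibility and marks the pages
	 * "present" again.
	 */
	if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
		return -EIO;

	return 0;
}

TDX and SEV-SNP install their own prepare/finish pairs, which is why this per-platform seam, rather than set_memory_encrypted() itself, is the natural home for paravisor-specific page handling.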
On 1/8/2024 11:13 AM, Michael Kelley wrote:
> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Sent: Monday, January 8, 2024 10:37 AM
>>
>> On 1/5/2024 10:30 AM, mhkelley58@gmail.com wrote:
>>> From: Michael Kelley <mhklinux@outlook.com>
>>>
>>> In a CoCo VM, when transitioning memory from encrypted to decrypted, or
>>> vice versa, the caller of set_memory_encrypted() or set_memory_decrypted()
>>> is responsible for ensuring the memory isn't in use and isn't referenced
>>> while the transition is in progress. The transition has multiple steps,
>>> and the memory is in an inconsistent state until all steps are complete.
>>> A reference while the state is inconsistent could result in an exception
>>> that can't be cleanly fixed up.
>>>
>>> However, the kernel load_unaligned_zeropad() mechanism could cause a stray
>>> reference that can't be prevented by the caller of set_memory_encrypted()
>>> or set_memory_decrypted(), so there's specific code to handle this case.
>>> But a CoCo VM running on Hyper-V may be configured to run with a paravisor,
>>> with the #VC or #VE exception routed to the paravisor. There's no
>>> architectural way to forward the exceptions back to the guest kernel, and
>>> in such a case, the load_unaligned_zeropad() specific code doesn't work.
>>>
>>> To avoid this problem, mark pages as "not present" while a transition
>>> is in progress. If load_unaligned_zeropad() causes a stray reference, a
>>> normal page fault is generated instead of #VC or #VE, and the
>>> page-fault-based fixup handlers for load_unaligned_zeropad() resolve the
>>> reference. When the encrypted/decrypted transition is complete, mark the
>>> pages as "present" again.
>>
>> The change looks good to me. But I am wondering why you are adding it as
>> part of the prepare and finish callbacks instead of directly in the
>> set_memory_encrypted() function.
>>
>
> The prepare/finish callbacks are different for TDX, SEV-SNP, and
> Hyper-V CoCo guests running with a paravisor -- so there are three sets
> of callbacks. As described in the cover letter, I've given up on using this
> scheme for the TDX and SEV-SNP cases, because of the difficulty with
> the SEV-SNP callbacks needing a valid virtual address (whereas TDX and
> Hyper-V paravisor need only a physical address). So it seems like the
> callbacks specific to the Hyper-V paravisor are the natural place for the
> code. That leaves the TDX and SEV-SNP code paths unchanged, which
> was my intent.
>

Got it. Thanks for clarifying it.

> Or maybe I'm not understanding your comment? If that's the case, please
> elaborate.
>
> Michael
>
>> Reviewed-by: Kuppuswamy Sathyanarayanan
>> <sathyanarayanan.kuppuswamy@linux.intel.com>
>>
On Fri, 2024-01-05 at 10:30 -0800, mhkelley58@gmail.com wrote:
>  * hv_vtom_set_host_visibility - Set specified memory visible to
> host.
>  *
> @@ -521,7 +547,7 @@ static bool hv_vtom_set_host_visibility(unsigned
> long kbuffer, int pagecount, bo
>
> 	pfn_array = kmalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
> 	if (!pfn_array)
> -		return false;
> +		goto err_set_memory_p;
>

If kmalloc() fails here, and set_memory_p() succeeds below, then
hv_vtom_set_host_visibility() returns true, but skips all the
hv_mark_gpa_visibility() work. Shouldn't it return false?
From: Edgecombe, Rick P <rick.p.edgecombe@intel.com>
Sent: Thursday, January 11, 2024 4:27 PM
>
> On Fri, 2024-01-05 at 10:30 -0800, mhkelley58@gmail.com wrote:
> >  * hv_vtom_set_host_visibility - Set specified memory visible to host.
> >  *
> > @@ -521,7 +547,7 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
> >
> > 	pfn_array = kmalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
> > 	if (!pfn_array)
> > -		return false;
> > +		goto err_set_memory_p;
> >
>
> If kmalloc() fails here, and set_memory_p() succeeds below, then
> hv_vtom_set_host_visibility() returns true, but skips all the
> hv_mark_gpa_visibility() work. Shouldn't it return false?

Ooops. Yes. If the kmalloc() fails, we need to set "result" to false
before doing the goto. Will fix in the next version.

Michael
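A sketch of the likely shape of that fix (illustrative; the actual next version of the patch may differ):

	pfn_array = kmalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
	if (!pfn_array) {
		/* Fail the operation, but still fall through to
		 * set_memory_p() so the pages don't stay "not present". */
		result = false;
		goto err_set_memory_p;
	}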
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 8ba18635e338..5ad39256a5d2 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -15,6 +15,7 @@
 #include <asm/io.h>
 #include <asm/coco.h>
 #include <asm/mem_encrypt.h>
+#include <asm/set_memory.h>
 #include <asm/mshyperv.h>
 #include <asm/hypervisor.h>
 #include <asm/mtrr.h>
@@ -502,6 +503,31 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
 	return -EFAULT;
 }

+/*
+ * When transitioning memory between encrypted and decrypted, the caller
+ * of set_memory_encrypted() or set_memory_decrypted() is responsible for
+ * ensuring that the memory isn't in use and isn't referenced while the
+ * transition is in progress. The transition has multiple steps, and the
+ * memory is in an inconsistent state until all steps are complete. A
+ * reference while the state is inconsistent could result in an exception
+ * that can't be cleanly fixed up.
+ *
+ * But the Linux kernel load_unaligned_zeropad() mechanism could cause a
+ * stray reference that can't be prevented by the caller, so Linux has
+ * specific code to handle this case. But when the #VC and #VE exceptions
+ * are routed to a paravisor, the specific code doesn't work. To avoid this
+ * problem, mark the pages as "not present" while the transition is in
+ * progress. If load_unaligned_zeropad() causes a stray reference, a normal
+ * page fault is generated instead of #VC or #VE, and the page-fault-based
+ * handlers for load_unaligned_zeropad() resolve the reference. When the
+ * transition is complete, hv_vtom_set_host_visibility() marks the pages
+ * as "present" again.
+ */
+static bool hv_vtom_clear_present(unsigned long kbuffer, int pagecount, bool enc)
+{
+	return !set_memory_np(kbuffer, pagecount);
+}
+
 /*
  * hv_vtom_set_host_visibility - Set specified memory visible to host.
  *
@@ -521,7 +547,7 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo

 	pfn_array = kmalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
 	if (!pfn_array)
-		return false;
+		goto err_set_memory_p;

 	for (i = 0, pfn = 0; i < pagecount; i++) {
 		/*
@@ -545,14 +571,30 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
 		}
 	}

- err_free_pfn_array:
+err_free_pfn_array:
 	kfree(pfn_array);
+
+err_set_memory_p:
+	/*
+	 * Set the PTE PRESENT bits again to revert what hv_vtom_clear_present()
+	 * did. Do this even if there is an error earlier in this function in
+	 * order to avoid leaving the memory range in a "broken" state. Setting
+	 * the PRESENT bits shouldn't fail, but return an error if it does.
+	 */
+	if (set_memory_p(kbuffer, pagecount))
+		result = false;
+
 	return result;
 }

 static bool hv_vtom_tlb_flush_required(bool private)
 {
-	return true;
+	/*
+	 * Since hv_vtom_clear_present() marks the PTEs as "not present"
+	 * and flushes the TLB, they can't be in the TLB. That makes the
+	 * flush controlled by this function redundant, so return "false".
+	 */
+	return false;
 }

 static bool hv_vtom_cache_flush_required(void)
@@ -615,6 +657,7 @@ void __init hv_vtom_init(void)
 	x86_platform.hyper.is_private_mmio = hv_is_private_mmio;
 	x86_platform.guest.enc_cache_flush_required = hv_vtom_cache_flush_required;
 	x86_platform.guest.enc_tlb_flush_required = hv_vtom_tlb_flush_required;
+	x86_platform.guest.enc_status_change_prepare = hv_vtom_clear_present;
 	x86_platform.guest.enc_status_change_finish = hv_vtom_set_host_visibility;

 	/* Set WB as the default cache mode. */
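For a sense of when these hooks fire, a hypothetical caller (illustrative only; the buffer, "order", and error policy are invented for the example):

	/*
	 * On a CoCo VM with a paravisor, set_memory_decrypted() invokes
	 * enc_status_change_prepare (hv_vtom_clear_present) before the
	 * PTE changes and enc_status_change_finish
	 * (hv_vtom_set_host_visibility) after them.
	 */
	unsigned long buf = __get_free_pages(GFP_KERNEL, order);

	if (!buf)
		return -ENOMEM;

	if (set_memory_decrypted(buf, 1 << order)) {
		/* Encryption state is now uncertain; leaking the pages is
		 * safer than returning them to the allocator. */
		return -EIO;
	}
	/* ... share buf with the host; set_memory_encrypted() on teardown ... */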