
[v2,4/8] x86/sev: Enable PVALIDATE for PFNs without a valid virtual address

Message ID 20231121212016.1154303-5-mhklinux@outlook.com
State New
Series x86/coco: Mark CoCo VM pages not present when changing encrypted state

Commit Message

Michael Kelley Nov. 21, 2023, 9:20 p.m. UTC
From: Michael Kelley <mhklinux@outlook.com>

For SEV-SNP, the PVALIDATE instruction requires a valid virtual
address, which it translates to the PFN that it operates on.  Per
the spec, the translation is done as if for a single-byte read.
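
For reference, the kernel's pvalidate() wrapper passes the virtual
address in RAX, so the hardware walks whatever page tables currently
map that address.  Roughly as in arch/x86/include/asm/sev.h (shown
only for context; not part of this patch):

	static inline int pvalidate(unsigned long vaddr, bool rmp_psize, bool validate)
	{
		bool no_rmpupdate;
		int rc;

		/* "pvalidate" mnemonic needs binutils 2.36+, so use the raw opcode */
		asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFF\n\t"
			     CC_SET(c)
			     : CC_OUT(c) (no_rmpupdate), "=a"(rc)
			     : "a"(vaddr), "c"(rmp_psize), "d"(validate)
			     : "memory", "cc");

		if (no_rmpupdate)
			return PVALIDATE_FAIL_NOUPDATE;

		return rc;
	}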

In transitioning a page between encrypted and decrypted, the direct
map virtual address of the page may be temporarily marked invalid
(i.e., PRESENT is cleared in the PTE) to prevent interference from
load_unaligned_zeropad(). In such a case, the PVALIDATE that is
required for the encrypted<->decrypted transition fails due to
an invalid virtual address.

Fix this by providing a temporary virtual address that is mapped to the
target PFN just before executing PVALIDATE. Have PVALIDATE use this
temp virtual address instead of the direct map virtual address. Unmap
the temp virtual address after PVALIDATE completes.

The temp virtual address must be aligned on a 2 MB boundary
to meet PVALIDATE requirements for operating on 2 MB large pages,
though the temp mapping itself need only be a 4K mapping.  Also, the
temp virtual address must be preceded by a 4K invalid page so it
can't be accessed by load_unaligned_zeropad().
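
A minimal sketch of the reservation (matching the snp_set_memory_shared()
hunk below): reserving PTRS_PER_PMD + 1 pages of vmalloc address space
guarantees that a 2 MB aligned address, preceded by at least one unmapped
guard page, fits within the area:

	struct vm_struct *area;
	unsigned long temp_vaddr;

	/* 2 MB + 4K of address space: room for 2 MB alignment plus a guard page */
	area = get_vm_area(PAGE_SIZE * (PTRS_PER_PMD + 1), 0);
	temp_vaddr = ALIGN((unsigned long)(area->addr + PAGE_SIZE), PMD_SIZE);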

This mechanism is used only for pages transitioning between encrypted
and decrypted.  When PVALIDATE is done for initial page acceptance,
a temp virtual address is not provided, and PVALIDATE uses the
direct map virtual address.

Signed-off-by: Michael Kelley <mhklinux@outlook.com>
---
 arch/x86/boot/compressed/sev.c |  2 +-
 arch/x86/kernel/sev-shared.c   | 57 +++++++++++++++++++++++++++-------
 arch/x86/kernel/sev.c          | 32 ++++++++++++-------
 3 files changed, 67 insertions(+), 24 deletions(-)

Comments

Edgecombe, Rick P Nov. 27, 2023, 9:38 p.m. UTC | #1
On Tue, 2023-11-21 at 13:20 -0800, mhkelley58@gmail.com wrote:
> +static int pvalidate_pfn(unsigned long vaddr, unsigned int size,
> +                        unsigned long pfn, bool validate, int *rc2)
> +{
> +       int rc;
> +       struct page *page = pfn_to_page(pfn);
> +
> +       *rc2 = vmap_pages_range(vaddr, vaddr + PAGE_SIZE,
> +                       PAGE_KERNEL, &page, PAGE_SHIFT);

Can't this fail and then the pvalidate below would encounter trouble?

Sort of separately, if those vmalloc objections can't be worked
through, did you consider doing something like text_poke() does (create
the temporary mapping in a temporary MM) for pvalidate purposes? I
don't know enough about what kind of special exceptions might popup
during that operation though, might be playing with fire...

> +       rc = pvalidate(vaddr, size, validate);
> +       vunmap_range(vaddr, vaddr + PAGE_SIZE);
> +
> +       return rc;
> +}
Michael Kelley Nov. 28, 2023, 6:08 p.m. UTC | #2
From: Edgecombe, Rick P <rick.p.edgecombe@intel.com> Sent: Monday, November 27, 2023 1:39 PM
> 
> On Tue, 2023-11-21 at 13:20 -0800, mhkelley58@gmail.com wrote:
> > +static int pvalidate_pfn(unsigned long vaddr, unsigned int size,
> > +                        unsigned long pfn, bool validate, int *rc2)
> > +{
> > +       int rc;
> > +       struct page *page = pfn_to_page(pfn);
> > +
> > +       *rc2 = vmap_pages_range(vaddr, vaddr + PAGE_SIZE,
> > +                       PAGE_KERNEL, &page, PAGE_SHIFT);
> 
> Can't this fail and then the pvalidate below would encounter trouble?

Yes.  My intent was to just let PVALIDATE fail because it operates
on a vaddr that's invalid.  But that's worth a comment.
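
Something like this, perhaps (comment wording is just a sketch):

	*rc2 = vmap_pages_range(vaddr, vaddr + PAGE_SIZE,
			PAGE_KERNEL, &page, PAGE_SHIFT);
	/*
	 * If vmap_pages_range() failed, vaddr is left unmapped and the
	 * PVALIDATE below fails on the invalid address; *rc2 preserves
	 * the mapping error for the WARN in pvalidate_pages().
	 */
	rc = pvalidate(vaddr, size, validate);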

> 
> Sort of separately, if those vmalloc objections can't be worked
> through, did you consider doing something like text_poke() does (create
> the temporary mapping in a temporary MM) for pvalidate purposes? I
> don't know enough about what kind of special exceptions might popup
> during that operation though, might be playing with fire...

Interesting idea.  But from a quick glance at the text_poke() code,
such an approach seems somewhat complex, and I suspect it will have 
the same perf issues (or worse) as creating a new vmalloc area for
each PVALIDATE invocation.

At this point, the complexity of creating the temp mapping for
PVALIDATE has become excessive.  On balance it seems simpler to
revert to an approach where the use of set_memory_np() and
set_memory_p() is conditional.  It would be necessary when #VC
and #VE exceptions are directed to a paravisor.  (This assumes the
paravisor interface in the hypervisor callbacks does the natural thing
of working with physical addresses, so there's no need for a temp
mapping.)

Optionally, the set_memory_np()/set_memory_p() approach could
be used in other cases where the hypervisor callbacks work with
physical addresses.  But it can't be used with cases where the hypervisor
callbacks need valid virtual addresses.

So on net, set_memory_np()/set_memory_p() would be used in
the Hyper-V cases of TDX and SEV-SNP with a paravisor.   It could
optionally be used with TDX with no paravisor, but my sense is
that Kirill wants to keep TDX "as is" and let the exception handlers
do the load_unaligned_zeropad() fixup.

It could not be used with SEV-SNP with no paravisor.   Additional fixes
may be needed on the SEV-SNP side to properly fixup
load_unaligned_zeropad() accesses to a page that's in transition
between encrypted and decrypted.
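
As a rough sketch of the conditional approach (CC_ATTR_PARAVISOR is a
made-up attribute name, just for illustration):

	static void set_pages_shared(unsigned long vaddr, unsigned long npages)
	{
		/*
		 * Hypothetical attribute: true when a paravisor handles
		 * #VC/#VE and the hypervisor callbacks take physical
		 * addresses.
		 */
		bool phys_callbacks = cc_platform_has(CC_ATTR_PARAVISOR);

		if (phys_callbacks)
			set_memory_np(vaddr, npages);	/* block load_unaligned_zeropad() */

		notify_hypervisor_shared(vaddr, npages);	/* hypothetical callback */

		if (phys_callbacks)
			set_memory_p(vaddr, npages);
	}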

I'll work on a patch series that takes the conditional approach.

Michael

Edgecombe, Rick P Nov. 28, 2023, 6:59 p.m. UTC | #3
On Tue, 2023-11-28 at 18:08 +0000, Michael Kelley wrote:
> > 
> > Sort of separately, if those vmalloc objections can't be worked
> > through, did you consider doing something like text_poke() does
> > (create
> > the temporary mapping in a temporary MM) for pvalidate purposes? I
> > don't know enough about what kind of special exceptions might popup
> > during that operation though, might be playing with fire...
> 
> Interesting idea.  But from a quick glance at the text_poke() code,
> such an approach seems somewhat complex, and I suspect it will have 
> the same perf issues (or worse) as creating a new vmalloc area for
> each PVALIDATE invocation.

Using new vmalloc areas will eventually result in a kernel-wide TLB
shootdown, but most individual unmaps trigger no flush. text_poke()
always does a local-only flush, so whatever slowdown there is would
only affect the calling thread.

As for complexity, I think it might actually be simple to implement.
I'm not so sure what kind of special exceptions could come out of
pvalidate, but the kernel terminates the VM on failure anyway, so
maybe it's not an issue?
 
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 73be3931e4f0..a13293564eeb 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1905,6 +1905,16 @@ void *text_poke(void *addr, const void *opcode, size_t len)
        return __text_poke(text_poke_memcpy, addr, opcode, len);
 }
 
+static void text_poke_pvalidate(void *dst, const void *src, size_t len)
+{
+       pvalidate((unsigned long)dst, RMP_PG_SIZE_4K, true); /* if fail, terminate */
+}
+
+void *pvalidated_poke(void *addr)
+{
+       return __text_poke(text_poke_pvalidate, addr, NULL, PAGE_SIZE);
+}
+
 /**
  * text_poke_kgdb - Update instructions on a live kernel by kgdb
  * @addr: address to modify



> 
> At this point, the complexity of creating the temp mapping for
> PVALIDATE is seeming excessive.  On balance it seems simpler to
> revert to an approach where the use of set_memory_np() and
> set_memory_p() is conditional.  It would be necessary when #VC
> and #VE exceptions are directed to a paravisor.  (This assumes the
> paravisor interface in the hypervisor callbacks does the natural
> thing
> of working with physical addresses, so there's no need for a temp
> mapping.)
> 
> Optionally, the set_memory_np()/set_memory_p() approach could
> be used in other cases where the hypervisor callbacks work with
> physical addresses.  But it can't be used with cases where the
> hypervisor
> callbacks need valid virtual addresses.
> 
> So on net, set_memory_np()/set_memory_p() would be used in
> the Hyper-V cases of TDX and SEV-SNP with a paravisor.   It could
> optionally be used with TDX with no paravisor, but my sense is
> that Kirill wants to keep TDX "as is" and let the exception handlers
> do the load_unaligned_zeropad() fixup.
> 
> It could not be used with SEV-SNP with no paravisor.   Additional
> fixes
> may be needed on the SEV-SNP side to properly fixup
> load_unaligned_zeropad() accesses to a page that's in transition
> between encrypted and decrypted.
> 

Yea, I don't know about this paravisor/exception stuff.
Michael Kelley Dec. 12, 2023, 6:35 p.m. UTC | #4
From: Edgecombe, Rick P <rick.p.edgecombe@intel.com> Sent: Tuesday, November 28, 2023 10:59 AM
> 
> On Tue, 2023-11-28 at 18:08 +0000, Michael Kelley wrote:
> > >
> > > Sort of separately, if those vmalloc objections can't be worked
> > > through, did you consider doing something like text_poke() does
> > > (create
> > > the temporary mapping in a temporary MM) for pvalidate purposes? I
> > > don't know enough about what kind of special exceptions might popup
> > > during that operation though, might be playing with fire...
> >
> > Interesting idea.  But from a quick glance at the text_poke() code,
> > such an approach seems somewhat complex, and I suspect it will have
> > the same perf issues (or worse) as creating a new vmalloc area for
> > each PVALIDATE invocation.
> 
> Using new vmalloc area's will eventually result in a kernel shootdown,
> but usually have no flushes. text_poke will always result in a local-
> only flush. So at least whatever slowdown there is would only affect
> the calling thread.
> 
> As for complexity, I think it might be simple to implement actually.
> What kind of special exceptions could come out of pvalidate, I'm not so
> sure. But the kernel terminates the VM on failure anyway, so maybe it's
> not an issue?

Sorry for the delay in getting back to this topic.

OK, I see now what you are suggesting. For each page that needs to
be PVALIDATE'd, use __text_poke() to create the temp mapping and
run PVALIDATE.   However, there are some problems.  __text_poke()
runs vmalloc_to_page() for addresses that aren't core kernel text,
and vmalloc_to_page() will fail if the PTE "present" bit has been
cleared.  That could be easily addressed by changing it to use
slow_virt_to_phys().  But PVALIDATE also needs to be able to return
the PVALIDATE_FAIL_SIZEMISMATCH error code, which is tested for
in pvalidate_pages() and does not terminate the VM.  __text_poke()
doesn't have the machinery to return such an error code, and
that's harder to fix.
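
The vmalloc_to_page() part would be a one-line change in __text_poke(),
roughly (untested):

-	pages[0] = vmalloc_to_page(addr);
+	pages[0] = pfn_to_page(PHYS_PFN(slow_virt_to_phys(addr)));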

There's also a conceptual issue.  The PVALIDATE use case isn't
working with a text area, so that case would be abusing "text_poke"
a bit.  I could imagine __text_poke() having code to verify that it's
working on a text area, even if that code isn't there now.

To get a sense of performance, I hacked the equivalent of text_poke()
to work with PVALIDATE.  The biggest case of transitioning pages from
encrypted to decrypted is the swiotlb area, which is 1 Gbyte for a
VM with 16 Gbytes or more of memory.  In a Hyper-V CoCo VM, current
code takes about 270 milliseconds to transition that 1 Gbyte swiotlb area.
With my initial approach using vmap_pages_range(), that 270 ms went to
319 ms, an increase that's negligible in the overall VM boot time.  Using the
text_poke() approach increased the time to 368 ms, which is bigger but
still probably not a show-stopper.  It's definitely faster than creating a
new vmalloc area for each page that needs to be PVALIDATE'd, which
adds about 6 seconds to the boot time.
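
To summarize the measurements (transitioning the 1 Gbyte swiotlb area
in a Hyper-V CoCo VM):

	current code (direct map vaddr)        ~270 ms
	vmap_pages_range() temp mapping        ~319 ms
	text_poke()-style temp MM              ~368 ms
	new vmalloc area per page              adds ~6 seconds to boot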

All in all, I'm back to my Plan B, which is to mark the pages "not
present" only in configurations where the hypervisor callbacks operate
on physical addresses instead of virtual addresses.   Since SEV-SNP
needs virtual addresses, it will need to handle exceptions generated
by load_unaligned_zeropad() and do the appropriate fixup.

I'll try to get my "Plan B" patch set posted in the next few days.

Michael


Patch

diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 454acd7a2daf..4d4a3fc0b725 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -224,7 +224,7 @@  static phys_addr_t __snp_accept_memory(struct snp_psc_desc *desc,
 	if (vmgexit_psc(boot_ghcb, desc))
 		sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
 
-	pvalidate_pages(desc);
+	pvalidate_pages(desc, 0);
 
 	return pa;
 }
diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c
index ccb0915e84e1..fc45fdcf3892 100644
--- a/arch/x86/kernel/sev-shared.c
+++ b/arch/x86/kernel/sev-shared.c
@@ -1071,35 +1071,70 @@  static void __init setup_cpuid_table(const struct cc_blob_sev_info *cc_info)
 	}
 }
 
-static void pvalidate_pages(struct snp_psc_desc *desc)
+#ifdef __BOOT_COMPRESSED
+static int pvalidate_pfn(unsigned long vaddr, unsigned int size,
+			 unsigned long pfn, bool validate, int *rc2)
+{
+	return 0;
+}
+#else
+static int pvalidate_pfn(unsigned long vaddr, unsigned int size,
+			 unsigned long pfn, bool validate, int *rc2)
+{
+	int rc;
+	struct page *page = pfn_to_page(pfn);
+
+	*rc2 = vmap_pages_range(vaddr, vaddr + PAGE_SIZE,
+			PAGE_KERNEL, &page, PAGE_SHIFT);
+	rc = pvalidate(vaddr, size, validate);
+	vunmap_range(vaddr, vaddr + PAGE_SIZE);
+
+	return rc;
+}
+#endif
+
+static void pvalidate_pages(struct snp_psc_desc *desc, unsigned long vaddr)
 {
 	struct psc_entry *e;
-	unsigned long vaddr;
+	unsigned long pfn;
 	unsigned int size;
 	unsigned int i;
 	bool validate;
-	int rc;
+	int rc, rc2 = 0;
 
 	for (i = 0; i <= desc->hdr.end_entry; i++) {
 		e = &desc->entries[i];
 
-		vaddr = (unsigned long)pfn_to_kaddr(e->gfn);
-		size = e->pagesize ? RMP_PG_SIZE_2M : RMP_PG_SIZE_4K;
+		size = e->pagesize;
 		validate = e->operation == SNP_PAGE_STATE_PRIVATE;
+		pfn = e->gfn;
 
-		rc = pvalidate(vaddr, size, validate);
-		if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
-			unsigned long vaddr_end = vaddr + PMD_SIZE;
+		if (vaddr) {
+			rc = pvalidate_pfn(vaddr, size, pfn, validate, &rc2);
+		} else {
+			vaddr = (unsigned long)pfn_to_kaddr(pfn);
+			rc = pvalidate(vaddr, size, validate);
+		}
 
-			for (; vaddr < vaddr_end; vaddr += PAGE_SIZE) {
-				rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
+		if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
+			unsigned long last_pfn = pfn + PTRS_PER_PMD - 1;
+
+			for (; pfn <= last_pfn; pfn++) {
+				if (vaddr) {
+					rc = pvalidate_pfn(vaddr, RMP_PG_SIZE_4K,
+							   pfn, validate, &rc2);
+				} else {
+					vaddr = (unsigned long)pfn_to_kaddr(pfn);
+					rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
+				}
 				if (rc)
 					break;
 			}
 		}
 
 		if (rc) {
-			WARN(1, "Failed to validate address 0x%lx ret %d", vaddr, rc);
+			WARN(1, "Failed to validate address 0x%lx ret %d ret2 %d",
+				vaddr, rc, rc2);
 			sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
 		}
 	}
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 7eac92c07a58..08b2e2a0d67d 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -790,7 +790,7 @@  void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op
 }
 
 static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
-				       unsigned long vaddr_end, int op)
+				       unsigned long vaddr_end, int op, unsigned long temp_vaddr)
 {
 	struct ghcb_state state;
 	bool use_large_entry;
@@ -842,7 +842,7 @@  static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long
 
 	/* Page validation must be rescinded before changing to shared */
 	if (op == SNP_PAGE_STATE_SHARED)
-		pvalidate_pages(data);
+		pvalidate_pages(data, temp_vaddr);
 
 	local_irq_save(flags);
 
@@ -862,12 +862,13 @@  static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long
 
 	/* Page validation must be performed after changing to private */
 	if (op == SNP_PAGE_STATE_PRIVATE)
-		pvalidate_pages(data);
+		pvalidate_pages(data, temp_vaddr);
 
 	return vaddr;
 }
 
-static void set_pages_state(unsigned long vaddr, unsigned long npages, int op)
+static void set_pages_state(unsigned long vaddr, unsigned long npages,
+				int op, unsigned long temp_vaddr)
 {
 	struct snp_psc_desc desc;
 	unsigned long vaddr_end;
@@ -880,23 +881,30 @@  static void set_pages_state(unsigned long vaddr, unsigned long npages, int op)
 	vaddr_end = vaddr + (npages << PAGE_SHIFT);
 
 	while (vaddr < vaddr_end)
-		vaddr = __set_pages_state(&desc, vaddr, vaddr_end, op);
+		vaddr = __set_pages_state(&desc, vaddr, vaddr_end,
+						op, temp_vaddr);
 }
 
 void snp_set_memory_shared(unsigned long vaddr, unsigned long npages)
 {
-	if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
-		return;
+	struct vm_struct *area;
+	unsigned long temp_vaddr;
 
-	set_pages_state(vaddr, npages, SNP_PAGE_STATE_SHARED);
+	area = get_vm_area(PAGE_SIZE * (PTRS_PER_PMD + 1), 0);
+	temp_vaddr = ALIGN((unsigned long)(area->addr + PAGE_SIZE), PMD_SIZE);
+	set_pages_state(vaddr, npages, SNP_PAGE_STATE_SHARED, temp_vaddr);
+	free_vm_area(area);
 }
 
 void snp_set_memory_private(unsigned long vaddr, unsigned long npages)
 {
-	if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
-		return;
+	struct vm_struct *area;
+	unsigned long temp_vaddr;
 
-	set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
+	area = get_vm_area(PAGE_SIZE * (PTRS_PER_PMD + 1), 0);
+	temp_vaddr = ALIGN((unsigned long)(area->addr + PAGE_SIZE), PMD_SIZE);
+	set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE, temp_vaddr);
+	free_vm_area(area);
 }
 
 void snp_accept_memory(phys_addr_t start, phys_addr_t end)
@@ -909,7 +917,7 @@  void snp_accept_memory(phys_addr_t start, phys_addr_t end)
 	vaddr = (unsigned long)__va(start);
 	npages = (end - start) >> PAGE_SHIFT;
 
-	set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
+	set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE, 0);
 }
 
 static int snp_set_vmsa(void *va, bool vmsa)