Message ID | 20231121212016.1154303-5-mhklinux@outlook.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | x86/coco: Mark CoCo VM pages not present when changing encrypted state | expand |
On Tue, 2023-11-21 at 13:20 -0800, mhkelley58@gmail.com wrote: > +static int pvalidate_pfn(unsigned long vaddr, unsigned int size, > + unsigned long pfn, bool validate, int *rc2) > +{ > + int rc; > + struct page *page = pfn_to_page(pfn); > + > + *rc2 = vmap_pages_range(vaddr, vaddr + PAGE_SIZE, > + PAGE_KERNEL, &page, PAGE_SHIFT); Can't this fail and then the pvalidate below would encounter trouble? Sort of separately, if those vmalloc objections can't be worked through, did you consider doing something like text_poke() does (create the temporary mapping in a temporary MM) for pvalidate purposes? I don't know enough about what kind of special exceptions might popup during that operation though, might be playing with fire... > + rc = pvalidate(vaddr, size, validate); > + vunmap_range(vaddr, vaddr + PAGE_SIZE); > + > + return rc; > +}
From: Edgecombe, Rick P <rick.p.edgecombe@intel.com> Sent: Monday, November 27, 2023 1:39 PM > > On Tue, 2023-11-21 at 13:20 -0800, mhkelley58@gmail.com wrote: > > +static int pvalidate_pfn(unsigned long vaddr, unsigned int size, > > + unsigned long pfn, bool validate, int *rc2) > > +{ > > + int rc; > > + struct page *page = pfn_to_page(pfn); > > + > > + *rc2 = vmap_pages_range(vaddr, vaddr + PAGE_SIZE, > > + PAGE_KERNEL, &page, PAGE_SHIFT); > > Can't this fail and then the pvalidate below would encounter trouble? Yes. My intent was to just let the PVALIDATE fail because of operating on a vaddr that's invalid. But that would be worth a comment. > > Sort of separately, if those vmalloc objections can't be worked > through, did you consider doing something like text_poke() does (create > the temporary mapping in a temporary MM) for pvalidate purposes? I > don't know enough about what kind of special exceptions might popup > during that operation though, might be playing with fire... Interesting idea. But from a quick glance at the text_poke() code, such an approach seems somewhat complex, and I suspect it will have the same perf issues (or worse) as creating a new vmalloc area for each PVALIDATE invocation. At this point, the complexity of creating the temp mapping for PVALIDATE is seeming excessive. On balance it seems simpler to revert to an approach where the use of set_memory_np() and set_memory_p() is conditional. It would be necessary when #VC and #VE exceptions are directed to a paravisor. (This assumes the paravisor interface in the hypervisor callbacks does the natural thing of working with physical addresses, so there's no need for a temp mapping.) Optionally, the set_memory_np()/set_memory_p() approach could be used in other cases where the hypervisor callbacks work with physical addresses. But it can't be used with cases where the hypervisor callbacks need valid virtual addresses. So on net, set_memory_np()/set_memory_p() would be used in the Hyper-V cases of TDX and SEV-SNP with a paravisor. It could optionally be used with TDX with no paravisor, but my sense is that Kirill wants to keep TDX "as is" and let the exception handlers do the load_unaligned_zeropad() fixup. It could not be used with SEV-SNP with no paravisor. Additional fixes may be needed on the SEV-SNP side to properly fixup load_unaligned_zeropad() accesses to a page that's in transition between encrypted and decrypted. I'll work on a patch series that takes the conditional approach. Michael > > > + rc = pvalidate(vaddr, size, validate); > > + vunmap_range(vaddr, vaddr + PAGE_SIZE); > > + > > + return rc; > > +}
On Tue, 2023-11-28 at 18:08 +0000, Michael Kelley wrote: > > > > Sort of separately, if those vmalloc objections can't be worked > > through, did you consider doing something like text_poke() does > > (create > > the temporary mapping in a temporary MM) for pvalidate purposes? I > > don't know enough about what kind of special exceptions might popup > > during that operation though, might be playing with fire... > > Interesting idea. But from a quick glance at the text_poke() code, > such an approach seems somewhat complex, and I suspect it will have > the same perf issues (or worse) as creating a new vmalloc area for > each PVALIDATE invocation. Using new vmalloc area's will eventually result in a kernel shootdown, but usually have no flushes. text_poke will always result in a local- only flush. So at least whatever slowdown there is would only affect the calling thread. As for complexity, I think it might be simple to implement actually. What kind of special exceptions could come out of pvalidate, I'm not so sure. But the kernel terminates the VM on failure anyway, so maybe it's not an issue? diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 73be3931e4f0..a13293564eeb 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -1905,6 +1905,16 @@ void *text_poke(void *addr, const void *opcode, size_t len) return __text_poke(text_poke_memcpy, addr, opcode, len); } +static void text_poke_pvalidate(void *dst, const void *src, size_t len) +{ + pvalidate(dst, len, true); // if fail, terminate +} + +void *pvalidated_poke(void *addr) +{ + return __text_poke(text_poke_pvalidate, addr, NULL, PAGE_SIZE); +} + /** * text_poke_kgdb - Update instructions on a live kernel by kgdb * @addr: address to modify > > At this point, the complexity of creating the temp mapping for > PVALIDATE is seeming excessive. On balance it seems simpler to > revert to an approach where the use of set_memory_np() and > set_memory_p() is conditional. It would be necessary when #VC > and #VE exceptions are directed to a paravisor. (This assumes the > paravisor interface in the hypervisor callbacks does the natural > thing > of working with physical addresses, so there's no need for a temp > mapping.) > > Optionally, the set_memory_np()/set_memory_p() approach could > be used in other cases where the hypervisor callbacks work with > physical addresses. But it can't be used with cases where the > hypervisor > callbacks need valid virtual addresses. > > So on net, set_memory_np()/set_memory_p() would be used in > the Hyper-V cases of TDX and SEV-SNP with a paravisor. It could > optionally be used with TDX with no paravisor, but my sense is > that Kirill wants to keep TDX "as is" and let the exception handlers > do the load_unaligned_zeropad() fixup. > > It could not be used with SEV-SNP with no paravisor. Additional > fixes > may be needed on the SEV-SNP side to properly fixup > load_unaligned_zeropad() accesses to a page that's in transition > between encrypted and decrypted. > Yea, I don't know about this paravisor/exception stuff.
From: Edgecombe, Rick P <rick.p.edgecombe@intel.com> Sent: Tuesday, November 28, 2023 10:59 AM > > On Tue, 2023-11-28 at 18:08 +0000, Michael Kelley wrote: > > > > > > Sort of separately, if those vmalloc objections can't be worked > > > through, did you consider doing something like text_poke() does > > > (create > > > the temporary mapping in a temporary MM) for pvalidate purposes? I > > > don't know enough about what kind of special exceptions might popup > > > during that operation though, might be playing with fire... > > > > Interesting idea. But from a quick glance at the text_poke() code, > > such an approach seems somewhat complex, and I suspect it will have > > the same perf issues (or worse) as creating a new vmalloc area for > > each PVALIDATE invocation. > > Using new vmalloc area's will eventually result in a kernel shootdown, > but usually have no flushes. text_poke will always result in a local- > only flush. So at least whatever slowdown there is would only affect > the calling thread. > > As for complexity, I think it might be simple to implement actually. > What kind of special exceptions could come out of pvalidate, I'm not so > sure. But the kernel terminates the VM on failure anyway, so maybe it's > not an issue? Sorry for the delay in getting back to this topic. OK, I see now what you are suggesting. For each page that needs to be PVALIDATE'd, use __text_poke() to create the temp mapping and run PVALIDATE. However, there are some problems. __text_poke() runs vmalloc_to_page() for addresses that aren't core kernel text, and vmalloc_to_page() will fail if the PTE "present" bit has been cleared. That could be easily addressed by changing it to use slow_virt_to_phys(). But PVALIDATE also needs to be able to return the PVALIDATE_FAIL_SIZEMISMATCH error code, which is tested for in pvalidate_pages() and does not terminate the VM. __text_poke() doesn’t have the machinery to return such an error code, and that's harder to fix. There's also the conceptual issue. The PVALIDATE use case isn't working with a text area, so that case would be abusing "text_poke" a bit. I could imagine __text_poke() having code to verify that it's working on a text area, even if that code isn't there now. To get a sense of performance, I hacked the equivalent of text_poke() to work with PVALIDATE. The biggest case of transitioning pages from encrypted to decrypted is the swiotlb area, which is 1 Gbyte for a VM with 16 Gbytes or more of memory. In a Hyper-V CoCo VM, current code takes about 270 milliseconds to transition that 1 Gbyte swiotlb area. With my initial approach using vmap_pages_range(), that 270 ms went to 319 ms, which is fairly negligible in the overall VM boot time. Using the text_poke() approach increased the time to 368 ms, which is bigger but still probably not a show-stopper. It's definitely faster than creating a new vmalloc area for each page that needs to be PVALIDATE'd, which adds about 6 seconds to the boot time. All-in-all, I'm back to my Plan B, which is to mark the pages "not present" only in configurations where the hypervisor callbacks operate on physical addresses instead of virtual addresses. Since SEV-SNP needs virtual addresses, it will need to handle exceptions generated by load_unaligned_zeropad() and do the appropriate fixup. I'll try to get my "Plan B" patch set posted in a new few days. Michael > > diff --git a/arch/x86/kernel/alternative.c > b/arch/x86/kernel/alternative.c > index 73be3931e4f0..a13293564eeb 100644 > --- a/arch/x86/kernel/alternative.c > +++ b/arch/x86/kernel/alternative.c > @@ -1905,6 +1905,16 @@ void *text_poke(void *addr, const void *opcode, > size_t len) > return __text_poke(text_poke_memcpy, addr, opcode, len); > } > > +static void text_poke_pvalidate(void *dst, const void *src, size_t > len) > +{ > + pvalidate(dst, len, true); // if fail, terminate > +} > + > +void *pvalidated_poke(void *addr) > +{ > + return __text_poke(text_poke_pvalidate, addr, NULL, PAGE_SIZE); > +} > + > /** > * text_poke_kgdb - Update instructions on a live kernel by kgdb > * @addr: address to modify > > > > > > > At this point, the complexity of creating the temp mapping for > > PVALIDATE is seeming excessive. On balance it seems simpler to > > revert to an approach where the use of set_memory_np() and > > set_memory_p() is conditional. It would be necessary when #VC > > and #VE exceptions are directed to a paravisor. (This assumes the > > paravisor interface in the hypervisor callbacks does the natural thing > > of working with physical addresses, so there's no need for a temp > > mapping.) > > > > Optionally, the set_memory_np()/set_memory_p() approach could > > be used in other cases where the hypervisor callbacks work with > > physical addresses. But it can't be used with cases where the > > hypervisor callbacks need valid virtual addresses. > > > > So on net, set_memory_np()/set_memory_p() would be used in > > the Hyper-V cases of TDX and SEV-SNP with a paravisor. It could > > optionally be used with TDX with no paravisor, but my sense is > > that Kirill wants to keep TDX "as is" and let the exception handlers > > do the load_unaligned_zeropad() fixup. > > > > It could not be used with SEV-SNP with no paravisor. Additional fixes > > may be needed on the SEV-SNP side to properly fixup > > load_unaligned_zeropad() accesses to a page that's in transition > > between encrypted and decrypted. > > > > Yea, I don't know about this paravisor/exception stuff.
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c index 454acd7a2daf..4d4a3fc0b725 100644 --- a/arch/x86/boot/compressed/sev.c +++ b/arch/x86/boot/compressed/sev.c @@ -224,7 +224,7 @@ static phys_addr_t __snp_accept_memory(struct snp_psc_desc *desc, if (vmgexit_psc(boot_ghcb, desc)) sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC); - pvalidate_pages(desc); + pvalidate_pages(desc, 0); return pa; } diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c index ccb0915e84e1..fc45fdcf3892 100644 --- a/arch/x86/kernel/sev-shared.c +++ b/arch/x86/kernel/sev-shared.c @@ -1071,35 +1071,70 @@ static void __init setup_cpuid_table(const struct cc_blob_sev_info *cc_info) } } -static void pvalidate_pages(struct snp_psc_desc *desc) +#ifdef __BOOT_COMPRESSED +static int pvalidate_pfn(unsigned long vaddr, unsigned int size, + unsigned long pfn, bool validate, int *rc2) +{ + return 0; +} +#else +static int pvalidate_pfn(unsigned long vaddr, unsigned int size, + unsigned long pfn, bool validate, int *rc2) +{ + int rc; + struct page *page = pfn_to_page(pfn); + + *rc2 = vmap_pages_range(vaddr, vaddr + PAGE_SIZE, + PAGE_KERNEL, &page, PAGE_SHIFT); + rc = pvalidate(vaddr, size, validate); + vunmap_range(vaddr, vaddr + PAGE_SIZE); + + return rc; +} +#endif + +static void pvalidate_pages(struct snp_psc_desc *desc, unsigned long vaddr) { struct psc_entry *e; - unsigned long vaddr; + unsigned long pfn; unsigned int size; unsigned int i; bool validate; - int rc; + int rc, rc2 = 0; for (i = 0; i <= desc->hdr.end_entry; i++) { e = &desc->entries[i]; - vaddr = (unsigned long)pfn_to_kaddr(e->gfn); - size = e->pagesize ? RMP_PG_SIZE_2M : RMP_PG_SIZE_4K; + size = e->pagesize; validate = e->operation == SNP_PAGE_STATE_PRIVATE; + pfn = e->gfn; - rc = pvalidate(vaddr, size, validate); - if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) { - unsigned long vaddr_end = vaddr + PMD_SIZE; + if (vaddr) { + rc = pvalidate_pfn(vaddr, size, pfn, validate, &rc2); + } else { + vaddr = (unsigned long)pfn_to_kaddr(pfn); + rc = pvalidate(vaddr, size, validate); + } - for (; vaddr < vaddr_end; vaddr += PAGE_SIZE) { - rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate); + if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) { + unsigned long last_pfn = pfn + PTRS_PER_PMD - 1; + + for (; pfn <= last_pfn; pfn++) { + if (vaddr) { + rc = pvalidate_pfn(vaddr, RMP_PG_SIZE_4K, + pfn, validate, &rc2); + } else { + vaddr = (unsigned long)pfn_to_kaddr(pfn); + rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate); + } if (rc) break; } } if (rc) { - WARN(1, "Failed to validate address 0x%lx ret %d", vaddr, rc); + WARN(1, "Failed to validate address 0x%lx ret %d ret2 %d", + vaddr, rc, rc2); sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE); } } diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c index 7eac92c07a58..08b2e2a0d67d 100644 --- a/arch/x86/kernel/sev.c +++ b/arch/x86/kernel/sev.c @@ -790,7 +790,7 @@ void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op } static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr, - unsigned long vaddr_end, int op) + unsigned long vaddr_end, int op, unsigned long temp_vaddr) { struct ghcb_state state; bool use_large_entry; @@ -842,7 +842,7 @@ static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long /* Page validation must be rescinded before changing to shared */ if (op == SNP_PAGE_STATE_SHARED) - pvalidate_pages(data); + pvalidate_pages(data, temp_vaddr); local_irq_save(flags); @@ -862,12 +862,13 @@ static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long /* Page validation must be performed after changing to private */ if (op == SNP_PAGE_STATE_PRIVATE) - pvalidate_pages(data); + pvalidate_pages(data, temp_vaddr); return vaddr; } -static void set_pages_state(unsigned long vaddr, unsigned long npages, int op) +static void set_pages_state(unsigned long vaddr, unsigned long npages, + int op, unsigned long temp_vaddr) { struct snp_psc_desc desc; unsigned long vaddr_end; @@ -880,23 +881,30 @@ static void set_pages_state(unsigned long vaddr, unsigned long npages, int op) vaddr_end = vaddr + (npages << PAGE_SHIFT); while (vaddr < vaddr_end) - vaddr = __set_pages_state(&desc, vaddr, vaddr_end, op); + vaddr = __set_pages_state(&desc, vaddr, vaddr_end, + op, temp_vaddr); } void snp_set_memory_shared(unsigned long vaddr, unsigned long npages) { - if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) - return; + struct vm_struct *area; + unsigned long temp_vaddr; - set_pages_state(vaddr, npages, SNP_PAGE_STATE_SHARED); + area = get_vm_area(PAGE_SIZE * (PTRS_PER_PMD + 1), 0); + temp_vaddr = ALIGN((unsigned long)(area->addr + PAGE_SIZE), PMD_SIZE); + set_pages_state(vaddr, npages, SNP_PAGE_STATE_SHARED, temp_vaddr); + free_vm_area(area); } void snp_set_memory_private(unsigned long vaddr, unsigned long npages) { - if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) - return; + struct vm_struct *area; + unsigned long temp_vaddr; - set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE); + area = get_vm_area(PAGE_SIZE * (PTRS_PER_PMD + 1), 0); + temp_vaddr = ALIGN((unsigned long)(area->addr + PAGE_SIZE), PMD_SIZE); + set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE, temp_vaddr); + free_vm_area(area); } void snp_accept_memory(phys_addr_t start, phys_addr_t end) @@ -909,7 +917,7 @@ void snp_accept_memory(phys_addr_t start, phys_addr_t end) vaddr = (unsigned long)__va(start); npages = (end - start) >> PAGE_SHIFT; - set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE); + set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE, 0); } static int snp_set_vmsa(void *va, bool vmsa)