Message ID: 20240528095522.509667-12-kirill.shutemov@linux.intel.com (mailing list archive)
State: Handled Elsewhere, archived
Series: x86/tdx: Add kexec support
On Tue, May 28, 2024 at 12:55:14PM +0300, Kirill A. Shutemov wrote:
> +static void tdx_kexec_finish(void)
> +{
> +	unsigned long addr, end;
> +	long found = 0, shared;
> +
> +	lockdep_assert_irqs_disabled();
> +
> +	addr = PAGE_OFFSET;
> +	end = PAGE_OFFSET + get_max_mapped();
> +
> +	while (addr < end) {
> +		unsigned long size;
> +		unsigned int level;
> +		pte_t *pte;
> +
> +		pte = lookup_address(addr, &level);
> +		size = page_level_size(level);
> +
> +		if (pte && pte_decrypted(*pte)) {
> +			int pages = size / PAGE_SIZE;
> +
> +			/*
> +			 * Touching memory with shared bit set triggers implicit
> +			 * conversion to shared.
> +			 *
> +			 * Make sure nobody touches the shared range from
> +			 * now on.
> +			 */
> +			set_pte(pte, __pte(0));
> +

Format the below into a comment here:

/*
 * The only thing one can do at this point on failure is panic. It is
 * reasonable to proceed, especially for the crash case, because the
 * kexec-ed kernel is using a different page table, so there won't be a
 * mismatch between shared/private marking of the page and it doesn't
 * matter.
 *
 * Also, even if the failure is real and the page cannot be touched as
 * private, the kdump kernel will boot fine as it uses pre-reserved
 * memory. What happens next depends on what the dumping process does,
 * and there's a reasonable chance to produce a useful dump on crash.
 *
 * Regardless, the print leaves a trace in the log to give a clue for
 * debug.
 *
 * One possible reason for the failure is if kdump raced with a memory
 * conversion. In this case the shared bit in the page table got set (or
 * not cleared from a shared->private conversion), but the page is
 * actually private. So this failure is not going to affect the kexec'ed
 * kernel.
 */ <---

> +			if (!tdx_enc_status_changed(addr, pages, true)) {
> +				pr_err("Failed to unshare range %#lx-%#lx\n",
> +				       addr, addr + size);
> +			}
> +
> +			found += pages;
> +		}
> +
> +		addr += size;
> +	}
> +
> +	__flush_tlb_all();
> +
> +	shared = atomic_long_read(&nr_shared);
> +	if (shared != found) {
> +		pr_err("shared page accounting is off\n");
> +		pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
> +	}
> +}

...

> static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> {
> -	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> -		return __set_memory_enc_pgtable(addr, numpages, enc);
> +	int ret = 0;
>
> -	return 0;
> +	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
> +		if (!down_read_trylock(&mem_enc_lock))
> +			return -EBUSY;
> +
> +		ret = __set_memory_enc_pgtable(addr, numpages, enc);
> +
> +		up_read(&mem_enc_lock);
> +	}

So CC_ATTR_MEM_ENCRYPT is set for SEV* guests too. You need to change
that code here to take the lock only on TDX, where you want it, not on
the others.

Thx.
Hello Boris,

On 5/31/2024 10:14 AM, Borislav Petkov wrote:
>> static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
>> {
>> -	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
>> -		return __set_memory_enc_pgtable(addr, numpages, enc);
>> +	int ret = 0;
>>
>> -	return 0;
>> +	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
>> +		if (!down_read_trylock(&mem_enc_lock))
>> +			return -EBUSY;
>> +
>> +		ret = __set_memory_enc_pgtable(addr, numpages, enc);
>> +
>> +		up_read(&mem_enc_lock);
>> +	}
> So CC_ATTR_MEM_ENCRYPT is set for SEV* guests too. You need to change
> that code here to take the lock only on TDX, where you want it, not on
> the others.

SNP guest kexec patches are based on top of this patch series, and SNP
guests also need this exclusive mem_enc_lock protection, so
CC_ATTR_MEM_ENCRYPT makes sense to be used here.

Thanks,
Ashish
On Fri, May 31, 2024 at 12:34:49PM -0500, Kalra, Ashish wrote:
> SNP guest kexec patches are based on top of this patch-series and SNP guests
> also need this exclusive mem_enc_lock protection, so CC_ATTR_MEM_ENCRYPT
> makes sense to be used here.

Well, for the future, I'd encourage you to always send an Acked-by: you
or Reviewed-by: you as a reply to such patches so that it is clear that
such a change is desired.

Thx.
On Fri, May 31, 2024 at 05:14:42PM +0200, Borislav Petkov wrote:
> On Tue, May 28, 2024 at 12:55:14PM +0300, Kirill A. Shutemov wrote:
> > +static void tdx_kexec_finish(void)
> > +{
> > +	unsigned long addr, end;
> > +	long found = 0, shared;
> > +
> > +	lockdep_assert_irqs_disabled();
> > +
> > +	addr = PAGE_OFFSET;
> > +	end = PAGE_OFFSET + get_max_mapped();
> > +
> > +	while (addr < end) {
> > +		unsigned long size;
> > +		unsigned int level;
> > +		pte_t *pte;
> > +
> > +		pte = lookup_address(addr, &level);
> > +		size = page_level_size(level);
> > +
> > +		if (pte && pte_decrypted(*pte)) {
> > +			int pages = size / PAGE_SIZE;
> > +
> > +			/*
> > +			 * Touching memory with shared bit set triggers implicit
> > +			 * conversion to shared.
> > +			 *
> > +			 * Make sure nobody touches the shared range from
> > +			 * now on.
> > +			 */
> > +			set_pte(pte, __pte(0));
> > +
>
> Format the below into a comment here:
>
> /*
>
> The only thing one can do at this point on failure is panic. It is
> reasonable to proceed, especially for the crash case because the
> kexec-ed kernel is using a different page table so there won't be
> a mismatch between shared/private marking of the page so it doesn't
> matter.

Page tables would not make a difference here. We will switch to identity
mappings soon, and the kexec-ed kernel will build new page tables from
scratch.

I will drop the part after "It is reasonable to proceed".
On 5/28/24 02:55, Kirill A. Shutemov wrote:
> +/* Stop new private<->shared conversions */
> +static void tdx_kexec_begin(bool crash)
> +{
> +	/*
> +	 * Crash kernel reaches here with interrupts disabled: can't wait for
> +	 * conversions to finish.
> +	 *
> +	 * If race happened, just report and proceed.
> +	 */
> +	if (!set_memory_enc_stop_conversion(!crash))
> +		pr_warn("Failed to stop shared<->private conversions\n");
> +}

I don't like having to pass 'crash' in here.

If interrupts are the problem we have ways of testing for those directly.

If it's being in an oops that's a problem, we have 'oops_in_progress'
for that.

In other words, I'd much rather this function (or better yet
set_memory_enc_stop_conversion() itself) use some existing API to change
its behavior in a crash rather than have the context be passed down and
twiddled through several levels of function calls.

There are a ton of these in the console code:

	if (oops_in_progress)
		foo_trylock();
	else
		foo_lock();

To me, that's a billion times more clear than a 'wait' argument that
gets derived from who-knows-what that I have to trace through ten levels
of function calls.
On Tue, Jun 04, 2024 at 09:27:59AM -0700, Dave Hansen wrote:
> On 5/28/24 02:55, Kirill A. Shutemov wrote:
> > +/* Stop new private<->shared conversions */
> > +static void tdx_kexec_begin(bool crash)
> > +{
> > +	/*
> > +	 * Crash kernel reaches here with interrupts disabled: can't wait for
> > +	 * conversions to finish.
> > +	 *
> > +	 * If race happened, just report and proceed.
> > +	 */
> > +	if (!set_memory_enc_stop_conversion(!crash))
> > +		pr_warn("Failed to stop shared<->private conversions\n");
> > +}
>
> I don't like having to pass 'crash' in here.
>
> If interrupts are the problem we have ways of testing for those directly.
>
> If it's being in an oops that's a problem, we have 'oops_in_progress'
> for that.
>
> In other words, I'd much rather this function (or better yet
> set_memory_enc_stop_conversion() itself) use some existing API to change
> its behavior in a crash rather than have the context be passed down and
> twiddled through several levels of function calls.
>
> There are a ton of these in the console code:
>
> 	if (oops_in_progress)
> 		foo_trylock();
> 	else
> 		foo_lock();
>
> To me, that's a billion times more clear than a 'wait' argument that
> gets derived from who-knows-what that I have to trace through ten levels
> of function calls.

Okay, fair enough. Check out the fixup below. Is it what you meant?

One other thing I realized is that these callbacks are dead code if the
kernel is compiled without kexec support. Do we want them to be wrapped
with #ifdef CONFIG_KEXEC_CORE everywhere? It is going to be ugly.

Any better ideas?
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c index 3d23ea0f5d45..1c5aa036b76b 100644 --- a/arch/x86/coco/tdx/tdx.c +++ b/arch/x86/coco/tdx/tdx.c @@ -834,7 +834,7 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages, } /* Stop new private<->shared conversions */ -static void tdx_kexec_begin(bool crash) +static void tdx_kexec_begin(void) { /* * Crash kernel reaches here with interrupts disabled: can't wait for @@ -842,7 +842,7 @@ static void tdx_kexec_begin(bool crash) * * If race happened, just report and proceed. */ - if (!set_memory_enc_stop_conversion(!crash)) + if (!set_memory_enc_stop_conversion()) pr_warn("Failed to stop shared<->private conversions\n"); } diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h index d490db38db9e..4b2abce2e3e7 100644 --- a/arch/x86/include/asm/set_memory.h +++ b/arch/x86/include/asm/set_memory.h @@ -50,7 +50,7 @@ int set_memory_np(unsigned long addr, int numpages); int set_memory_p(unsigned long addr, int numpages); int set_memory_4k(unsigned long addr, int numpages); -bool set_memory_enc_stop_conversion(bool wait); +bool set_memory_enc_stop_conversion(void); int set_memory_encrypted(unsigned long addr, int numpages); int set_memory_decrypted(unsigned long addr, int numpages); diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h index b0f313278967..213cf5379a5a 100644 --- a/arch/x86/include/asm/x86_init.h +++ b/arch/x86/include/asm/x86_init.h @@ -152,8 +152,6 @@ struct x86_init_acpi { * @enc_kexec_begin Begin the two-step process of converting shared memory back * to private. It stops the new conversions from being started * and waits in-flight conversions to finish, if possible. - * The @crash parameter denotes whether the function is being - * called in the crash shutdown path. * @enc_kexec_finish Finish the two-step process of converting shared memory to * private. All memory is private after the call when * the function returns. 
@@ -165,7 +163,7 @@ struct x86_guest { int (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc); bool (*enc_tlb_flush_required)(bool enc); bool (*enc_cache_flush_required)(void); - void (*enc_kexec_begin)(bool crash); + void (*enc_kexec_begin)(void); void (*enc_kexec_finish)(void); }; diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c index fc52ea80cdc8..340af8155658 100644 --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -137,7 +137,7 @@ void native_machine_crash_shutdown(struct pt_regs *regs) * down and interrupts have been disabled. This allows the callback to * detect a race with the conversion and report it. */ - x86_platform.guest.enc_kexec_begin(true); + x86_platform.guest.enc_kexec_begin(); x86_platform.guest.enc_kexec_finish(); crash_save_cpu(regs, safe_smp_processor_id()); diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c index 513809b5b27c..0e0a4cf6b5eb 100644 --- a/arch/x86/kernel/reboot.c +++ b/arch/x86/kernel/reboot.c @@ -723,7 +723,7 @@ void native_machine_shutdown(void) * conversions to finish cleanly. 
*/ if (kexec_in_progress) - x86_platform.guest.enc_kexec_begin(false); + x86_platform.guest.enc_kexec_begin(); /* Stop the cpus and apics */ #ifdef CONFIG_X86_IO_APIC diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c index 8a79fb505303..82b128d3f309 100644 --- a/arch/x86/kernel/x86_init.c +++ b/arch/x86/kernel/x86_init.c @@ -138,7 +138,7 @@ static int enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool static int enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return 0; } static bool enc_tlb_flush_required_noop(bool enc) { return false; } static bool enc_cache_flush_required_noop(void) { return false; } -static void enc_kexec_begin_noop(bool crash) {} +static void enc_kexec_begin_noop(void) {} static void enc_kexec_finish_noop(void) {} static bool is_private_mmio_noop(u64 addr) {return false; } diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index 2a548b65ef5f..443a97e515c0 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -2240,13 +2240,14 @@ static DECLARE_RWSEM(mem_enc_lock); * * Taking the exclusive mem_enc_lock waits for in-flight conversions to complete. * The lock is not released to prevent new conversions from being started. - * - * If sleep is not allowed, as in a crash scenario, try to take the lock. - * Failure indicates that there is a race with the conversion. */ -bool set_memory_enc_stop_conversion(bool wait) +bool set_memory_enc_stop_conversion(void) { - if (!wait) + /* + * In a crash scenario, sleep is not allowed. Try to take the lock. + * Failure indicates that there is a race with the conversion. + */ + if (oops_in_progress) return down_write_trylock(&mem_enc_lock); down_write(&mem_enc_lock);
On 6/5/24 05:43, Kirill A. Shutemov wrote:
> Okay, fair enough. Check out the fixup below. Is it what you meant?

Yes. Much better.

> One other thing I realized is that these callbacks are dead code if the
> kernel is compiled without kexec support. Do we want them to be wrapped
> with #ifdef CONFIG_KEXEC_CORE everywhere? It is going to be ugly.
>
> Any better ideas?

The other callbacks don't have #ifdefs either and they're dependent on
memory encryption as far as I can tell.

I think a simple:

	if (!IS_ENABLED(CONFIG_KEXEC_CORE))
		return;

at the top of the callbacks will result in a tiny little stub function
when kexec is disabled. So the bloat will be limited to kernels that
have TDX compiled in but kexec compiled out (probably never).

The bloat will be two callback pointers, one tiny stub function, and a
quick call/return in a slow path. I think that probably ends up being a
few dozen bytes of bloat in kernel text for a "probably never" config.
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c index 979891e97d83..c0a651fa8963 100644 --- a/arch/x86/coco/tdx/tdx.c +++ b/arch/x86/coco/tdx/tdx.c @@ -7,6 +7,7 @@ #include <linux/cpufeature.h> #include <linux/export.h> #include <linux/io.h> +#include <linux/kexec.h> #include <asm/coco.h> #include <asm/tdx.h> #include <asm/vmx.h> @@ -14,6 +15,7 @@ #include <asm/insn.h> #include <asm/insn-eval.h> #include <asm/pgtable.h> +#include <asm/set_memory.h> /* MMIO direction */ #define EPT_READ 0 @@ -831,6 +833,70 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages, return 0; } +/* Stop new private<->shared conversions */ +static void tdx_kexec_begin(bool crash) +{ + /* + * Crash kernel reaches here with interrupts disabled: can't wait for + * conversions to finish. + * + * If race happened, just report and proceed. + */ + if (!set_memory_enc_stop_conversion(!crash)) + pr_warn("Failed to stop shared<->private conversions\n"); +} + +/* Walk direct mapping and convert all shared memory back to private */ +static void tdx_kexec_finish(void) +{ + unsigned long addr, end; + long found = 0, shared; + + lockdep_assert_irqs_disabled(); + + addr = PAGE_OFFSET; + end = PAGE_OFFSET + get_max_mapped(); + + while (addr < end) { + unsigned long size; + unsigned int level; + pte_t *pte; + + pte = lookup_address(addr, &level); + size = page_level_size(level); + + if (pte && pte_decrypted(*pte)) { + int pages = size / PAGE_SIZE; + + /* + * Touching memory with shared bit set triggers implicit + * conversion to shared. + * + * Make sure nobody touches the shared range from + * now on. 
+ */ + set_pte(pte, __pte(0)); + + if (!tdx_enc_status_changed(addr, pages, true)) { + pr_err("Failed to unshare range %#lx-%#lx\n", + addr, addr + size); + } + + found += pages; + } + + addr += size; + } + + __flush_tlb_all(); + + shared = atomic_long_read(&nr_shared); + if (shared != found) { + pr_err("shared page accounting is off\n"); + pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found); + } +} + void __init tdx_early_init(void) { struct tdx_module_args args = { @@ -890,6 +956,9 @@ void __init tdx_early_init(void) x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required; x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required; + x86_platform.guest.enc_kexec_begin = tdx_kexec_begin; + x86_platform.guest.enc_kexec_finish = tdx_kexec_finish; + /* * TDX intercepts the RDMSR to read the X2APIC ID in the parallel * bringup low level code. That raises #VE which cannot be handled diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 65b8e5bb902c..e39311a89bf4 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -140,6 +140,11 @@ static inline int pte_young(pte_t pte) return pte_flags(pte) & _PAGE_ACCESSED; } +static inline bool pte_decrypted(pte_t pte) +{ + return cc_mkdec(pte_val(pte)) == pte_val(pte); +} + #define pmd_dirty pmd_dirty static inline bool pmd_dirty(pmd_t pmd) { diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h index 9aee31862b4a..d490db38db9e 100644 --- a/arch/x86/include/asm/set_memory.h +++ b/arch/x86/include/asm/set_memory.h @@ -49,8 +49,11 @@ int set_memory_wb(unsigned long addr, int numpages); int set_memory_np(unsigned long addr, int numpages); int set_memory_p(unsigned long addr, int numpages); int set_memory_4k(unsigned long addr, int numpages); + +bool set_memory_enc_stop_conversion(bool wait); int set_memory_encrypted(unsigned long addr, int numpages); int set_memory_decrypted(unsigned long addr, int numpages); + int 
set_memory_np_noalias(unsigned long addr, int numpages); int set_memory_nonglobal(unsigned long addr, int numpages); int set_memory_global(unsigned long addr, int numpages); diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c index a7a7a6c6a3fb..2a548b65ef5f 100644 --- a/arch/x86/mm/pat/set_memory.c +++ b/arch/x86/mm/pat/set_memory.c @@ -2227,12 +2227,47 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc) return ret; } +/* + * The lock serializes conversions between private and shared memory. + * + * It is taken for read on conversion. A write lock guarantees that no + * concurrent conversions are in progress. + */ +static DECLARE_RWSEM(mem_enc_lock); + +/* + * Stop new private<->shared conversions. + * + * Taking the exclusive mem_enc_lock waits for in-flight conversions to complete. + * The lock is not released to prevent new conversions from being started. + * + * If sleep is not allowed, as in a crash scenario, try to take the lock. + * Failure indicates that there is a race with the conversion. + */ +bool set_memory_enc_stop_conversion(bool wait) +{ + if (!wait) + return down_write_trylock(&mem_enc_lock); + + down_write(&mem_enc_lock); + + return true; +} + static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc) { - if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) - return __set_memory_enc_pgtable(addr, numpages, enc); + int ret = 0; - return 0; + if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) { + if (!down_read_trylock(&mem_enc_lock)) + return -EBUSY; + + ret = __set_memory_enc_pgtable(addr, numpages, enc); + + up_read(&mem_enc_lock); + } + + return ret; } int set_memory_encrypted(unsigned long addr, int numpages)