
[v15,17/23] x86/kexec: Flush cache of TDX private memory

Message ID 2151c68079c1cb837d07bd8015e4ff1f662e1a6e.1699527082.git.kai.huang@intel.com (mailing list archive)
State New, archived
Series TDX host kernel support

Commit Message

Huang, Kai Nov. 9, 2023, 11:55 a.m. UTC
There are two problems with using kexec() to boot to a new kernel when
the old kernel has enabled TDX: 1) part of the memory pages are still
TDX private pages; 2) there might be dirty cachelines associated with
TDX private pages.

The first problem doesn't matter on platforms without the "partial write
machine check" erratum.  KeyID 0 has no integrity check.  If the new
kernel wants to use any non-zero KeyID, it needs to convert the memory
to that KeyID, and such conversion works from any KeyID.

However the old kernel needs to guarantee there's no dirty cacheline
left behind before booting to the new kernel to avoid silent corruption
from later cacheline writeback (Intel hardware doesn't guarantee cache
coherency across different KeyIDs).

There are two things that the old kernel needs to do to achieve that:

1) Stop accessing TDX private memory mappings:
   a. Stop making TDX module SEAMCALLs (TDX global KeyID);
   b. Stop TDX guests from running (per-guest TDX KeyID).
2) Flush any cachelines from previous TDX private KeyID writes.

For 2), use wbinvd() to flush the cache in stop_this_cpu(), following
the SME support.  This way 1) happens for free, as there's no TDX
activity between the wbinvd() and the native_halt().

Flushing the cache in stop_this_cpu() only covers the remote cpus.
Unlike SME, which flushes the cache of the rebooting cpu in
relocate_kernel(), flush the cache of the rebooting cpu (the one doing
kexec()) right after stopping the remote cpus in machine_shutdown().

There are two reasons to do so: 1) For TDX there's no need to defer
cache flush to relocate_kernel() because all TDX activities have been
stopped.  2) On the platforms with the above erratum the kernel must
convert all TDX private pages back to normal before booting to the new
kernel in kexec(), and flushing cache early allows the kernel to convert
memory early rather than having to muck with the relocate_kernel()
assembly.

Theoretically, the cache flush is only needed when the TDX module has
been initialized.  However, initializing the TDX module is done on
demand at runtime, and reading the module status requires taking a
mutex.  Instead, just check whether TDX is enabled by the BIOS to decide
whether to flush the cache.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v14 -> v15:
 - No change

v13 -> v14:
 - No change


---
 arch/x86/kernel/process.c |  8 +++++++-
 arch/x86/kernel/reboot.c  | 15 +++++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)

Comments

Dave Hansen Nov. 27, 2023, 6:13 p.m. UTC | #1
On 11/9/23 03:55, Kai Huang wrote:
...
> --- a/arch/x86/kernel/reboot.c
> +++ b/arch/x86/kernel/reboot.c
> @@ -31,6 +31,7 @@
>  #include <asm/realmode.h>
>  #include <asm/x86_init.h>
>  #include <asm/efi.h>
> +#include <asm/tdx.h>
>  
>  /*
>   * Power off function, if any
> @@ -741,6 +742,20 @@ void native_machine_shutdown(void)
>  	local_irq_disable();
>  	stop_other_cpus();
>  #endif
> +	/*
> +	 * stop_other_cpus() has flushed all dirty cachelines of TDX
> +	 * private memory on remote cpus.  Unlike SME, which does the
> +	 * cache flush on _this_ cpu in the relocate_kernel(), flush
> +	 * the cache for _this_ cpu here.  This is because on the
> +	 * platforms with "partial write machine check" erratum the
> +	 * kernel needs to convert all TDX private pages back to normal
> +	 * before booting to the new kernel in kexec(), and the cache
> +	 * flush must be done before that.  If the kernel took SME's way,
> +	 * it would have to muck with the relocate_kernel() assembly to
> +	 * do memory conversion.
> +	 */
> +	if (platform_tdx_enabled())
> +		native_wbinvd();

Why can't the TDX host code just set host_mem_enc_active=1?

Sure, you'll end up *using* the SME WBINVD support, but then you don't
have two different WBINVD call sites.  You also don't have to mess with
a single line of assembly.
Huang, Kai Nov. 27, 2023, 7:33 p.m. UTC | #2
On Mon, 2023-11-27 at 10:13 -0800, Hansen, Dave wrote:
> On 11/9/23 03:55, Kai Huang wrote:
> ...
> > --- a/arch/x86/kernel/reboot.c
> > +++ b/arch/x86/kernel/reboot.c
> > @@ -31,6 +31,7 @@
> >  #include <asm/realmode.h>
> >  #include <asm/x86_init.h>
> >  #include <asm/efi.h>
> > +#include <asm/tdx.h>
> >  
> >  /*
> >   * Power off function, if any
> > @@ -741,6 +742,20 @@ void native_machine_shutdown(void)
> >  	local_irq_disable();
> >  	stop_other_cpus();
> >  #endif
> > +	/*
> > +	 * stop_other_cpus() has flushed all dirty cachelines of TDX
> > +	 * private memory on remote cpus.  Unlike SME, which does the
> > +	 * cache flush on _this_ cpu in the relocate_kernel(), flush
> > +	 * the cache for _this_ cpu here.  This is because on the
> > +	 * platforms with "partial write machine check" erratum the
> > +	 * kernel needs to convert all TDX private pages back to normal
> > +	 * before booting to the new kernel in kexec(), and the cache
> > +	 * flush must be done before that.  If the kernel took SME's way,
> > +	 * it would have to muck with the relocate_kernel() assembly to
> > +	 * do memory conversion.
> > +	 */
> > +	if (platform_tdx_enabled())
> > +		native_wbinvd();
> 
> Why can't the TDX host code just set host_mem_enc_active=1?
> 
> Sure, you'll end up *using* the SME WBINVD support, but then you don't
> have two different WBINVD call sites.  You also don't have to mess with
> a single line of assembly.

I wanted to avoid changing the assembly.

Perhaps the comment isn't very clear.  Flushing the cache (on the CPU
running kexec) when host_mem_enc_active=1 is handled in the
relocate_kernel() assembly, which happens at a very late stage, right
before jumping to the new kernel.  However, for TDX, when the platform
has the erratum we need to convert TDX private pages back to normal,
which must be done after flushing the cache.  If we reuse
host_mem_enc_active=1, then we will need to change the assembly code to
do that.
Huang, Kai Nov. 27, 2023, 8:02 p.m. UTC | #3
On Mon, 2023-11-27 at 19:33 +0000, Huang, Kai wrote:
> On Mon, 2023-11-27 at 10:13 -0800, Hansen, Dave wrote:
> > On 11/9/23 03:55, Kai Huang wrote:
> > ...
> > > --- a/arch/x86/kernel/reboot.c
> > > +++ b/arch/x86/kernel/reboot.c
> > > @@ -31,6 +31,7 @@
> > >  #include <asm/realmode.h>
> > >  #include <asm/x86_init.h>
> > >  #include <asm/efi.h>
> > > +#include <asm/tdx.h>
> > >  
> > >  /*
> > >   * Power off function, if any
> > > @@ -741,6 +742,20 @@ void native_machine_shutdown(void)
> > >  	local_irq_disable();
> > >  	stop_other_cpus();
> > >  #endif
> > > +	/*
> > > +	 * stop_other_cpus() has flushed all dirty cachelines of TDX
> > > +	 * private memory on remote cpus.  Unlike SME, which does the
> > > +	 * cache flush on _this_ cpu in the relocate_kernel(), flush
> > > +	 * the cache for _this_ cpu here.  This is because on the
> > > +	 * platforms with "partial write machine check" erratum the
> > > +	 * kernel needs to convert all TDX private pages back to normal
> > > +	 * before booting to the new kernel in kexec(), and the cache
> > > +	 * flush must be done before that.  If the kernel took SME's way,
> > > +	 * it would have to muck with the relocate_kernel() assembly to
> > > +	 * do memory conversion.
> > > +	 */
> > > +	if (platform_tdx_enabled())
> > > +		native_wbinvd();
> > 
> > Why can't the TDX host code just set host_mem_enc_active=1?
> > 
> > Sure, you'll end up *using* the SME WBINVD support, but then you don't
> > have two different WBINVD call sites.  You also don't have to mess with
> > a single line of assembly.
> 
> I wanted to avoid changing the assembly.
> 
> Perhaps the comment isn't very clear.  Flushing cache (on the CPU running kexec)
> when the host_mem_enc_active=1 is handled in the relocate_kernel() assembly,
> which happens at very late stage right before jumping to the new kernel. 
> However for TDX when the platform has erratum we need to convert TDX private
> pages back to normal, which must be done after flushing cache.  If we reuse
> host_mem_enc_active=1, then we will need to change the assembly code to do that.
> 

Forgot to say: doing the TDX page conversion in the relocate_kernel()
assembly isn't easy, because the cache flushing when
host_mem_enc_active=1 happens after the kernel has switched to the
identity mapping page table, so we would need hacks like fixing up
symbol addresses etc.
Dave Hansen Nov. 27, 2023, 8:05 p.m. UTC | #4
On 11/27/23 11:33, Huang, Kai wrote:
> On Mon, 2023-11-27 at 10:13 -0800, Hansen, Dave wrote:
>> On 11/9/23 03:55, Kai Huang wrote:
>> ...
>>> --- a/arch/x86/kernel/reboot.c
>>> +++ b/arch/x86/kernel/reboot.c
>>> @@ -31,6 +31,7 @@
>>>  #include <asm/realmode.h>
>>>  #include <asm/x86_init.h>
>>>  #include <asm/efi.h>
>>> +#include <asm/tdx.h>
>>>
>>>  /*
>>>   * Power off function, if any
>>> @@ -741,6 +742,20 @@ void native_machine_shutdown(void)
>>>     local_irq_disable();
>>>     stop_other_cpus();
>>>  #endif
>>> +   /*
>>> +    * stop_other_cpus() has flushed all dirty cachelines of TDX
>>> +    * private memory on remote cpus.  Unlike SME, which does the
>>> +    * cache flush on _this_ cpu in the relocate_kernel(), flush
>>> +    * the cache for _this_ cpu here.  This is because on the
>>> +    * platforms with "partial write machine check" erratum the
>>> +    * kernel needs to convert all TDX private pages back to normal
>>> +    * before booting to the new kernel in kexec(), and the cache
>>> +    * flush must be done before that.  If the kernel took SME's way,
>>> +    * it would have to muck with the relocate_kernel() assembly to
>>> +    * do memory conversion.
>>> +    */
>>> +   if (platform_tdx_enabled())
>>> +           native_wbinvd();
>>
>> Why can't the TDX host code just set host_mem_enc_active=1?
>>
>> Sure, you'll end up *using* the SME WBINVD support, but then you don't
>> have two different WBINVD call sites.  You also don't have to mess with
>> a single line of assembly.
> 
> I wanted to avoid changing the assembly.
> 
> Perhaps the comment isn't very clear.  Flushing cache (on the CPU running kexec)
> when the host_mem_enc_active=1 is handled in the relocate_kernel() assembly,
> which happens at very late stage right before jumping to the new kernel.
> However for TDX when the platform has erratum we need to convert TDX private
> pages back to normal, which must be done after flushing cache.  If we reuse
> host_mem_enc_active=1, then we will need to change the assembly code to do that.

I honestly think you need to stop thinking about the partial write issue
at this point in the series.  It's really causing a horrible amount of
unnecessary confusion.

Here's the golden rule:

	The boot CPU needs to run WBINVD sometime after it stops writing
	to private memory but before it starts treating the memory as
	shared.

On SME kernels, that key point is evidently in early boot, when the new
kernel is enabling SME.  I _think_ that point is also a valid place to
do WBINVD on no-TDX-erratum systems.

On TDX systems with the erratum, there's a *second* point before the
private=>shared conversion occurs.  I think what you're trying to do
here is prematurely optimize the erratum-affected systems so that they
don't do two WBINVDs.  Please stop trying to do that.

This WBINVD is only _needed_ for the erratum.  It should be closer to
the actual erratum handling.
Huang, Kai Nov. 27, 2023, 8:52 p.m. UTC | #5
On Mon, 2023-11-27 at 12:05 -0800, Dave Hansen wrote:
> On 11/27/23 11:33, Huang, Kai wrote:
> > On Mon, 2023-11-27 at 10:13 -0800, Hansen, Dave wrote:
> > > On 11/9/23 03:55, Kai Huang wrote:
> > > ...
> > > > --- a/arch/x86/kernel/reboot.c
> > > > +++ b/arch/x86/kernel/reboot.c
> > > > @@ -31,6 +31,7 @@
> > > >  #include <asm/realmode.h>
> > > >  #include <asm/x86_init.h>
> > > >  #include <asm/efi.h>
> > > > +#include <asm/tdx.h>
> > > > 
> > > >  /*
> > > >   * Power off function, if any
> > > > @@ -741,6 +742,20 @@ void native_machine_shutdown(void)
> > > >     local_irq_disable();
> > > >     stop_other_cpus();
> > > >  #endif
> > > > +   /*
> > > > +    * stop_other_cpus() has flushed all dirty cachelines of TDX
> > > > +    * private memory on remote cpus.  Unlike SME, which does the
> > > > +    * cache flush on _this_ cpu in the relocate_kernel(), flush
> > > > +    * the cache for _this_ cpu here.  This is because on the
> > > > +    * platforms with "partial write machine check" erratum the
> > > > +    * kernel needs to convert all TDX private pages back to normal
> > > > +    * before booting to the new kernel in kexec(), and the cache
> > > > +    * flush must be done before that.  If the kernel took SME's way,
> > > > +    * it would have to muck with the relocate_kernel() assembly to
> > > > +    * do memory conversion.
> > > > +    */
> > > > +   if (platform_tdx_enabled())
> > > > +           native_wbinvd();
> > > 
> > > Why can't the TDX host code just set host_mem_enc_active=1?
> > > 
> > > Sure, you'll end up *using* the SME WBINVD support, but then you don't
> > > have two different WBINVD call sites.  You also don't have to mess with
> > > a single line of assembly.
> > 
> > I wanted to avoid changing the assembly.
> > 
> > Perhaps the comment isn't very clear.  Flushing cache (on the CPU running kexec)
> > when the host_mem_enc_active=1 is handled in the relocate_kernel() assembly,
> > which happens at very late stage right before jumping to the new kernel.
> > However for TDX when the platform has erratum we need to convert TDX private
> > pages back to normal, which must be done after flushing cache.  If we reuse
> > host_mem_enc_active=1, then we will need to change the assembly code to do that.
> 
> I honestly think you need to stop thinking about the partial write issue
> at this point in the series.  It's really causing a horrible amount of
> unnecessary confusion.
> 
> Here's the golden rule:
> 
> 	The boot CPU needs to run WBINVD sometime after it stops writing
> 	to private memory but before it starts treating the memory as
> 	shared.
> 
> On SME kernels, that key point evidently in early boot when it's
> enabling SME.  I _think_ that point is also a valid place to do WBINVD
> on no-TDX-erratum systems.

You mean we could advertise cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)
as true for the TDX host?  We could, but IMHO it doesn't perfectly match.

An SME kernel sets _PAGE_ENC by default for all memory mappings IIUC,
but a TDX host never actually sets any encryption bits in the page
tables managed by the kernel.

So I think we can just do the below, without advertising
CC_ATTR_HOST_MEM_ENCRYPT for the TDX host?

--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -377,7 +377,8 @@ void machine_kexec(struct kimage *image)
                                       (unsigned long)page_list,
                                       image->start,
                                       image->preserve_context,
-                                      cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT));
+                                      cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT) ||
+                                      platform_tdx_enabled());


> 
> On TDX systems with the erratum, there's a *second* point before the
> private=>shared conversion occurs.  I think what you're trying to do
> here is prematurely optimize the erratum-affected affected systems so
> that they don't do two WBINVDs.  Please stop trying to do that.
> 
> This WBINVD is only _needed_ for the erratum.  It should be closer to
> the actual erratum handing.

If we do WBINVD early here then the second one isn't needed.  But 100% agreed
this handling/optimization should be done later closer to the erratum handling.
Dave Hansen Nov. 27, 2023, 9:06 p.m. UTC | #6
On 11/27/23 12:52, Huang, Kai wrote:
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -377,7 +377,8 @@ void machine_kexec(struct kimage *image)
>                                        (unsigned long)page_list,
>                                        image->start,
>                                        image->preserve_context,
> -                                      cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT));
> +                                      cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT) ||
> +                                      platform_tdx_enabled());

Well, something more like the attached would be preferable, but you've
got the right idea logically.
Huang, Kai Nov. 27, 2023, 10:09 p.m. UTC | #7
On Mon, 2023-11-27 at 13:06 -0800, Hansen, Dave wrote:
> On 11/27/23 12:52, Huang, Kai wrote:
> > --- a/arch/x86/kernel/machine_kexec_64.c
> > +++ b/arch/x86/kernel/machine_kexec_64.c
> > @@ -377,7 +377,8 @@ void machine_kexec(struct kimage *image)
> >                                        (unsigned long)page_list,
> >                                        image->start,
> >                                        image->preserve_context,
> > -                                      cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT));
> > +                                      cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT) ||
> > +                                      platform_tdx_enabled());
> 
> Well, something more like the attached would be preferable, but you've
> got the right idea logically.

Thanks!

On top of that, I think the code below (diff also attached) should
advertise CC_ATTR_HOST_MEM_INCOHERENT for the TDX host?

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 2e2d559169a8..bec70b967504 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -12,6 +12,8 @@
 
 #include <asm/coco.h>
 #include <asm/processor.h>
+#include <asm/cpufeatures.h>
+#include <asm/tdx.h>
 
 enum cc_vendor cc_vendor __ro_after_init = CC_VENDOR_NONE;
 static u64 cc_mask __ro_after_init;
@@ -23,7 +25,9 @@ static bool noinstr intel_cc_platform_has(enum cc_attr attr)
        case CC_ATTR_HOTPLUG_DISABLED:
        case CC_ATTR_GUEST_MEM_ENCRYPT:
        case CC_ATTR_MEM_ENCRYPT:
-               return true;
+               return cpu_feature_enabled(X86_FEATURE_TDX_GUEST);
+       case CC_ATTR_HOST_MEM_INCOHERENT:
+               return platform_tdx_enabled();
        default:
                return false;
        }
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 73cd2f7b7d87..1ae21348edc1 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1634,6 +1634,13 @@ static int __init tdx_init(void)
        tdx_guest_keyid_start = tdx_keyid_start + 1;
        tdx_nr_guest_keyids = nr_tdx_keyids - 1;
 
+       /*
+        * TDX doesn't guarantee cache coherency among different
+        * KeyIDs.  Advertise the CC_ATTR_HOST_MEM_INCOHERENT
+        * attribute for TDX host.
+        */
+       cc_vendor = CC_VENDOR_INTEL;
+
        return 0;
 }
 early_initcall(tdx_init);


I'll do some test with your code and the above code.

Patch

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b6f4e8399fca..8e3cf0f8d7f9 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -823,8 +823,14 @@  void __noreturn stop_this_cpu(void *dummy)
 	 *
 	 * Test the CPUID bit directly because the machine might've cleared
 	 * X86_FEATURE_SME due to cmdline options.
+	 *
+	 * The TDX module or guests might have left dirty cachelines
+	 * behind.  Flush them to avoid corruption from later writeback.
+	 * Note that this flushes on all systems where TDX is possible,
+	 * but does not actually check that TDX was in use.
 	 */
-	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+	if ((c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+			|| platform_tdx_enabled())
 		native_wbinvd();
 
 	/*
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 830425e6d38e..e1a4fa8de11d 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -31,6 +31,7 @@ 
 #include <asm/realmode.h>
 #include <asm/x86_init.h>
 #include <asm/efi.h>
+#include <asm/tdx.h>
 
 /*
  * Power off function, if any
@@ -741,6 +742,20 @@  void native_machine_shutdown(void)
 	local_irq_disable();
 	stop_other_cpus();
 #endif
+	/*
+	 * stop_other_cpus() has flushed all dirty cachelines of TDX
+	 * private memory on remote cpus.  Unlike SME, which does the
+	 * cache flush on _this_ cpu in the relocate_kernel(), flush
+	 * the cache for _this_ cpu here.  This is because on the
+	 * platforms with "partial write machine check" erratum the
+	 * kernel needs to convert all TDX private pages back to normal
+	 * before booting to the new kernel in kexec(), and the cache
+	 * flush must be done before that.  If the kernel took SME's way,
+	 * it would have to muck with the relocate_kernel() assembly to
+	 * do memory conversion.
+	 */
+	if (platform_tdx_enabled())
+		native_wbinvd();
 
 	lapic_shutdown();
 	restore_boot_irq_mode();