
[v15,17/23] x86/kexec: Flush cache of TDX private memory

Message ID 2151c68079c1cb837d07bd8015e4ff1f662e1a6e.1699527082.git.kai.huang@intel.com (mailing list archive)
State New, archived
Series TDX host kernel support

Commit Message

Huang, Kai Nov. 9, 2023, 11:55 a.m. UTC
There are two problems with using kexec() to boot to a new kernel when
the old kernel has enabled TDX: 1) part of the memory pages are still
TDX private pages; 2) there might be dirty cachelines associated with
TDX private pages.

The first problem doesn't matter on platforms without the "partial write
machine check" erratum.  KeyID 0 has no integrity check.  If the new
kernel wants to use any non-zero KeyID, it needs to convert the memory
to that KeyID, and such conversion works from any KeyID.

However the old kernel needs to guarantee there's no dirty cacheline
left behind before booting to the new kernel to avoid silent corruption
from later cacheline writeback (Intel hardware doesn't guarantee cache
coherency across different KeyIDs).

There are two things that the old kernel needs to do to achieve that:

1) Stop accessing TDX private memory mappings:
   a. Stop making TDX module SEAMCALLs (TDX global KeyID);
   b. Stop TDX guests from running (per-guest TDX KeyID).
2) Flush any cachelines from previous TDX private KeyID writes.

For 2), use wbinvd() to flush the cache in stop_this_cpu(), following
the SME support.  This way 1) happens for free, as there's no TDX
activity between the wbinvd() and the native_halt().

Flushing the cache in stop_this_cpu() only covers the remote cpus.
Unlike SME, which flushes the cache of the rebooting cpu in
relocate_kernel(), flush the cache of the rebooting cpu (the one doing
kexec()) right after stopping the remote cpus in machine_shutdown().

There are two reasons to do so: 1) For TDX there's no need to defer
cache flush to relocate_kernel() because all TDX activities have been
stopped.  2) On the platforms with the above erratum the kernel must
convert all TDX private pages back to normal before booting to the new
kernel in kexec(), and flushing cache early allows the kernel to convert
memory early rather than having to muck with the relocate_kernel()
assembly.

Theoretically, the cache flush is only needed when the TDX module has
been initialized.  However, initializing the TDX module is done on
demand at runtime, and reading the module status requires taking a
mutex.  Instead, just check whether TDX is enabled by the BIOS to decide
whether to flush the cache.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---

v14 -> v15:
 - No change

v13 -> v14:
 - No change


---
 arch/x86/kernel/process.c |  8 +++++++-
 arch/x86/kernel/reboot.c  | 15 +++++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)

Comments

Dave Hansen Nov. 27, 2023, 6:13 p.m. UTC | #1
On 11/9/23 03:55, Kai Huang wrote:
...
> --- a/arch/x86/kernel/reboot.c
> +++ b/arch/x86/kernel/reboot.c
> @@ -31,6 +31,7 @@
>  #include <asm/realmode.h>
>  #include <asm/x86_init.h>
>  #include <asm/efi.h>
> +#include <asm/tdx.h>
>  
>  /*
>   * Power off function, if any
> @@ -741,6 +742,20 @@ void native_machine_shutdown(void)
>  	local_irq_disable();
>  	stop_other_cpus();
>  #endif
> +	/*
> +	 * stop_other_cpus() has flushed all dirty cachelines of TDX
> +	 * private memory on remote cpus.  Unlike SME, which does the
> +	 * cache flush on _this_ cpu in the relocate_kernel(), flush
> +	 * the cache for _this_ cpu here.  This is because on the
> +	 * platforms with "partial write machine check" erratum the
> +	 * kernel needs to convert all TDX private pages back to normal
> +	 * before booting to the new kernel in kexec(), and the cache
> +	 * flush must be done before that.  If the kernel took SME's way,
> +	 * it would have to muck with the relocate_kernel() assembly to
> +	 * do memory conversion.
> +	 */
> +	if (platform_tdx_enabled())
> +		native_wbinvd();

Why can't the TDX host code just set host_mem_enc_active=1?

Sure, you'll end up *using* the SME WBINVD support, but then you don't
have two different WBINVD call sites.  You also don't have to mess with
a single line of assembly.
Huang, Kai Nov. 27, 2023, 7:33 p.m. UTC | #2
On Mon, 2023-11-27 at 10:13 -0800, Hansen, Dave wrote:
> On 11/9/23 03:55, Kai Huang wrote:
> ...
> > --- a/arch/x86/kernel/reboot.c
> > +++ b/arch/x86/kernel/reboot.c
> > @@ -31,6 +31,7 @@
> >  #include <asm/realmode.h>
> >  #include <asm/x86_init.h>
> >  #include <asm/efi.h>
> > +#include <asm/tdx.h>
> >  
> >  /*
> >   * Power off function, if any
> > @@ -741,6 +742,20 @@ void native_machine_shutdown(void)
> >  	local_irq_disable();
> >  	stop_other_cpus();
> >  #endif
> > +	/*
> > +	 * stop_other_cpus() has flushed all dirty cachelines of TDX
> > +	 * private memory on remote cpus.  Unlike SME, which does the
> > +	 * cache flush on _this_ cpu in the relocate_kernel(), flush
> > +	 * the cache for _this_ cpu here.  This is because on the
> > +	 * platforms with "partial write machine check" erratum the
> > +	 * kernel needs to convert all TDX private pages back to normal
> > +	 * before booting to the new kernel in kexec(), and the cache
> > +	 * flush must be done before that.  If the kernel took SME's way,
> > +	 * it would have to muck with the relocate_kernel() assembly to
> > +	 * do memory conversion.
> > +	 */
> > +	if (platform_tdx_enabled())
> > +		native_wbinvd();
> 
> Why can't the TDX host code just set host_mem_enc_active=1?
> 
> Sure, you'll end up *using* the SME WBINVD support, but then you don't
> have two different WBINVD call sites.  You also don't have to mess with
> a single line of assembly.

I wanted to avoid changing the assembly.

Perhaps the comment isn't very clear.  Flushing the cache (on the CPU
running kexec) when host_mem_enc_active=1 is handled in the
relocate_kernel() assembly, which happens at a very late stage, right
before jumping to the new kernel.  However, for TDX, when the platform
has the erratum we need to convert TDX private pages back to normal,
which must be done after flushing the cache.  If we reuse
host_mem_enc_active=1, then we will need to change the assembly code to
do that.
Huang, Kai Nov. 27, 2023, 8:02 p.m. UTC | #3
On Mon, 2023-11-27 at 19:33 +0000, Huang, Kai wrote:
> On Mon, 2023-11-27 at 10:13 -0800, Hansen, Dave wrote:
> > On 11/9/23 03:55, Kai Huang wrote:
> > ...
> > > --- a/arch/x86/kernel/reboot.c
> > > +++ b/arch/x86/kernel/reboot.c
> > > @@ -31,6 +31,7 @@
> > >  #include <asm/realmode.h>
> > >  #include <asm/x86_init.h>
> > >  #include <asm/efi.h>
> > > +#include <asm/tdx.h>
> > >  
> > >  /*
> > >   * Power off function, if any
> > > @@ -741,6 +742,20 @@ void native_machine_shutdown(void)
> > >  	local_irq_disable();
> > >  	stop_other_cpus();
> > >  #endif
> > > +	/*
> > > +	 * stop_other_cpus() has flushed all dirty cachelines of TDX
> > > +	 * private memory on remote cpus.  Unlike SME, which does the
> > > +	 * cache flush on _this_ cpu in the relocate_kernel(), flush
> > > +	 * the cache for _this_ cpu here.  This is because on the
> > > +	 * platforms with "partial write machine check" erratum the
> > > +	 * kernel needs to convert all TDX private pages back to normal
> > > +	 * before booting to the new kernel in kexec(), and the cache
> > > +	 * flush must be done before that.  If the kernel took SME's way,
> > > +	 * it would have to muck with the relocate_kernel() assembly to
> > > +	 * do memory conversion.
> > > +	 */
> > > +	if (platform_tdx_enabled())
> > > +		native_wbinvd();
> > 
> > Why can't the TDX host code just set host_mem_enc_active=1?
> > 
> > Sure, you'll end up *using* the SME WBINVD support, but then you don't
> > have two different WBINVD call sites.  You also don't have to mess with
> > a single line of assembly.
> 
> I wanted to avoid changing the assembly.
> 
> Perhaps the comment isn't very clear.  Flushing cache (on the CPU running kexec)
> when the host_mem_enc_active=1 is handled in the relocate_kernel() assembly,
> which happens at very late stage right before jumping to the new kernel. 
> However for TDX when the platform has erratum we need to convert TDX private
> pages back to normal, which must be done after flushing cache.  If we reuse
> host_mem_enc_active=1, then we will need to change the assembly code to do that.
> 

Forgot to say: doing the TDX page conversion in the relocate_kernel()
assembly isn't easy, because the cache flushing when
host_mem_enc_active=1 happens after the kernel has switched to the
identity mapping page table, so we would need hacks like fixing up
symbol addresses etc.
Dave Hansen Nov. 27, 2023, 8:05 p.m. UTC | #4
On 11/27/23 11:33, Huang, Kai wrote:
> On Mon, 2023-11-27 at 10:13 -0800, Hansen, Dave wrote:
>> On 11/9/23 03:55, Kai Huang wrote:
>> ...
>>> --- a/arch/x86/kernel/reboot.c
>>> +++ b/arch/x86/kernel/reboot.c
>>> @@ -31,6 +31,7 @@
>>>  #include <asm/realmode.h>
>>>  #include <asm/x86_init.h>
>>>  #include <asm/efi.h>
>>> +#include <asm/tdx.h>
>>>
>>>  /*
>>>   * Power off function, if any
>>> @@ -741,6 +742,20 @@ void native_machine_shutdown(void)
>>>     local_irq_disable();
>>>     stop_other_cpus();
>>>  #endif
>>> +   /*
>>> +    * stop_other_cpus() has flushed all dirty cachelines of TDX
>>> +    * private memory on remote cpus.  Unlike SME, which does the
>>> +    * cache flush on _this_ cpu in the relocate_kernel(), flush
>>> +    * the cache for _this_ cpu here.  This is because on the
>>> +    * platforms with "partial write machine check" erratum the
>>> +    * kernel needs to convert all TDX private pages back to normal
>>> +    * before booting to the new kernel in kexec(), and the cache
>>> +    * flush must be done before that.  If the kernel took SME's way,
>>> +    * it would have to muck with the relocate_kernel() assembly to
>>> +    * do memory conversion.
>>> +    */
>>> +   if (platform_tdx_enabled())
>>> +           native_wbinvd();
>>
>> Why can't the TDX host code just set host_mem_enc_active=1?
>>
>> Sure, you'll end up *using* the SME WBINVD support, but then you don't
>> have two different WBINVD call sites.  You also don't have to mess with
>> a single line of assembly.
> 
> I wanted to avoid changing the assembly.
> 
> Perhaps the comment isn't very clear.  Flushing cache (on the CPU running kexec)
> when the host_mem_enc_active=1 is handled in the relocate_kernel() assembly,
> which happens at very late stage right before jumping to the new kernel.
> However for TDX when the platform has erratum we need to convert TDX private
> pages back to normal, which must be done after flushing cache.  If we reuse
> host_mem_enc_active=1, then we will need to change the assembly code to do that.

I honestly think you need to stop thinking about the partial write issue
at this point in the series.  It's really causing a horrible amount of
unnecessary confusion.

Here's the golden rule:

	The boot CPU needs to run WBINVD sometime after it stops writing
	to private memory but before it starts treating the memory as
	shared.

On SME kernels, that key point is evidently in early boot, when the new
kernel is enabling SME.  I _think_ that point is also a valid place to
do WBINVD on no-TDX-erratum systems.

On TDX systems with the erratum, there's a *second* point before the
private=>shared conversion occurs.  I think what you're trying to do
here is prematurely optimize the erratum-affected systems so that they
don't do two WBINVDs.  Please stop trying to do that.

This WBINVD is only _needed_ for the erratum.  It should be closer to
the actual erratum handling.
Huang, Kai Nov. 27, 2023, 8:52 p.m. UTC | #5
On Mon, 2023-11-27 at 12:05 -0800, Dave Hansen wrote:
> On 11/27/23 11:33, Huang, Kai wrote:
> > On Mon, 2023-11-27 at 10:13 -0800, Hansen, Dave wrote:
> > > On 11/9/23 03:55, Kai Huang wrote:
> > > ...
> > > > --- a/arch/x86/kernel/reboot.c
> > > > +++ b/arch/x86/kernel/reboot.c
> > > > @@ -31,6 +31,7 @@
> > > >  #include <asm/realmode.h>
> > > >  #include <asm/x86_init.h>
> > > >  #include <asm/efi.h>
> > > > +#include <asm/tdx.h>
> > > > 
> > > >  /*
> > > >   * Power off function, if any
> > > > @@ -741,6 +742,20 @@ void native_machine_shutdown(void)
> > > >     local_irq_disable();
> > > >     stop_other_cpus();
> > > >  #endif
> > > > +   /*
> > > > +    * stop_other_cpus() has flushed all dirty cachelines of TDX
> > > > +    * private memory on remote cpus.  Unlike SME, which does the
> > > > +    * cache flush on _this_ cpu in the relocate_kernel(), flush
> > > > +    * the cache for _this_ cpu here.  This is because on the
> > > > +    * platforms with "partial write machine check" erratum the
> > > > +    * kernel needs to convert all TDX private pages back to normal
> > > > +    * before booting to the new kernel in kexec(), and the cache
> > > > +    * flush must be done before that.  If the kernel took SME's way,
> > > > +    * it would have to muck with the relocate_kernel() assembly to
> > > > +    * do memory conversion.
> > > > +    */
> > > > +   if (platform_tdx_enabled())
> > > > +           native_wbinvd();
> > > 
> > > Why can't the TDX host code just set host_mem_enc_active=1?
> > > 
> > > Sure, you'll end up *using* the SME WBINVD support, but then you don't
> > > have two different WBINVD call sites.  You also don't have to mess with
> > > a single line of assembly.
> > 
> > I wanted to avoid changing the assembly.
> > 
> > Perhaps the comment isn't very clear.  Flushing cache (on the CPU running kexec)
> > when the host_mem_enc_active=1 is handled in the relocate_kernel() assembly,
> > which happens at very late stage right before jumping to the new kernel.
> > However for TDX when the platform has erratum we need to convert TDX private
> > pages back to normal, which must be done after flushing cache.  If we reuse
> > host_mem_enc_active=1, then we will need to change the assembly code to do that.
> 
> I honestly think you need to stop thinking about the partial write issue
> at this point in the series.  It's really causing a horrible amount of
> unnecessary confusion.
> 
> Here's the golden rule:
> 
> 	The boot CPU needs to run WBINVD sometime after it stops writing
> 	to private memory but before it starts treating the memory as
> 	shared.
> 
> On SME kernels, that key point evidently in early boot when it's
> enabling SME.  I _think_ that point is also a valid place to do WBINVD
> on no-TDX-erratum systems.

You mean we could advertise cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)
as true for the TDX host?  We could, but IMHO it doesn't perfectly match.

An SME kernel sets _PAGE_ENC by default for all memory mappings IIUC,
but a TDX host never actually sets any encryption bits in the page
tables managed by the kernel.

So I think we can just do the below, without advertising
CC_ATTR_HOST_MEM_ENCRYPT for the TDX host?

--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -377,7 +377,8 @@ void machine_kexec(struct kimage *image)
                                       (unsigned long)page_list,
                                       image->start,
                                       image->preserve_context,
-                                      cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT));
+                                      cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT) ||
+                                      platform_tdx_enabled());


> 
> On TDX systems with the erratum, there's a *second* point before the
> private=>shared conversion occurs.  I think what you're trying to do
> here is prematurely optimize the erratum-affected affected systems so
> that they don't do two WBINVDs.  Please stop trying to do that.
> 
> This WBINVD is only _needed_ for the erratum.  It should be closer to
> the actual erratum handing.

If we do WBINVD early here then the second one isn't needed.  But 100% agreed
this handling/optimization should be done later closer to the erratum handling.
Dave Hansen Nov. 27, 2023, 9:06 p.m. UTC | #6
On 11/27/23 12:52, Huang, Kai wrote:
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -377,7 +377,8 @@ void machine_kexec(struct kimage *image)
>                                        (unsigned long)page_list,
>                                        image->start,
>                                        image->preserve_context,
> -                                      cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT));
> +                                      cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT) ||
> +                                      platform_tdx_enabled());

Well, something more like the attached would be preferable, but you've
got the right idea logically.
Huang, Kai Nov. 27, 2023, 10:09 p.m. UTC | #7
On Mon, 2023-11-27 at 13:06 -0800, Hansen, Dave wrote:
> On 11/27/23 12:52, Huang, Kai wrote:
> > --- a/arch/x86/kernel/machine_kexec_64.c
> > +++ b/arch/x86/kernel/machine_kexec_64.c
> > @@ -377,7 +377,8 @@ void machine_kexec(struct kimage *image)
> >                                        (unsigned long)page_list,
> >                                        image->start,
> >                                        image->preserve_context,
> > -                                      cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT));
> > +                                      cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT) ||
> > +                                      platform_tdx_enabled());
> 
> Well, something more like the attached would be preferable, but you've
> got the right idea logically.

Thanks!

On top of that, I think the code below (diff also attached) should
advertise CC_ATTR_HOST_MEM_INCOHERENT for the TDX host?

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 2e2d559169a8..bec70b967504 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -12,6 +12,8 @@
 
 #include <asm/coco.h>
 #include <asm/processor.h>
+#include <asm/cpufeatures.h>
+#include <asm/tdx.h>
 
 enum cc_vendor cc_vendor __ro_after_init = CC_VENDOR_NONE;
 static u64 cc_mask __ro_after_init;
@@ -23,7 +25,9 @@ static bool noinstr intel_cc_platform_has(enum cc_attr attr)
        case CC_ATTR_HOTPLUG_DISABLED:
        case CC_ATTR_GUEST_MEM_ENCRYPT:
        case CC_ATTR_MEM_ENCRYPT:
-               return true;
+               return cpu_feature_enabled(X86_FEATURE_TDX_GUEST);
+       case CC_ATTR_HOST_MEM_INCOHERENT:
+               return platform_tdx_enabled();
        default:
                return false;
        }
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 73cd2f7b7d87..1ae21348edc1 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1634,6 +1634,13 @@ static int __init tdx_init(void)
        tdx_guest_keyid_start = tdx_keyid_start + 1;
        tdx_nr_guest_keyids = nr_tdx_keyids - 1;
 
+       /*
+        * TDX doesn't guarantee cache coherency among different
+        * KeyIDs.  Advertise the CC_ATTR_HOST_MEM_INCOHERENT
+        * attribute for TDX host.
+        */
+       cc_vendor = CC_VENDOR_INTEL;
+
        return 0;
 }
 early_initcall(tdx_init);


I'll do some test with your code and the above code.

Patch

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b6f4e8399fca..8e3cf0f8d7f9 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -823,8 +823,14 @@  void __noreturn stop_this_cpu(void *dummy)
 	 *
 	 * Test the CPUID bit directly because the machine might've cleared
 	 * X86_FEATURE_SME due to cmdline options.
+	 *
+	 * The TDX module or guests might have left dirty cachelines
+	 * behind.  Flush them to avoid corruption from later writeback.
+	 * Note that this flushes on all systems where TDX is possible,
+	 * but does not actually check that TDX was in use.
 	 */
-	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+	if ((c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+			|| platform_tdx_enabled())
 		native_wbinvd();
 
 	/*
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 830425e6d38e..e1a4fa8de11d 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -31,6 +31,7 @@ 
 #include <asm/realmode.h>
 #include <asm/x86_init.h>
 #include <asm/efi.h>
+#include <asm/tdx.h>
 
 /*
  * Power off function, if any
@@ -741,6 +742,20 @@  void native_machine_shutdown(void)
 	local_irq_disable();
 	stop_other_cpus();
 #endif
+	/*
+	 * stop_other_cpus() has flushed all dirty cachelines of TDX
+	 * private memory on remote cpus.  Unlike SME, which does the
+	 * cache flush on _this_ cpu in the relocate_kernel(), flush
+	 * the cache for _this_ cpu here.  This is because on the
+	 * platforms with "partial write machine check" erratum the
+	 * kernel needs to convert all TDX private pages back to normal
+	 * before booting to the new kernel in kexec(), and the cache
+	 * flush must be done before that.  If the kernel took SME's way,
+	 * it would have to muck with the relocate_kernel() assembly to
+	 * do memory conversion.
+	 */
+	if (platform_tdx_enabled())
+		native_wbinvd();
 
 	lapic_shutdown();
 	restore_boot_irq_mode();