[v3,00/21] TDX host kernel support

Message ID cover.1649219184.git.kai.huang@intel.com (mailing list archive)

Message

Kai Huang April 6, 2022, 4:49 a.m. UTC
Intel Trusted Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks.  This series provides support for
initializing the TDX module in the host kernel.  KVM support for TDX is
being developed separately[1].

The code has been tested on a couple of TDX-capable machines.  I would
consider it ready for review.  I would highly appreciate it if anyone could
help review this series (from the high-level design to the detailed
implementation).  For Intel reviewers (CC'ed), please help to review, and I
would appreciate Reviewed-by or Acked-by tags if the patches look good to you.

Thanks in advance.

This series is based on Kirill's TDX guest series[2], because the host-side
SEAMCALL implementation can share the TDCALL implementation introduced in
the TDX guest series.

You can find TDX related specs here:
https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html

You can also find this series in the following repo on GitHub:
https://github.com/intel/tdx/tree/host-upstream

Changelog history:

- V2 -> v3:

 - Rebased to latest TDX guest code, which is based on 5.18-rc1.
 - Addressed comments from Isaku.
  - Fixed memory leak and unnecessary function argument in the patch to
    configure the key for the global keyid (patch 17).
  - Slightly enhanced the patch to get TDX module and CMR information
    (patch 09).
  - Fixed an unintended change in the patch to allocate PAMT (patch 13).
 - Addressed comments from Kevin:
  - Slight improvement to the commit message of patch 03.
 - Removed WARN_ON_ONCE() in the check of cpus_booted_once_mask in
   seamrr_enabled() (patch 04).
 - Changed the documentation patch to add the TDX host kernel support
   material to Documentation/x86/tdx.rst together with the TDX guest
   material, instead of using a standalone file (patch 21).

- RFC (v1) -> v2:
  - Rebased to Kirill's latest TDX guest code.
  - Fixed two issues related to finding all RAM memory regions based on
    e820.
  - Minor improvements to comments and commit messages.

V2:
https://lore.kernel.org/lkml/cover.1647167475.git.kai.huang@intel.com/T/
RFC (v1):
https://lore.kernel.org/all/e0ff030a49b252d91c789a89c303bb4206f85e3d.1646007267.git.kai.huang@intel.com/T/

== Background ==

Intel Trusted Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks.  To support TDX, a new CPU mode called
Secure Arbitration Mode (SEAM) is added to Intel processors.

SEAM is an extension to the existing VMX architecture.  It defines a new
VMX root operation (SEAM VMX root) and a new VMX non-root operation (SEAM
VMX non-root).

SEAM VMX root operation is designed to host a CPU-attested, software
module called the 'TDX module' which implements functions to manage
crypto protected VMs called Trust Domains (TD).  SEAM VMX root is also
designed to host a CPU-attested, software module called the 'Intel
Persistent SEAMLDR (Intel P-SEAMLDR)' to load and update the TDX module.

Host kernel transits to either the P-SEAMLDR or the TDX module via a new
SEAMCALL instruction.  SEAMCALLs are host-side interface functions
defined by the P-SEAMLDR and the TDX module around the new SEAMCALL
instruction.  They are similar to a hypercall, except they are made by
host kernel to the SEAM software modules.

TDX leverages Intel Multi-Key Total Memory Encryption (MKTME) to crypto
protect TD guests.  TDX reserves part of MKTME KeyID space as TDX private
KeyIDs, which can only be used by software runs in SEAM.  The physical
address bits for encoding TDX private KeyID are treated as reserved bits
when not in SEAM operation.  The partitioning of MKTME KeyIDs and TDX
private KeyIDs is configured by BIOS.

Before being able to manage TD guests, the TDX module must be loaded
and properly initialized using SEAMCALLs defined by TDX architecture.
This series assumes both the P-SEAMLDR and the TDX module are loaded by
BIOS before the kernel boots.

There's no CPUID or MSR to detect either the P-SEAMLDR or the TDX module.
Instead, they are detected by calling the P-SEAMLDR's SEAMLDR.INFO SEAMCALL:
if this SEAMCALL succeeds, the P-SEAMLDR is loaded, and the P-SEAMLDR
information it returns further tells whether the TDX module is loaded.
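
Roughly, and with purely illustrative names (not the actual structures or
wrappers used in this series), the detection flow looks like:

        /*
         * Illustrative sketch: issue the P-SEAMLDR's SEAMLDR.INFO SEAMCALL.
         * If the SEAMCALL itself fails, the P-SEAMLDR is not loaded (and
         * thus the TDX module cannot be used either).
         */
        if (seamldr_info(__pa(&p_seamldr_info)))
                return -ENODEV;         /* no P-SEAMLDR */
        if (!p_seamldr_info.seam_ready)
                return -ENODEV;         /* TDX module not loaded/usable */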

The TDX module is initialized in multiple steps:

        1) Global initialization;
        2) Logical-CPU scope initialization;
        3) Enumerate the TDX module capabilities;
        4) Configure the TDX module about usable memory ranges and
           global KeyID information;
        5) Package-scope configuration for the global KeyID;
        6) Initialize TDX metadata for usable memory ranges based on 4).

Step 2) requires calling a SEAMCALL on all "BIOS-enabled" (i.e. in the MADT
table) logical cpus, otherwise step 4) will fail.  Step 5) requires calling
a SEAMCALL on at least one cpu in each package.

The TDX module can also be shut down at any time during its lifetime, by
calling a SEAMCALL on all "BIOS-enabled" logical cpus.
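
For reference, the sequence looks roughly like the sketch below.  The names
follow the SEAMCALL leaf functions defined by the TDX architecture, but they
are illustrative here and do not match the exact wrappers in this series:

        tdh_sys_init();                           /* 1) global init */
        on_each_cpu(tdh_sys_lp_init, NULL, 1);    /* 2) per logical cpu */
        tdh_sys_info(&sysinfo, &cmr_array);       /* 3) enumerate capabilities */
        tdh_sys_config(tdmr_array, global_keyid); /* 4) TDMRs + global KeyID */
        tdh_sys_key_config();                     /* 5) on one cpu per package */
        for_each_tdmr(tdmr)                       /* 6) init TDMR metadata */
                tdh_sys_tdmr_init(tdmr);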

== Design Considerations ==

1. Lazy TDX module initialization on-demand by caller

None of the steps in the TDX module initialization process needs to be done
during kernel boot.  This series doesn't initialize TDX at boot time, but
instead provides two functions to allow the caller to detect and initialize
TDX on demand:

        if (tdx_detect())
                goto no_tdx;
        if (tdx_init())
                goto no_tdx;

This approach has the following advantages:

1) Initializing the TDX module requires reserving roughly 1/256th of system
RAM as metadata (for example, about 4GB on a host with 1TB of RAM).  Enabling
TDX on demand means this memory is only consumed when TDX is truly needed
(i.e. when KVM wants to create TD guests).

2) Both detecting and initializing the TDX module require calling SEAMCALL.
However, SEAMCALL requires the CPU to already be in VMX operation (VMXON has
been done).  So far, KVM is the only user of TDX, and it already handles
VMXON/VMXOFF.  Therefore, letting KVM initialize TDX on demand avoids having
to handle VMXON/VMXOFF (which is not trivial) in the core kernel.  Also, in
the long term, a reference-counted VMXON/VMXOFF approach will likely be
needed, since more kernel components will need to handle VMXON/VMXOFF.
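
Such a reference-based approach is not part of this series, but conceptually
it could look like the sketch below (names are made up for illustration, and
a real implementation would also have to deal with VMXON/VMXOFF being
per-cpu operations):

        static atomic_t vmx_usage_count;

        int vmx_get(void)
        {
                /* Do VMXON only for the first user. */
                if (atomic_inc_return(&vmx_usage_count) == 1)
                        return hardware_enable_all();
                return 0;
        }

        void vmx_put(void)
        {
                /* Do VMXOFF when the last user goes away. */
                if (atomic_dec_and_test(&vmx_usage_count))
                        hardware_disable_all();
        }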

3) It is more flexible for supporting "TDX module runtime update" (not in
this series).  After updating to a new module at runtime, the kernel needs
to go through the initialization process again.  It is possible that the
metadata allocated for the old module cannot be reused for the new one and
needs to be re-allocated.

2. Kernel policy on TDX memory

The host kernel is responsible for choosing which memory regions can be used
as TDX memory, and for passing those memory regions to the TDX module via an
array of "TD Memory Regions" (TDMRs), a data structure defined by the TDX
architecture.
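
Conceptually, and with simplified/illustrative field names (the exact layout
is defined by the TDX module ABI), each TDMR describes a large, 1GB-aligned
region plus the metadata (PAMT) backing it and a small list of reserved
holes within it:

        struct tdmr_info {
                u64 base, size;                  /* 1GB aligned/granular */
                u64 pamt_1g_base, pamt_1g_size;  /* metadata for 1GB pages */
                u64 pamt_2m_base, pamt_2m_size;  /* metadata for 2MB pages */
                u64 pamt_4k_base, pamt_4k_size;  /* metadata for 4KB pages */
                struct tdmr_reserved_area reserved_areas[16];
        };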

The first generation of TDX essentially guarantees that all system RAM
memory regions (excluding the memory below 1MB) can be used as TDX
memory.  To avoid having to modify the page allocator to distinguish TDX
and non-TDX allocation, this series chooses to use all system RAM as TDX
memory.

The e820 table is used to find all system RAM entries.  Following
e820__memblock_setup(), both E820_TYPE_RAM and E820_TYPE_RESERVED_KERN
entries are treated as TDX memory, and contiguous ranges in the same NUMA
node are merged together (similar to memblock_add()) before the
non-page-aligned parts are trimmed.

x86 legacy PMEMs (E820_TYPE_PRAM) are also unconditionally treated as TDX
memory, since underneath they are RAM and can potentially be used as TD
guest memory.
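
A simplified sketch of this policy (illustrative only; add_tdx_memory() is a
made-up helper, and the real code also merges contiguous ranges and trims
non-page-aligned parts as described above):

        for (i = 0; i < e820_table->nr_entries; i++) {
                struct e820_entry *e = &e820_table->entries[i];

                if (e->type != E820_TYPE_RAM &&
                    e->type != E820_TYPE_RESERVED_KERN &&
                    e->type != E820_TYPE_PRAM)
                        continue;

                add_tdx_memory(e->addr, e->size);
        }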

Memblock is not used to find all RAM regions because: 1) it is gone after
the kernel boots; 2) it doesn't include legacy PMEM.

3. Memory hotplug

The first generation of TDX architecturally doesn't support memory
hotplug.  And the first generation of TDX-capable platforms don't support
physical memory hotplug.  Since it physically cannot happen, this series
doesn't add any check in ACPI memory hotplug code path to disable it.

A special case of memory hotplug is adding NVDIMM as system RAM using
kmem driver.  However the first generation of TDX-capable platforms
cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
happen either.

Another case is that the admin can use the 'memmap' kernel command line to
create legacy PMEMs and use them as TD guest memory, or, theoretically, use
the kmem driver to add them as system RAM.  To avoid having to change memory
hotplug code to prevent this from happening, this series always includes
legacy PMEMs when constructing TDMRs so they are also TDX memory.
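
For example, a kernel command line along the lines of:

        memmap=16G!64G

carves out 16GB of RAM starting at 64GB as a legacy PMEM (E820_TYPE_PRAM)
region, which can then be given to a TD guest as memory or, via the kmem
driver, added back as system RAM.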

4. CPU hotplug

The first generation of TDX architecturally doesn't support ACPI CPU
hotplug.  All logical cpus are enabled by BIOS in MADT table.  Also, the
first generation of TDX-capable platforms don't support ACPI CPU hotplug
either.  Since this physically cannot happen, this series doesn't add any
check in ACPI CPU hotplug code path to disable it.

Also, only TDX module initialization requires all BIOS-enabled cpus are
online.  After the initialization, any logical cpu can be brought down
and brought up to online again later.  Therefore this series doesn't
change logical CPU hotplug either.

5. TDX interaction with kexec()

If TDX is ever enabled and/or used to run any TD guests, the cachelines
of TDX private memory, including PAMTs, used by TDX module need to be
flushed before transiting to the new kernel otherwise they may silently
corrupt the new kernel.  Similar to SME, this series flushes cache in
stop_this_cpu().
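
As a rough sketch of the idea (the condition below is illustrative; the
actual patch decides in its own way whether the flush is needed):

        /* In stop_this_cpu(), next to the existing SME cache flush: */
        if (platform_tdx_enabled())
                native_wbinvd();        /* flush TDX private cachelines */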

The TDX module can be initialized only once during its lifetime.  The
first generation of TDX doesn't have interface to reset TDX module to
uninitialized state so it can be initialized again.

This implies:

  - If the old kernel fails to initialize TDX, the new kernel cannot
    use TDX too unless the new kernel fixes the bug which leads to
    initialization failure in the old kernel and can resume from where
    the old kernel stops. This requires certain coordination between
    the two kernels.

  - If the old kernel has initialized TDX successfully, the new kernel
    may be able to use TDX if the two kernels have the exactly same
    configurations on the TDX module. It further requires the new kernel
    to reserve the TDX metadata pages (allocated by the old kernel) in
    its page allocator. It also requires coordination between the two
    kernels.  Furthermore, if kexec() is done when there are active TD
    guests running, the new kernel cannot use TDX because it's extremely
    hard for the old kernel to pass all TDX private pages to the new
    kernel.

Given that, this series doesn't support TDX after kexec() (except the
old kernel doesn't attempt to initialize TDX at all).

And this series doesn't shut down TDX module but leaves it open during
kexec().  It is because shutting down TDX module requires CPU being in
VMX operation but there's no guarantee of this during kexec().  Leaving
the TDX module open is not the best case, but it is OK since the new
kernel won't be able to use TDX anyway (therefore TDX module won't run
at all).

[1] https://lore.kernel.org/lkml/772b20e270b3451aea9714260f2c40ddcc4afe80.1646422845.git.isaku.yamahata@intel.com/T/
[2] https://github.com/intel/tdx/tree/guest-upstream


Kai Huang (21):
  x86/virt/tdx: Detect SEAM
  x86/virt/tdx: Detect TDX private KeyIDs
  x86/virt/tdx: Implement the SEAMCALL base function
  x86/virt/tdx: Add skeleton for detecting and initializing TDX on
    demand
  x86/virt/tdx: Detect P-SEAMLDR and TDX module
  x86/virt/tdx: Shut down TDX module in case of error
  x86/virt/tdx: Do TDX module global initialization
  x86/virt/tdx: Do logical-cpu scope TDX module initialization
  x86/virt/tdx: Get information about TDX module and convertible memory
  x86/virt/tdx: Add placeholder to cover all system RAM as TDX memory
  x86/virt/tdx: Choose to use all system RAM as TDX memory
  x86/virt/tdx: Create TDMRs to cover all system RAM
  x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  x86/virt/tdx: Set up reserved areas for all TDMRs
  x86/virt/tdx: Reserve TDX module global KeyID
  x86/virt/tdx: Configure TDX module with TDMRs and global KeyID
  x86/virt/tdx: Configure global KeyID on all packages
  x86/virt/tdx: Initialize all TDMRs
  x86: Flush cache of TDX private memory during kexec()
  x86/virt/tdx: Add kernel command line to opt-in TDX host support
  Documentation/x86: Add documentation for TDX host support

 .../admin-guide/kernel-parameters.txt         |    6 +
 Documentation/x86/tdx.rst                     |  326 +++-
 arch/x86/Kconfig                              |   14 +
 arch/x86/Makefile                             |    2 +
 arch/x86/include/asm/tdx.h                    |   15 +
 arch/x86/kernel/cpu/intel.c                   |    3 +
 arch/x86/kernel/process.c                     |   15 +-
 arch/x86/virt/Makefile                        |    2 +
 arch/x86/virt/vmx/Makefile                    |    2 +
 arch/x86/virt/vmx/tdx/Makefile                |    2 +
 arch/x86/virt/vmx/tdx/seamcall.S              |   52 +
 arch/x86/virt/vmx/tdx/tdx.c                   | 1717 +++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h                   |  137 ++
 13 files changed, 2279 insertions(+), 14 deletions(-)
 create mode 100644 arch/x86/virt/Makefile
 create mode 100644 arch/x86/virt/vmx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.h

Comments

Kai Huang April 14, 2022, 10:19 a.m. UTC | #1
On Wed, 2022-04-06 at 16:49 +1200, Kai Huang wrote:
> Intel Trusted Domain Extensions (TDX) protects guest VMs from malicious
> host and certain physical attacks.  This series provides support for
> initializing the TDX module in the host kernel.  KVM support for TDX is
> being developed separately[1].
> 
> The code has been tested on couple of TDX-capable machines.  I would
> consider it as ready for review. I highly appreciate if anyone can help
> to review this series (from high level design to detail implementations).
> For Intel reviewers (CC'ed), please help to review, and I would
> appreciate Reviewed-by or Acked-by tags if the patches look good to you.

Hi Intel reviewers,

Kindly ping.  Could you help to review?

--
Thanks,
-Kai
Dave Hansen April 26, 2022, 8:13 p.m. UTC | #2
On 4/5/22 21:49, Kai Huang wrote:
> SEAM VMX root operation is designed to host a CPU-attested, software
> module called the 'TDX module' which implements functions to manage
> crypto protected VMs called Trust Domains (TD).  SEAM VMX root is also

"crypto protected"?  What the heck is that?

> designed to host a CPU-attested, software module called the 'Intel
> Persistent SEAMLDR (Intel P-SEAMLDR)' to load and update the TDX module.
> 
> Host kernel transits to either the P-SEAMLDR or the TDX module via a new

 ^ The

> SEAMCALL instruction.  SEAMCALLs are host-side interface functions
> defined by the P-SEAMLDR and the TDX module around the new SEAMCALL
> instruction.  They are similar to a hypercall, except they are made by
> host kernel to the SEAM software modules.

This is still missing some important high-level things, like that the
TDX module is protected from the untrusted VMM.  Heck, it forgets to
mention that the VMM itself is untrusted and the TDX module replaces
things that the VMM usually does.

It would also be nice to mention here how this compares with SEV-SNP.
Where is the TDX module in that design?  Why doesn't SEV need all this code?

> TDX leverages Intel Multi-Key Total Memory Encryption (MKTME) to crypto
> protect TD guests.  TDX reserves part of MKTME KeyID space as TDX private
> KeyIDs, which can only be used by software runs in SEAM.  The physical

					    ^ which

> address bits for encoding TDX private KeyID are treated as reserved bits
> when not in SEAM operation.  The partitioning of MKTME KeyIDs and TDX
> private KeyIDs is configured by BIOS.
> 
> Before being able to manage TD guests, the TDX module must be loaded
> and properly initialized using SEAMCALLs defined by TDX architecture.
> This series assumes both the P-SEAMLDR and the TDX module are loaded by
> BIOS before the kernel boots.
> 
> There's no CPUID or MSR to detect either the P-SEAMLDR or the TDX module.
> Instead, detecting them can be done by using P-SEAMLDR's SEAMLDR.INFO
> SEAMCALL to detect P-SEAMLDR.  The success of this SEAMCALL means the
> P-SEAMLDR is loaded.  The P-SEAMLDR information returned by this
> SEAMCALL further tells whether TDX module is loaded.

There's a bit of information missing here.  The kernel might not know
the state of things being loaded.  A previous kernel might have loaded
it and left it in an unknown state.

> The TDX module is initialized in multiple steps:
> 
>         1) Global initialization;
>         2) Logical-CPU scope initialization;
>         3) Enumerate the TDX module capabilities;
>         4) Configure the TDX module about usable memory ranges and
>            global KeyID information;
>         5) Package-scope configuration for the global KeyID;
>         6) Initialize TDX metadata for usable memory ranges based on 4).
> 
> Step 2) requires calling some SEAMCALL on all "BIOS-enabled" (in MADT
> table) logical cpus, otherwise step 4) will fail.  Step 5) requires
> calling SEAMCALL on at least one cpu on all packages.
> 
> TDX module can also be shut down at any time during module's lifetime, by
> calling SEAMCALL on all "BIOS-enabled" logical cpus.
> 
> == Design Considerations ==
> 
> 1. Lazy TDX module initialization on-demand by caller

This doesn't really tell us what "lazy" is or what the alternatives are.

There are basically two ways the TDX module could be loaded.  Either:
  * In early boot
or
  * At runtime just before the first TDX guest is run

This series implements the runtime loading.

> None of the steps in the TDX module initialization process must be done
> during kernel boot.  This series doesn't initialize TDX at boot time, but
> instead, provides two functions to allow caller to detect and initialize
> TDX on demand:
> 
>         if (tdx_detect())
>                 goto no_tdx;
>         if (tdx_init())
>                 goto no_tdx;
> 
> This approach has below pros:
> 
> 1) Initializing the TDX module requires to reserve ~1/256th system RAM as
> metadata.  Enabling TDX on demand allows only to consume this memory when
> TDX is truly needed (i.e. when KVM wants to create TD guests).
> 
> 2) Both detecting and initializing the TDX module require calling
> SEAMCALL.  However, SEAMCALL requires CPU being already in VMX operation
> (VMXON has been done).  So far, KVM is the only user of TDX, and it
> already handles VMXON/VMXOFF.  Therefore, letting KVM to initialize TDX
> on-demand avoids handling VMXON/VMXOFF (which is not that trivial) in
> core-kernel.  Also, in long term, likely a reference based VMXON/VMXOFF
> approach is needed since more kernel components will need to handle
> VMXON/VMXONFF.
> 
> 3) It is more flexible to support "TDX module runtime update" (not in
> this series).  After updating to the new module at runtime, kernel needs
> to go through the initialization process again.  For the new module,
> it's possible the metadata allocated for the old module cannot be reused
> for the new module, and needs to be re-allocated again.
> 
> 2. Kernel policy on TDX memory
> 
> Host kernel is responsible for choosing which memory regions can be used
> as TDX memory, and configuring those memory regions to the TDX module by
> using an array of "TD Memory Regions" (TDMR), which is a data structure
> defined by TDX architecture.


This is putting the cart before the horse.  Don't define the details up
front.

	The TDX architecture allows the VMM to designate specific memory
	as usable for TDX private memory.  This series chooses to
	designate _all_ system RAM as TDX to avoid having to modify the
	page allocator to distinguish TDX and non-TDX-capable memory

... then go on to explain the details.

> The first generation of TDX essentially guarantees that all system RAM
> memory regions (excluding the memory below 1MB) can be used as TDX
> memory.  To avoid having to modify the page allocator to distinguish TDX
> and non-TDX allocation, this series chooses to use all system RAM as TDX
> memory.
> 
> E820 table is used to find all system RAM entries.  Following
> e820__memblock_setup(), both E820_TYPE_RAM and E820_TYPE_RESERVED_KERN
> types are treated as TDX memory, and contiguous ranges in the same NUMA
> node are merged together (similar to memblock_add()) before trimming the
> non-page-aligned part.

This e820 cruft is too much detail for a cover letter.  In general, once
you start talking about individual functions, you've gone too far in the
cover letter.

> 3. Memory hotplug
> 
> The first generation of TDX architecturally doesn't support memory
> hotplug.  And the first generation of TDX-capable platforms don't support
> physical memory hotplug.  Since it physically cannot happen, this series
> doesn't add any check in ACPI memory hotplug code path to disable it.
> 
> A special case of memory hotplug is adding NVDIMM as system RAM using
> kmem driver.  However the first generation of TDX-capable platforms
> cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> happen either.

What prevents this code from today's code being run on tomorrow's
platforms and breaking these assumptions?

> Another case is admin can use 'memmap' kernel command line to create
> legacy PMEMs and use them as TD guest memory, or theoretically, can use
> kmem driver to add them as system RAM.  To avoid having to change memory
> hotplug code to prevent this from happening, this series always include
> legacy PMEMs when constructing TDMRs so they are also TDX memory.
> 
> 4. CPU hotplug
> 
> The first generation of TDX architecturally doesn't support ACPI CPU
> hotplug.  All logical cpus are enabled by BIOS in MADT table.  Also, the
> first generation of TDX-capable platforms don't support ACPI CPU hotplug
> either.  Since this physically cannot happen, this series doesn't add any
> check in ACPI CPU hotplug code path to disable it.
> 
> Also, only TDX module initialization requires all BIOS-enabled cpus are
> online.  After the initialization, any logical cpu can be brought down
> and brought up to online again later.  Therefore this series doesn't
> change logical CPU hotplug either.
> 
> 5. TDX interaction with kexec()
> 
> If TDX is ever enabled and/or used to run any TD guests, the cachelines
> of TDX private memory, including PAMTs, used by TDX module need to be
> flushed before transiting to the new kernel otherwise they may silently
> corrupt the new kernel.  Similar to SME, this series flushes cache in
> stop_this_cpu().

What does this have to do with kexec()?  What's a PAMT?

> The TDX module can be initialized only once during its lifetime.  The
> first generation of TDX doesn't have interface to reset TDX module to

				      ^ an

> uninitialized state so it can be initialized again.
> 
> This implies:
> 
>   - If the old kernel fails to initialize TDX, the new kernel cannot
>     use TDX too unless the new kernel fixes the bug which leads to
>     initialization failure in the old kernel and can resume from where
>     the old kernel stops. This requires certain coordination between
>     the two kernels.

OK, but what does this *MEAN*?

>   - If the old kernel has initialized TDX successfully, the new kernel
>     may be able to use TDX if the two kernels have the exactly same
>     configurations on the TDX module. It further requires the new kernel
>     to reserve the TDX metadata pages (allocated by the old kernel) in
>     its page allocator. It also requires coordination between the two
>     kernels.  Furthermore, if kexec() is done when there are active TD
>     guests running, the new kernel cannot use TDX because it's extremely
>     hard for the old kernel to pass all TDX private pages to the new
>     kernel.
> 
> Given that, this series doesn't support TDX after kexec() (except the
> old kernel doesn't attempt to initialize TDX at all).
> 
> And this series doesn't shut down TDX module but leaves it open during
> kexec().  It is because shutting down TDX module requires CPU being in
> VMX operation but there's no guarantee of this during kexec().  Leaving
> the TDX module open is not the best case, but it is OK since the new
> kernel won't be able to use TDX anyway (therefore TDX module won't run
> at all).

tl;dr: kexec() doesn't work with this code.

Right?

That doesn't seem good.
Kai Huang April 27, 2022, 1:15 a.m. UTC | #3
On Tue, 2022-04-26 at 13:13 -0700, Dave Hansen wrote:
> On 4/5/22 21:49, Kai Huang wrote:
> > SEAM VMX root operation is designed to host a CPU-attested, software
> > module called the 'TDX module' which implements functions to manage
> > crypto protected VMs called Trust Domains (TD).  SEAM VMX root is also
> 
> "crypto protected"?  What the heck is that?

How about "crypto-protected"?  I googled and it seems it is used by someone
else.

> 
> > designed to host a CPU-attested, software module called the 'Intel
> > Persistent SEAMLDR (Intel P-SEAMLDR)' to load and update the TDX module.
> > 
> > Host kernel transits to either the P-SEAMLDR or the TDX module via a new
> 
>  ^ The

Thanks.

> 
> > SEAMCALL instruction.  SEAMCALLs are host-side interface functions
> > defined by the P-SEAMLDR and the TDX module around the new SEAMCALL
> > instruction.  They are similar to a hypercall, except they are made by
> > host kernel to the SEAM software modules.
> 
> This is still missing some important high-level things, like that the
> TDX module is protected from the untrusted VMM.  Heck, it forgets to
> mention that the VMM itself is untrusted and the TDX module replaces
> things that the VMM usually does.
> 
> It would also be nice to mention here how this compares with SEV-SNP.
> Where is the TDX module in that design?  Why doesn't SEV need all this code?
> 
> > TDX leverages Intel Multi-Key Total Memory Encryption (MKTME) to crypto
> > protect TD guests.  TDX reserves part of MKTME KeyID space as TDX private
> > KeyIDs, which can only be used by software runs in SEAM.  The physical
> 
> 					    ^ which

Thanks.

> 
> > address bits for encoding TDX private KeyID are treated as reserved bits
> > when not in SEAM operation.  The partitioning of MKTME KeyIDs and TDX
> > private KeyIDs is configured by BIOS.
> > 
> > Before being able to manage TD guests, the TDX module must be loaded
> > and properly initialized using SEAMCALLs defined by TDX architecture.
> > This series assumes both the P-SEAMLDR and the TDX module are loaded by
> > BIOS before the kernel boots.
> > 
> > There's no CPUID or MSR to detect either the P-SEAMLDR or the TDX module.
> > Instead, detecting them can be done by using P-SEAMLDR's SEAMLDR.INFO
> > SEAMCALL to detect P-SEAMLDR.  The success of this SEAMCALL means the
> > P-SEAMLDR is loaded.  The P-SEAMLDR information returned by this
> > SEAMCALL further tells whether TDX module is loaded.
> 
> There's a bit of information missing here.  The kernel might not know
> the state of things being loaded.  A previous kernel might have loaded
> it and left it in an unknown state.
> 
> > The TDX module is initialized in multiple steps:
> > 
> >         1) Global initialization;
> >         2) Logical-CPU scope initialization;
> >         3) Enumerate the TDX module capabilities;
> >         4) Configure the TDX module about usable memory ranges and
> >            global KeyID information;
> >         5) Package-scope configuration for the global KeyID;
> >         6) Initialize TDX metadata for usable memory ranges based on 4).
> > 
> > Step 2) requires calling some SEAMCALL on all "BIOS-enabled" (in MADT
> > table) logical cpus, otherwise step 4) will fail.  Step 5) requires
> > calling SEAMCALL on at least one cpu on all packages.
> > 
> > TDX module can also be shut down at any time during module's lifetime, by
> > calling SEAMCALL on all "BIOS-enabled" logical cpus.
> > 
> > == Design Considerations ==
> > 
> > 1. Lazy TDX module initialization on-demand by caller
> 
> This doesn't really tell us what "lazy" is or what the alternatives are.
> 
> There are basically two ways the TDX module could be loaded.  Either:
>   * In early boot
> or
>   * At runtime just before the first TDX guest is run
> 
> This series implements the runtime loading.

OK will do.

> 
> > None of the steps in the TDX module initialization process must be done
> > during kernel boot.  This series doesn't initialize TDX at boot time, but
> > instead, provides two functions to allow caller to detect and initialize
> > TDX on demand:
> > 
> >         if (tdx_detect())
> >                 goto no_tdx;
> >         if (tdx_init())
> >                 goto no_tdx;
> > 
> > This approach has below pros:
> > 
> > 1) Initializing the TDX module requires to reserve ~1/256th system RAM as
> > metadata.  Enabling TDX on demand allows only to consume this memory when
> > TDX is truly needed (i.e. when KVM wants to create TD guests).
> > 
> > 2) Both detecting and initializing the TDX module require calling
> > SEAMCALL.  However, SEAMCALL requires CPU being already in VMX operation
> > (VMXON has been done).  So far, KVM is the only user of TDX, and it
> > already handles VMXON/VMXOFF.  Therefore, letting KVM to initialize TDX
> > on-demand avoids handling VMXON/VMXOFF (which is not that trivial) in
> > core-kernel.  Also, in long term, likely a reference based VMXON/VMXOFF
> > approach is needed since more kernel components will need to handle
> > VMXON/VMXONFF.
> > 
> > 3) It is more flexible to support "TDX module runtime update" (not in
> > this series).  After updating to the new module at runtime, kernel needs
> > to go through the initialization process again.  For the new module,
> > it's possible the metadata allocated for the old module cannot be reused
> > for the new module, and needs to be re-allocated again.
> > 
> > 2. Kernel policy on TDX memory
> > 
> > Host kernel is responsible for choosing which memory regions can be used
> > as TDX memory, and configuring those memory regions to the TDX module by
> > using an array of "TD Memory Regions" (TDMR), which is a data structure
> > defined by TDX architecture.
> 
> 
> This is putting the cart before the horse.  Don't define the details up
> front.
> 
> 	The TDX architecture allows the VMM to designate specific memory
> 	as usable for TDX private memory.  This series chooses to
> 	designate _all_ system RAM as TDX to avoid having to modify the
> 	page allocator to distinguish TDX and non-TDX-capable memory
> 
> ... then go on to explain the details.

Thanks.  Will update.

> 
> > The first generation of TDX essentially guarantees that all system RAM
> > memory regions (excluding the memory below 1MB) can be used as TDX
> > memory.  To avoid having to modify the page allocator to distinguish TDX
> > and non-TDX allocation, this series chooses to use all system RAM as TDX
> > memory.
> > 
> > E820 table is used to find all system RAM entries.  Following
> > e820__memblock_setup(), both E820_TYPE_RAM and E820_TYPE_RESERVED_KERN
> > types are treated as TDX memory, and contiguous ranges in the same NUMA
> > node are merged together (similar to memblock_add()) before trimming the
> > non-page-aligned part.
> 
> This e820 cruft is too much detail for a cover letter.  In general, once
> you start talking about individual functions, you've gone too far in the
> cover letter.

Will remove.

> 
> > 3. Memory hotplug
> > 
> > The first generation of TDX architecturally doesn't support memory
> > hotplug.  And the first generation of TDX-capable platforms don't support
> > physical memory hotplug.  Since it physically cannot happen, this series
> > doesn't add any check in ACPI memory hotplug code path to disable it.
> > 
> > A special case of memory hotplug is adding NVDIMM as system RAM using
> > kmem driver.  However the first generation of TDX-capable platforms
> > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> > happen either.
> 
> What prevents this code from today's code being run on tomorrow's
> platforms and breaking these assumptions?

I forgot to add below (which is in the documentation patch):

"This can be enhanced when future generation of TDX starts to support ACPI
memory hotplug, or NVDIMM and TDX can be enabled simultaneously on the
same platform."

Is this acceptable?

> 
> > Another case is admin can use 'memmap' kernel command line to create
> > legacy PMEMs and use them as TD guest memory, or theoretically, can use
> > kmem driver to add them as system RAM.  To avoid having to change memory
> > hotplug code to prevent this from happening, this series always include
> > legacy PMEMs when constructing TDMRs so they are also TDX memory.
> > 
> > 4. CPU hotplug
> > 
> > The first generation of TDX architecturally doesn't support ACPI CPU
> > hotplug.  All logical cpus are enabled by BIOS in MADT table.  Also, the
> > first generation of TDX-capable platforms don't support ACPI CPU hotplug
> > either.  Since this physically cannot happen, this series doesn't add any
> > check in ACPI CPU hotplug code path to disable it.
> > 
> > Also, only TDX module initialization requires all BIOS-enabled cpus are
> > online.  After the initialization, any logical cpu can be brought down
> > and brought up to online again later.  Therefore this series doesn't
> > change logical CPU hotplug either.
> > 
> > 5. TDX interaction with kexec()
> > 
> > If TDX is ever enabled and/or used to run any TD guests, the cachelines
> > of TDX private memory, including PAMTs, used by TDX module need to be
> > flushed before transiting to the new kernel otherwise they may silently
> > corrupt the new kernel.  Similar to SME, this series flushes cache in
> > stop_this_cpu().
> 
> What does this have to do with kexec()?  What's a PAMT?

The point is the dirty cachelines of TDX private memory must be flushed,
otherwise they may silently corrupt the new kexec()-ed kernel.

Will use "TDX metadata" instead of "PAMT".  The former has already been
mentioned above.

> 
> > The TDX module can be initialized only once during its lifetime.  The
> > first generation of TDX doesn't have interface to reset TDX module to
> 
> 				      ^ an

Thanks.

> 
> > uninitialized state so it can be initialized again.
> > 
> > This implies:
> > 
> >   - If the old kernel fails to initialize TDX, the new kernel cannot
> >     use TDX too unless the new kernel fixes the bug which leads to
> >     initialization failure in the old kernel and can resume from where
> >     the old kernel stops. This requires certain coordination between
> >     the two kernels.
> 
> OK, but what does this *MEAN*?

This means we need to extend the information which the old kernel passes to the
new kernel.  But I don't think it's feasible.  I'll refine this kexec() section
to make it more concise next version.

> 
> >   - If the old kernel has initialized TDX successfully, the new kernel
> >     may be able to use TDX if the two kernels have the exactly same
> >     configurations on the TDX module. It further requires the new kernel
> >     to reserve the TDX metadata pages (allocated by the old kernel) in
> >     its page allocator. It also requires coordination between the two
> >     kernels.  Furthermore, if kexec() is done when there are active TD
> >     guests running, the new kernel cannot use TDX because it's extremely
> >     hard for the old kernel to pass all TDX private pages to the new
> >     kernel.
> > 
> > Given that, this series doesn't support TDX after kexec() (except the
> > old kernel doesn't attempt to initialize TDX at all).
> > 
> > And this series doesn't shut down TDX module but leaves it open during
> > kexec().  It is because shutting down TDX module requires CPU being in
> > VMX operation but there's no guarantee of this during kexec().  Leaving
> > the TDX module open is not the best case, but it is OK since the new
> > kernel won't be able to use TDX anyway (therefore TDX module won't run
> > at all).
> 
> tl;dr: kexec() doesn't work with this code.
> 
> Right?
> 
> That doesn't seem good.

It can work in my understanding.  We just need to flush cache before booting to
the new kernel.
Dave Hansen April 27, 2022, 9:59 p.m. UTC | #4
On 4/26/22 18:15, Kai Huang wrote:
> On Tue, 2022-04-26 at 13:13 -0700, Dave Hansen wrote:
>> On 4/5/22 21:49, Kai Huang wrote:
>>> SEAM VMX root operation is designed to host a CPU-attested, software
>>> module called the 'TDX module' which implements functions to manage
>>> crypto protected VMs called Trust Domains (TD).  SEAM VMX root is also
>>
>> "crypto protected"?  What the heck is that?
> 
> How about "crypto-protected"?  I googled and it seems it is used by someone
> else.

Cryptography itself doesn't provide (much) protection in the TDX
architecture.  TDX guests are isolated from the VMM in ways that
traditional guests are not, but that has almost nothing to do with
cryptography.

Is it cryptography that keeps the host from reading guest private data
in the clear?  Is it cryptography that keeps the host from reading guest
ciphertext?  Does cryptography enforce the extra rules of Secure-EPT?

>>> 3. Memory hotplug
>>>
>>> The first generation of TDX architecturally doesn't support memory
>>> hotplug.  And the first generation of TDX-capable platforms don't support
>>> physical memory hotplug.  Since it physically cannot happen, this series
>>> doesn't add any check in ACPI memory hotplug code path to disable it.
>>>
>>> A special case of memory hotplug is adding NVDIMM as system RAM using
>>> kmem driver.  However the first generation of TDX-capable platforms
>>> cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
>>> happen either.
>>
>> What prevents this code from today's code being run on tomorrow's
>> platforms and breaking these assumptions?
> 
> I forgot to add below (which is in the documentation patch):
> 
> "This can be enhanced when future generation of TDX starts to support ACPI
> memory hotplug, or NVDIMM and TDX can be enabled simultaneously on the
> same platform."
> 
> Is this acceptable?

No, Kai.

You're basically saying: *this* code doesn't work with feature A, B and
C.  Then, you're pivoting to say that it doesn't matter because one
version of Intel's hardware doesn't support A, B, or C.

I don't care about this *ONE* version of the hardware.  I care about
*ALL* the hardware that this code will ever support.  *ALL* the hardware
on which this code will run.

In 5 years, if someone takes this code and runs it on Intel hardware
with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?

You can't just ignore the problems because they're not present on one
version of the hardware.

>>> Another case is admin can use 'memmap' kernel command line to create
>>> legacy PMEMs and use them as TD guest memory, or theoretically, can use
>>> kmem driver to add them as system RAM.  To avoid having to change memory
>>> hotplug code to prevent this from happening, this series always include
>>> legacy PMEMs when constructing TDMRs so they are also TDX memory.
>>>
>>> 4. CPU hotplug
>>>
>>> The first generation of TDX architecturally doesn't support ACPI CPU
>>> hotplug.  All logical cpus are enabled by BIOS in MADT table.  Also, the
>>> first generation of TDX-capable platforms don't support ACPI CPU hotplug
>>> either.  Since this physically cannot happen, this series doesn't add any
>>> check in ACPI CPU hotplug code path to disable it.
>>>
>>> Also, only TDX module initialization requires all BIOS-enabled cpus are
>>> online.  After the initialization, any logical cpu can be brought down
>>> and brought up to online again later.  Therefore this series doesn't
>>> change logical CPU hotplug either.
>>>
>>> 5. TDX interaction with kexec()
>>>
>>> If TDX is ever enabled and/or used to run any TD guests, the cachelines
>>> of TDX private memory, including PAMTs, used by TDX module need to be
>>> flushed before transiting to the new kernel otherwise they may silently
>>> corrupt the new kernel.  Similar to SME, this series flushes cache in
>>> stop_this_cpu().
>>
>> What does this have to do with kexec()?  What's a PAMT?
> 
> The point is the dirty cachelines of TDX private memory must be flushed
> otherwise they may slightly corrupt the new kexec()-ed kernel.
> 
> Will use "TDX metadata" instead of "PAMT".  The former has already been
> mentioned above.

Longer description for the patch itself:

TDX memory encryption is built on top of MKTME which uses physical
address aliases to designate encryption keys.  This architecture is not
cache coherent.  Software is responsible for flushing the CPU caches
when memory changes keys.  When kexec()'ing, memory can be repurposed
from TDX use to non-TDX use, changing the effective encryption key.

Cover-letter-level description:

Just like SME, TDX hosts require special cache flushing before kexec().

>>> uninitialized state so it can be initialized again.
>>>
>>> This implies:
>>>
>>>   - If the old kernel fails to initialize TDX, the new kernel cannot
>>>     use TDX too unless the new kernel fixes the bug which leads to
>>>     initialization failure in the old kernel and can resume from where
>>>     the old kernel stops. This requires certain coordination between
>>>     the two kernels.
>>
>> OK, but what does this *MEAN*?
> 
> This means we need to extend the information which the old kernel passes to the
> new kernel.  But I don't think it's feasible.  I'll refine this kexec() section
> to make it more concise next version.
> 
>>
>>>   - If the old kernel has initialized TDX successfully, the new kernel
>>>     may be able to use TDX if the two kernels have the exactly same
>>>     configurations on the TDX module. It further requires the new kernel
>>>     to reserve the TDX metadata pages (allocated by the old kernel) in
>>>     its page allocator. It also requires coordination between the two
>>>     kernels.  Furthermore, if kexec() is done when there are active TD
>>>     guests running, the new kernel cannot use TDX because it's extremely
>>>     hard for the old kernel to pass all TDX private pages to the new
>>>     kernel.
>>>
>>> Given that, this series doesn't support TDX after kexec() (except the
>>> old kernel doesn't attempt to initialize TDX at all).
>>>
>>> And this series doesn't shut down TDX module but leaves it open during
>>> kexec().  It is because shutting down TDX module requires CPU being in
>>> VMX operation but there's no guarantee of this during kexec().  Leaving
>>> the TDX module open is not the best case, but it is OK since the new
>>> kernel won't be able to use TDX anyway (therefore TDX module won't run
>>> at all).
>>
>> tl;dr: kexec() doesn't work with this code.
>>
>> Right?
>>
>> That doesn't seem good.
> 
> It can work in my understanding.  We just need to flush cache before booting to
> the new kernel.

What about all the concerns about TDX module configuration changing?
Kai Huang April 28, 2022, 12:37 a.m. UTC | #5
On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> On 4/26/22 18:15, Kai Huang wrote:
> > On Tue, 2022-04-26 at 13:13 -0700, Dave Hansen wrote:
> > > On 4/5/22 21:49, Kai Huang wrote:
> > > > SEAM VMX root operation is designed to host a CPU-attested, software
> > > > module called the 'TDX module' which implements functions to manage
> > > > crypto protected VMs called Trust Domains (TD).  SEAM VMX root is also
> > > 
> > > "crypto protected"?  What the heck is that?
> > 
> > How about "crypto-protected"?  I googled and it seems it is used by someone
> > else.
> 
> Cryptography itself doesn't provide (much) protection in the TDX
> architecture.  TDX guests are isolated from the VMM in ways that
> traditional guests are not, but that has almost nothing to do with
> cryptography.
> 
> Is it cryptography that keeps the host from reading guest private data
> in the clear?  Is it cryptography that keeps the host from reading guest
> ciphertext?  Does cryptography enforce the extra rules of Secure-EPT?

OK will change to "protected VMs" in this entire series.

> 
> > > > 3. Memory hotplug
> > > > 
> > > > The first generation of TDX architecturally doesn't support memory
> > > > hotplug.  And the first generation of TDX-capable platforms don't support
> > > > physical memory hotplug.  Since it physically cannot happen, this series
> > > > doesn't add any check in ACPI memory hotplug code path to disable it.
> > > > 
> > > > A special case of memory hotplug is adding NVDIMM as system RAM using
> > > > kmem driver.  However the first generation of TDX-capable platforms
> > > > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> > > > happen either.
> > > 
> > > What prevents this code from today's code being run on tomorrow's
> > > platforms and breaking these assumptions?
> > 
> > I forgot to add below (which is in the documentation patch):
> > 
> > "This can be enhanced when future generation of TDX starts to support ACPI
> > memory hotplug, or NVDIMM and TDX can be enabled simultaneously on the
> > same platform."
> > 
> > Is this acceptable?
> 
> No, Kai.
> 
> You're basically saying: *this* code doesn't work with feature A, B and
> C.  Then, you're pivoting to say that it doesn't matter because one
> version of Intel's hardware doesn't support A, B, or C.
> 
> I don't care about this *ONE* version of the hardware.  I care about
> *ALL* the hardware that this code will ever support.  *ALL* the hardware
> on which this code will run.
> 
> In 5 years, if someone takes this code and runs it on Intel hardware
> with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?

I thought we could document this in the documentation saying that this code can
only work on TDX machines that don't have above capabilities (SPR for now).  We
can change the code and the documentation  when we add the support of those
features in the future, and update the documentation.

If 5 years later someone takes this code, he/she should take a look at the
documentation and figure out that he/she should choose a newer kernel if the
machine support those features.

I'll think about design solutions if above doesn't look good for you.

> 
> You can't just ignore the problems because they're not present on one
> version of the hardware.
> 
> > > > Another case is admin can use 'memmap' kernel command line to create
> > > > legacy PMEMs and use them as TD guest memory, or theoretically, can use
> > > > kmem driver to add them as system RAM.  To avoid having to change memory
> > > > hotplug code to prevent this from happening, this series always include
> > > > legacy PMEMs when constructing TDMRs so they are also TDX memory.
> > > > 
> > > > 4. CPU hotplug
> > > > 
> > > > The first generation of TDX architecturally doesn't support ACPI CPU
> > > > hotplug.  All logical cpus are enabled by BIOS in MADT table.  Also, the
> > > > first generation of TDX-capable platforms don't support ACPI CPU hotplug
> > > > either.  Since this physically cannot happen, this series doesn't add any
> > > > check in ACPI CPU hotplug code path to disable it.
> > > > 
> > > > Also, only TDX module initialization requires all BIOS-enabled cpus are
> > > > online.  After the initialization, any logical cpu can be brought down
> > > > and brought up to online again later.  Therefore this series doesn't
> > > > change logical CPU hotplug either.
> > > > 
> > > > 5. TDX interaction with kexec()
> > > > 
> > > > If TDX is ever enabled and/or used to run any TD guests, the cachelines
> > > > of TDX private memory, including PAMTs, used by TDX module need to be
> > > > flushed before transiting to the new kernel otherwise they may silently
> > > > corrupt the new kernel.  Similar to SME, this series flushes cache in
> > > > stop_this_cpu().
> > > 
> > > What does this have to do with kexec()?  What's a PAMT?
> > 
> > The point is the dirty cachelines of TDX private memory must be flushed
> > otherwise they may slightly corrupt the new kexec()-ed kernel.
> > 
> > Will use "TDX metadata" instead of "PAMT".  The former has already been
> > mentioned above.
> 
> Longer description for the patch itself:
> 
> TDX memory encryption is built on top of MKTME which uses physical
> address aliases to designate encryption keys.  This architecture is not
> cache coherent.  Software is responsible for flushing the CPU caches
> when memory changes keys.  When kexec()'ing, memory can be repurposed
> from TDX use to non-TDX use, changing the effective encryption key.
> 
> Cover-letter-level description:
> 
> Just like SME, TDX hosts require special cache flushing before kexec().

Thanks.

> 
> > > > uninitialized state so it can be initialized again.
> > > > 
> > > > This implies:
> > > > 
> > > >   - If the old kernel fails to initialize TDX, the new kernel cannot
> > > >     use TDX too unless the new kernel fixes the bug which leads to
> > > >     initialization failure in the old kernel and can resume from where
> > > >     the old kernel stops. This requires certain coordination between
> > > >     the two kernels.
> > > 
> > > OK, but what does this *MEAN*?
> > 
> > This means we need to extend the information which the old kernel passes to the
> > new kernel.  But I don't think it's feasible.  I'll refine this kexec() section
> > to make it more concise next version.
> > 
> > > 
> > > >   - If the old kernel has initialized TDX successfully, the new kernel
> > > >     may be able to use TDX if the two kernels have the exactly same
> > > >     configurations on the TDX module. It further requires the new kernel
> > > >     to reserve the TDX metadata pages (allocated by the old kernel) in
> > > >     its page allocator. It also requires coordination between the two
> > > >     kernels.  Furthermore, if kexec() is done when there are active TD
> > > >     guests running, the new kernel cannot use TDX because it's extremely
> > > >     hard for the old kernel to pass all TDX private pages to the new
> > > >     kernel.
> > > > 
> > > > Given that, this series doesn't support TDX after kexec() (except the
> > > > old kernel doesn't attempt to initialize TDX at all).
> > > > 
> > > > And this series doesn't shut down TDX module but leaves it open during
> > > > kexec().  It is because shutting down TDX module requires CPU being in
> > > > VMX operation but there's no guarantee of this during kexec().  Leaving
> > > > the TDX module open is not the best case, but it is OK since the new
> > > > kernel won't be able to use TDX anyway (therefore TDX module won't run
> > > > at all).
> > > 
> > > tl;dr: kexec() doesn't work with this code.
> > > 
> > > Right?
> > > 
> > > That doesn't seem good.
> > 
> > It can work in my understanding.  We just need to flush cache before booting to
> > the new kernel.
> 
> What about all the concerns about TDX module configuration changing?
> 

Leaving the TDX module in a fully initialized state or a shutdown state (in
case of an error during its initialization) to the new kernel is fine.  If the
new kernel doesn't use TDX at all, then the TDX module won't access memory
using its global TDX KeyID.  If the new kernel wants to use TDX, it will fail
on the very
first SEAMCALL when it tries to initialize the TDX module, and won't use
SEAMCALL to call the TDX module again.  If the new kernel doesn't follow this,
then it is a bug in the new kernel, or the new kernel is malicious, in which
case it can potentially corrupt the data.  But I don't think we need to consider
this as if the new kernel is malicious, then it can corrupt data anyway.

Does this make sense?

Is there any other concerns that I missed?
Dave Hansen April 28, 2022, 12:50 a.m. UTC | #6
On 4/27/22 17:37, Kai Huang wrote:
> On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
>> In 5 years, if someone takes this code and runs it on Intel hardware
>> with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> 
> I thought we could document this in the documentation saying that this code can
> only work on TDX machines that don't have above capabilities (SPR for now).  We
> can change the code and the documentation  when we add the support of those
> features in the future, and update the documentation.
> 
> If 5 years later someone takes this code, he/she should take a look at the
> documentation and figure out that he/she should choose a newer kernel if the
> machine support those features.
> 
> I'll think about design solutions if above doesn't look good for you.

No, it doesn't look good to me.

You can't just say:

	/*
	 * This code will eat puppies if used on systems with hotplug.
	 */

and merrily await the puppy bloodbath.

If it's not compatible, then you have to *MAKE* it not compatible in a
safe, controlled way.

>> You can't just ignore the problems because they're not present on one
>> version of the hardware.

Please, please read this again ^^

>> What about all the concerns about TDX module configuration changing?
> 
> Leaving the TDX module in fully initialized state or shutdown state (in case of
> error during it's initialization) to the new kernel is fine.  If the new kernel
> doesn't use TDX at all, then the TDX module won't access memory using it's
> global TDX KeyID.  If the new kernel wants to use TDX, it will fail on the very
> first SEAMCALL when it tries to initialize the TDX module, and won't use
> SEAMCALL to call the TDX module again.  If the new kernel doesn't follow this,
> then it is a bug in the new kernel, or the new kernel is malicious, in which
> case it can potentially corrupt the data.  But I don't think we need to consider
> this as if the new kernel is malicious, then it can corrupt data anyway.
> 
> Does this make sense?

No, I'm pretty lost.  But, I'll look at the next version of this with
fresh eyes and hopefully you'll have had time to streamline the text by
then.
Kai Huang April 28, 2022, 12:58 a.m. UTC | #7
On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> On 4/27/22 17:37, Kai Huang wrote:
> > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > 
> > I thought we could document this in the documentation saying that this code can
> > only work on TDX machines that don't have above capabilities (SPR for now).  We
> > can change the code and the documentation  when we add the support of those
> > features in the future, and update the documentation.
> > 
> > If 5 years later someone takes this code, he/she should take a look at the
> > documentation and figure out that he/she should choose a newer kernel if the
> > machine support those features.
> > 
> > I'll think about design solutions if above doesn't look good for you.
> 
> No, it doesn't look good to me.
> 
> You can't just say:
> 
> 	/*
> 	 * This code will eat puppies if used on systems with hotplug.
> 	 */
> 
> and merrily await the puppy bloodbath.
> 
> If it's not compatible, then you have to *MAKE* it not compatible in a
> safe, controlled way.
> 
> > > You can't just ignore the problems because they're not present on one
> > > version of the hardware.
> 
> Please, please read this again ^^

OK.  I'll think about solutions and come back later.

> 
> > > What about all the concerns about TDX module configuration changing?
> > 
> > Leaving the TDX module in fully initialized state or shutdown state (in case of
> > error during it's initialization) to the new kernel is fine.  If the new kernel
> > doesn't use TDX at all, then the TDX module won't access memory using it's
> > global TDX KeyID.  If the new kernel wants to use TDX, it will fail on the very
> > first SEAMCALL when it tries to initialize the TDX module, and won't use
> > SEAMCALL to call the TDX module again.  If the new kernel doesn't follow this,
> > then it is a bug in the new kernel, or the new kernel is malicious, in which
> > case it can potentially corrupt the data.  But I don't think we need to consider
> > this as if the new kernel is malicious, then it can corrupt data anyway.
> > 
> > Does this make sense?
> 
> No, I'm pretty lost.  But, I'll look at the next version of this with
> fresh eyes and hopefully you'll have had time to streamline the text by
> then.

OK thanks.
Dan Williams April 28, 2022, 1:01 a.m. UTC | #8
On Tue, Apr 26, 2022 at 1:10 PM Dave Hansen <dave.hansen@intel.com> wrote:
[..]
> > 3. Memory hotplug
> >
> > The first generation of TDX architecturally doesn't support memory
> > hotplug.  And the first generation of TDX-capable platforms don't support
> > physical memory hotplug.  Since it physically cannot happen, this series
> > doesn't add any check in ACPI memory hotplug code path to disable it.
> >
> > A special case of memory hotplug is adding NVDIMM as system RAM using

Saw "NVDIMM" mentioned while browsing this, so stopped to make a comment...

> > kmem driver.  However the first generation of TDX-capable platforms
> > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> > happen either.
>
> What prevents this code from today's code being run on tomorrow's
> platforms and breaking these assumptions?

The assumption is already broken today with NVDIMM-N. The lack of
DDR-T support on TDX enabled platforms has zero effect on DDR-based
persistent memory solutions. In other words, please describe the
actual software and hardware conflicts at play here, and do not make
the mistake of assuming that "no DDR-T support on TDX platforms" ==
"no NVDIMM support".

> > Another case is admin can use 'memmap' kernel command line to create
> > legacy PMEMs and use them as TD guest memory, or theoretically, can use
> > kmem driver to add them as system RAM.  To avoid having to change memory
> > hotplug code to prevent this from happening, this series always include
> > legacy PMEMs when constructing TDMRs so they are also TDX memory.

I am not sure what you are trying to say here?

> > 4. CPU hotplug
> >
> > The first generation of TDX architecturally doesn't support ACPI CPU
> > hotplug.  All logical cpus are enabled by BIOS in MADT table.  Also, the
> > first generation of TDX-capable platforms don't support ACPI CPU hotplug
> > either.  Since this physically cannot happen, this series doesn't add any
> > check in ACPI CPU hotplug code path to disable it.

What are the actual challenges posed to TDX with respect to CPU hotplug?

> > Also, only TDX module initialization requires all BIOS-enabled cpus are

Please define "BIOS-enabled" cpus. There is no "BIOS-enabled" line in
/proc/cpuinfo for example.
Kai Huang April 28, 2022, 1:21 a.m. UTC | #9
On Wed, 2022-04-27 at 18:01 -0700, Dan Williams wrote:
> On Tue, Apr 26, 2022 at 1:10 PM Dave Hansen <dave.hansen@intel.com> wrote:
> [..]
> > > 3. Memory hotplug
> > > 
> > > The first generation of TDX architecturally doesn't support memory
> > > hotplug.  And the first generation of TDX-capable platforms don't support
> > > physical memory hotplug.  Since it physically cannot happen, this series
> > > doesn't add any check in ACPI memory hotplug code path to disable it.
> > > 
> > > A special case of memory hotplug is adding NVDIMM as system RAM using
> 
> Saw "NVDIMM" mentioned while browsing this, so stopped to make a comment...
> 
> > > kmem driver.  However the first generation of TDX-capable platforms
> > > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> > > happen either.
> > 
> > What prevents this code from today's code being run on tomorrow's
> > platforms and breaking these assumptions?
> 
> The assumption is already broken today with NVDIMM-N. The lack of
> DDR-T support on TDX enabled platforms has zero effect on DDR-based
> persistent memory solutions. In other words, please describe the
> actual software and hardware conflicts at play here, and do not make
> the mistake of assuming that "no DDR-T support on TDX platforms" ==
> "no NVDIMM support".

Sorry, I got this information from the planning team or execution team, I
guess.  I was told NVDIMM and TDX cannot "co-exist" on the first generation of
TDX-capable machines.  "Co-exist" means they cannot be turned on simultaneously
on the same platform.  I am also not aware of NVDIMM-N, nor of the difference
between DDR-based and DDR-T-based persistent memory.  Could you give some more
background here so I can take a look?

> 
> > > Another case is admin can use 'memmap' kernel command line to create
> > > legacy PMEMs and use them as TD guest memory, or theoretically, can use
> > > kmem driver to add them as system RAM.  To avoid having to change memory
> > > hotplug code to prevent this from happening, this series always include
> > > legacy PMEMs when constructing TDMRs so they are also TDX memory.
> 
> I am not sure what you are trying to say here?

We want to always make sure the memory managed by the page allocator is TDX
memory.  So if the legacy PMEMs are unconditionally configured as TDX memory,
then we don't need to prevent them from being added as system memory via the
kmem driver.

> 
> > > 4. CPU hotplug
> > > 
> > > The first generation of TDX architecturally doesn't support ACPI CPU
> > > hotplug.  All logical cpus are enabled by BIOS in MADT table.  Also, the
> > > first generation of TDX-capable platforms don't support ACPI CPU hotplug
> > > either.  Since this physically cannot happen, this series doesn't add any
> > > check in ACPI CPU hotplug code path to disable it.
> 
> What are the actual challenges posed to TDX with respect to CPU hotplug?

During the TDX module initialization, there is a step to call SEAMCALL on all
logical cpus to initialize per-cpu TDX state.  TDX doesn't support initializing
new hot-added CPUs after that initialization is done.  There are MCHECK/BIOS
changes to enforce this too, I guess, but I don't know the details.
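
Roughly, the per-cpu step looks like the sketch below (just a sketch: the
seamcall() wrapper and the TDH_SYS_LP_INIT constant are assumed to exist as in
this series, and error handling is simplified):

static atomic_t tdx_lp_init_err = ATOMIC_INIT(0);

/* Run the per-cpu init SEAMCALL (TDH.SYS.LP.INIT) on one logical cpu. */
static void tdx_lp_init(void *unused)
{
	if (seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL))
		atomic_inc(&tdx_lp_init_err);
}

static int tdx_init_all_lps(void)
{
	/* CPU hotplug must be disabled; no cpu may come up after this. */
	on_each_cpu(tdx_lp_init, NULL, 1);

	return atomic_read(&tdx_lp_init_err) ? -EIO : 0;
}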

> 
> > > Also, only TDX module initialization requires all BIOS-enabled cpus are
> 
> Please define "BIOS-enabled" cpus. There is no "BIOS-enabled" line in
> /proc/cpuinfo for example.

It means the CPUs with the "enabled" bit set in the MADT table.
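
That is the same flag the kernel already checks when parsing the MADT at boot.
A simplified sketch:

#include <linux/acpi.h>

/*
 * Simplified sketch: a MADT local APIC entry describes a "BIOS-enabled"
 * cpu when the ACPI_MADT_ENABLED bit is set in its flags.
 */
static bool madt_lapic_enabled(struct acpi_madt_local_apic *lapic)
{
	return lapic->lapic_flags & ACPI_MADT_ENABLED;
}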
Kai Huang April 29, 2022, 1:40 a.m. UTC | #10
On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote:
> On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> > On 4/27/22 17:37, Kai Huang wrote:
> > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > > 
> > > I thought we could document this in the documentation saying that this code can
> > > only work on TDX machines that don't have above capabilities (SPR for now).  We
> > > can change the code and the documentation  when we add the support of those
> > > features in the future, and update the documentation.
> > > 
> > > If 5 years later someone takes this code, he/she should take a look at the
> > > documentation and figure out that he/she should choose a newer kernel if the
> > > machine support those features.
> > > 
> > > I'll think about design solutions if above doesn't look good for you.
> > 
> > No, it doesn't look good to me.
> > 
> > You can't just say:
> > 
> > 	/*
> > 	 * This code will eat puppies if used on systems with hotplug.
> > 	 */
> > 
> > and merrily await the puppy bloodbath.
> > 
> > If it's not compatible, then you have to *MAKE* it not compatible in a
> > safe, controlled way.
> > 
> > > > You can't just ignore the problems because they're not present on one
> > > > version of the hardware.
> > 
> > Please, please read this again ^^
> 
> OK.  I'll think about solutions and come back later.
> > 

Hi Dave,

I think we have two approaches to handle memory hotplug interaction with the TDX
module initialization.  

The first approach is simple.  We just block memory from being added as system
RAM managed by the page allocator when the platform supports TDX [1].  It seems
we can add an arch-specific check to __add_memory_resource() and reject the new
memory resource if the platform supports TDX.  __add_memory_resource() is called
by both __add_memory() and add_memory_driver_managed(), so it prevents both
adding NVDIMM as system RAM and normal ACPI memory hotplug [2].
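
Concretely, I am thinking of something like the sketch below, called from
__add_memory_resource() (sketch only; platform_tdx_enabled() is a placeholder
for whatever SEAMRR/KeyID check we settle on):

static int arch_memory_add_precheck(u64 start, u64 size)
{
	/*
	 * Reject all memory hot-add on TDX-capable platforms so that the
	 * page allocator only ever manages memory that was covered by
	 * TDMRs when the TDX module was initialized.
	 */
	if (platform_tdx_enabled()) {
		pr_err("TDX: rejecting hot-add of memory [0x%llx, 0x%llx)\n",
		       start, start + size);
		return -EINVAL;
	}

	return 0;
}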

The second approach is relatively more complicated.  Instead of directly
rejecting the new memory resource in __add_memory_resource(), we check whether
the memory resource can be added based on the CMRs and the TDX module
initialization status.  This is feasible because, with the latest public
P-SEAMLDR spec, we can get the CMRs from a P-SEAMLDR SEAMCALL [3].  So we can
detect the P-SEAMLDR and get the CMR info during kernel boot.  And in
__add_memory_resource() we do the check below:

	tdx_init_disable();	/*similar to cpu_hotplug_disable() */
	if (tdx_module_initialized())
		// reject memory hotplug
	else if (new_memory_resource NOT in CMRs)
		// reject memory hotplug
	else
		allow memory hotplug
	tdx_init_enable();	/*similar to cpu_hotplug_enable() */

tdx_init_disable() temporarily disables TDX module initialization by trying to
grab the mutex.  If the TDX module initialization is already ongoing, then it
waits until it completes.
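
In code, the pair could be as simple as the sketch below (lock name is
illustrative; tdx_init(), the module initialization path, would hold the same
lock for its whole duration):

static DEFINE_MUTEX(tdx_module_lock);

void tdx_init_disable(void)
{
	/* Blocks a new init and waits for an ongoing one to finish. */
	mutex_lock(&tdx_module_lock);
}

void tdx_init_enable(void)
{
	mutex_unlock(&tdx_module_lock);
}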

This should work better for future platforms, but would require non-trivially
more code, as we need to add VMXON/VMXOFF support to the core kernel to detect
the CMRs using SEAMCALL.  A side advantage is that with VMXON in the core kernel
we can shut down the TDX module in kexec().

But for this series I think the second approach is overkill and we can choose to
use the first, simpler approach?

Any suggestions?

[1] "Platform supports TDX" means SEAMRR is enabled and there are at least 2
TDX KeyIDs.  Or we can just check that SEAMRR is enabled, as in practice SEAMRR
being enabled means the machine is TDX-capable, and for now a TDX-capable
machine doesn't support ACPI memory hotplug.

[2] It prevents adding legacy PMEM as system RAM too, but I think that's fine.
If the user wants legacy PMEM then it is unlikely they will add it back and use
it as system RAM.  The user is also unlikely to use legacy PMEM as TD guest
memory directly, as TD guests are likely to use a new memfd backend which allows
private pages not accessible from userspace, so in this way we can exclude
legacy PMEM from the TDMRs.

[3] Please refer to SEAMLDR.SEAMINFO SEAMCALL in latest P-SEAMLDR spec:
https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
> > >
Dan Williams April 29, 2022, 2:58 a.m. UTC | #11
On Wed, Apr 27, 2022 at 6:21 PM Kai Huang <kai.huang@intel.com> wrote:
>
> On Wed, 2022-04-27 at 18:01 -0700, Dan Williams wrote:
> > On Tue, Apr 26, 2022 at 1:10 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > [..]
> > > > 3. Memory hotplug
> > > >
> > > > The first generation of TDX architecturally doesn't support memory
> > > > hotplug.  And the first generation of TDX-capable platforms don't support
> > > > physical memory hotplug.  Since it physically cannot happen, this series
> > > > doesn't add any check in ACPI memory hotplug code path to disable it.
> > > >
> > > > A special case of memory hotplug is adding NVDIMM as system RAM using
> >
> > Saw "NVDIMM" mentioned while browsing this, so stopped to make a comment...
> >
> > > > kmem driver.  However the first generation of TDX-capable platforms
> > > > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> > > > happen either.
> > >
> > > What prevents this code from today's code being run on tomorrow's
> > > platforms and breaking these assumptions?
> >
> > The assumption is already broken today with NVDIMM-N. The lack of
> > DDR-T support on TDX enabled platforms has zero effect on DDR-based
> > persistent memory solutions. In other words, please describe the
> > actual software and hardware conflicts at play here, and do not make
> > the mistake of assuming that "no DDR-T support on TDX platforms" ==
> > "no NVDIMM support".
>
> Sorry I got this information from planning team or execution team I guess. I was
> told NVDIMM and TDX cannot "co-exist" on the first generation of TDX capable
> machine.  "co-exist" means they cannot be turned on simultaneously on the same
> platform.  I am also not aware NVDIMM-N, nor the difference between DDR based
> and DDR-T based persistent memory.  Could you give some more background here so
> I can take a look?

My rough understanding is that TDX makes use of metadata communicated
"on the wire" for DDR, but that infrastructure is not there for DDR-T.
However, there are plenty of DDR based NVDIMMs that use super-caps /
batteries and flash to save contents. I believe the concern for TDX is
that the kernel needs to know not to use TDX-accepted PMEM as PMEM
because the contents saved by the DIMM's onboard energy source are
unreadable outside of a TD.

Here is one of the links that comes up in a search for NVDIMM-N.

https://www.snia.org/educational-library/what-you-can-do-nvdimm-n-and-nvdimm-p-2019

>
> >
> > > > Another case is admin can use 'memmap' kernel command line to create
> > > > legacy PMEMs and use them as TD guest memory, or theoretically, can use
> > > > kmem driver to add them as system RAM.  To avoid having to change memory
> > > > hotplug code to prevent this from happening, this series always include
> > > > legacy PMEMs when constructing TDMRs so they are also TDX memory.
> >
> > I am not sure what you are trying to say here?
>
> We want to always make sure the memory managed by page allocator is TDX memory.

That only seems possible if the kernel is given a TDX capable physical
address map at the beginning of time.

> So if the legacy PMEMs are unconditionally configured as TDX memory, then we
> don't need to prevent them from being added as system memory via kmem driver.

I think that is too narrow of a focus.

Does a memory map exist for the physical address ranges that are TDX
capable? Please don't say EFI_MEMORY_CPU_CRYPTO, as that single bit is
ambiguous beyond the point of utility across the industry's entire
range of confidential computing memory capabilities.

One strawman would be an ACPI table with contents like:

struct acpi_protected_memory {
   struct range range;
   uuid_t platform_mem_crypto_capability;
};

With some way to map those uuids to a set of platform vendor specific
constraints and specifications. Some would be shared across
confidential computing vendors, some might be unique. Otherwise, I do
not see how you enforce the expectation of "all memory in the page
allocator is TDX capable". The other alternative is that *none* of the
memory in the page allocator is TDX capable and a special memory
allocation device is used to map memory for TDs. In either case a map
of all possible TDX memory is needed, and the discussion above seems
like an incomplete / "hopeful" proposal about the memory that dax_kmem, or
other sources, might online. See the CXL CEDT CFMWS (CXL Fixed Memory
Window Structure) as an example of an ACPI table that sets the
kernel's expectations about how a physical address range might be
used.

https://www.computeexpresslink.org/spec-landing

>
> >
> > > > 4. CPU hotplug
> > > >
> > > > The first generation of TDX architecturally doesn't support ACPI CPU
> > > > hotplug.  All logical cpus are enabled by BIOS in MADT table.  Also, the
> > > > first generation of TDX-capable platforms don't support ACPI CPU hotplug
> > > > either.  Since this physically cannot happen, this series doesn't add any
> > > > check in ACPI CPU hotplug code path to disable it.
> >
> > What are the actual challenges posed to TDX with respect to CPU hotplug?
>
> During the TDX module initialization, there is a step to call SEAMCALL on all
> logical cpus to initialize per-cpu TDX staff.  TDX doesn't support initializing
> the new hot-added CPUs after the initialization.  There are MCHECK/BIOS changes
> to enforce this check too I guess but I don't know details about this.

Is there an ACPI table that indicates CPU-x passed the check? Or since
the BIOS is invoked in the CPU-online path, is it trusted to suppress
those events for CPUs outside of the mcheck domain?

> > > > Also, only TDX module initialization requires all BIOS-enabled cpus are
> >
> > Please define "BIOS-enabled" cpus. There is no "BIOS-enabled" line in
> > /proc/cpuinfo for example.
>
> It means the CPUs with "enable" bit set in the MADT table.

That just indicates the presently enabled CPUs; a hot-add event then
changes the state of newly present CPUs to enabled. Per the above, is
the BIOS responsible for rejecting those new CPUs, or is the kernel?
Dan Williams April 29, 2022, 3:04 a.m. UTC | #12
On Thu, Apr 28, 2022 at 6:40 PM Kai Huang <kai.huang@intel.com> wrote:
>
> On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote:
> > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> > > On 4/27/22 17:37, Kai Huang wrote:
> > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > > >
> > > > I thought we could document this in the documentation saying that this code can
> > > > only work on TDX machines that don't have above capabilities (SPR for now).  We
> > > > can change the code and the documentation  when we add the support of those
> > > > features in the future, and update the documentation.
> > > >
> > > > If 5 years later someone takes this code, he/she should take a look at the
> > > > documentation and figure out that he/she should choose a newer kernel if the
> > > > machine support those features.
> > > >
> > > > I'll think about design solutions if above doesn't look good for you.
> > >
> > > No, it doesn't look good to me.
> > >
> > > You can't just say:
> > >
> > >     /*
> > >      * This code will eat puppies if used on systems with hotplug.
> > >      */
> > >
> > > and merrily await the puppy bloodbath.
> > >
> > > If it's not compatible, then you have to *MAKE* it not compatible in a
> > > safe, controlled way.
> > >
> > > > > You can't just ignore the problems because they're not present on one
> > > > > version of the hardware.
> > >
> > > Please, please read this again ^^
> >
> > OK.  I'll think about solutions and come back later.
> > >
>
> Hi Dave,
>
> I think we have two approaches to handle memory hotplug interaction with the TDX
> module initialization.
>
> The first approach is simple.  We just block memory from being added as system
> RAM managed by page allocator when the platform supports TDX [1]. It seems we
> can add some arch-specific-check to __add_memory_resource() and reject the new
> memory resource if platform supports TDX.  __add_memory_resource() is called by
> both __add_memory() and add_memory_driver_managed() so it prevents from adding
> NVDIMM as system RAM and normal ACPI memory hotplug [2].

What if the memory being added *is* TDX capable? What if someone
wanted to manage a memory range as soft-reserved and move it back and
forth from the core-mm to device access? That should be perfectly
acceptable as long as the memory is TDX capable.

> The second approach is relatively more complicated.  Instead of directly
> rejecting the new memory resource in __add_memory_resource(), we check whether
> the memory resource can be added based on CMR and the TDX module initialization
> status.   This is feasible as with the latest public P-SEAMLDR spec, we can get
> CMR from P-SEAMLDR SEAMCALL[3].  So we can detect P-SEAMLDR and get CMR info
> during kernel boots.  And in __add_memory_resource() we do below check:
>
>         tdx_init_disable();     /*similar to cpu_hotplug_disable() */
>         if (tdx_module_initialized())
>                 // reject memory hotplug
>         else if (new_memory_resource NOT in CMRs)
>                 // reject memory hotplug
>         else
>                 allow memory hotplug
>         tdx_init_enable();      /*similar to cpu_hotplug_enable() */
>
> tdx_init_disable() temporarily disables TDX module initialization by trying to
> grab the mutex.  If the TDX module initialization is already on going, then it
> waits until it completes.
>
> This should work better for future platforms, but would requires non-trivial
> more code as we need to add VMXON/VMXOFF support to the core-kernel to detect
> CMR using  SEAMCALL.  A side advantage is with VMXON in core-kernel we can
> shutdown the TDX module in kexec().
>
> But for this series I think the second approach is overkill and we can choose to
> use the first simple approach?

This still sounds like it is trying to solve symptoms and not the root
problem. Why must the core-mm never have non-TDX memory when VMs are
fine to operate with either core-mm pages or memory from other sources
like hugetlbfs and device-dax?
Kai Huang April 29, 2022, 5:35 a.m. UTC | #13
On Thu, 2022-04-28 at 20:04 -0700, Dan Williams wrote:
> On Thu, Apr 28, 2022 at 6:40 PM Kai Huang <kai.huang@intel.com> wrote:
> > 
> > On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote:
> > > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> > > > On 4/27/22 17:37, Kai Huang wrote:
> > > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > > > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > > > > 
> > > > > I thought we could document this in the documentation saying that this code can
> > > > > only work on TDX machines that don't have above capabilities (SPR for now).  We
> > > > > can change the code and the documentation  when we add the support of those
> > > > > features in the future, and update the documentation.
> > > > > 
> > > > > If 5 years later someone takes this code, he/she should take a look at the
> > > > > documentation and figure out that he/she should choose a newer kernel if the
> > > > > machine support those features.
> > > > > 
> > > > > I'll think about design solutions if above doesn't look good for you.
> > > > 
> > > > No, it doesn't look good to me.
> > > > 
> > > > You can't just say:
> > > > 
> > > >     /*
> > > >      * This code will eat puppies if used on systems with hotplug.
> > > >      */
> > > > 
> > > > and merrily await the puppy bloodbath.
> > > > 
> > > > If it's not compatible, then you have to *MAKE* it not compatible in a
> > > > safe, controlled way.
> > > > 
> > > > > > You can't just ignore the problems because they're not present on one
> > > > > > version of the hardware.
> > > > 
> > > > Please, please read this again ^^
> > > 
> > > OK.  I'll think about solutions and come back later.
> > > > 
> > 
> > Hi Dave,
> > 
> > I think we have two approaches to handle memory hotplug interaction with the TDX
> > module initialization.
> > 
> > The first approach is simple.  We just block memory from being added as system
> > RAM managed by page allocator when the platform supports TDX [1]. It seems we
> > can add some arch-specific-check to __add_memory_resource() and reject the new
> > memory resource if platform supports TDX.  __add_memory_resource() is called by
> > both __add_memory() and add_memory_driver_managed() so it prevents from adding
> > NVDIMM as system RAM and normal ACPI memory hotplug [2].
> 
> What if the memory being added *is* TDX capable? What if someone
> wanted to manage a memory range as soft-reserved and move it back and
> forth from the core-mm to device access. That should be perfectly
> acceptable as long as the memory is TDX capable.

Please see below.

> 
> > The second approach is relatively more complicated.  Instead of directly
> > rejecting the new memory resource in __add_memory_resource(), we check whether
> > the memory resource can be added based on CMR and the TDX module initialization
> > status.   This is feasible as with the latest public P-SEAMLDR spec, we can get
> > CMR from P-SEAMLDR SEAMCALL[3].  So we can detect P-SEAMLDR and get CMR info
> > during kernel boots.  And in __add_memory_resource() we do below check:
> > 
> >         tdx_init_disable();     /*similar to cpu_hotplug_disable() */
> >         if (tdx_module_initialized())
> >                 // reject memory hotplug
> >         else if (new_memory_resource NOT in CMRs)
> >                 // reject memory hotplug
> >         else
> >                 allow memory hotplug
> >         tdx_init_enable();      /*similar to cpu_hotplug_enable() */
> > 
> > tdx_init_disable() temporarily disables TDX module initialization by trying to
> > grab the mutex.  If the TDX module initialization is already on going, then it
> > waits until it completes.
> > 
> > This should work better for future platforms, but would requires non-trivial
> > more code as we need to add VMXON/VMXOFF support to the core-kernel to detect
> > CMR using  SEAMCALL.  A side advantage is with VMXON in core-kernel we can
> > shutdown the TDX module in kexec().
> > 
> > But for this series I think the second approach is overkill and we can choose to
> > use the first simple approach?
> 
> This still sounds like it is trying to solve symptoms and not the root
> problem. Why must the core-mm never have non-TDX memory when VMs are
> fine to operate with either core-mm pages or memory from other sources
> like hugetlbfs and device-dax?

Basically we don't want to modify the page allocator API to distinguish TDX and
non-TDX allocations.  For instance, we don't want a new GFP_TDX.

There's another series done by Chao, "KVM: mm: fd-based approach for supporting
KVM guest private memory", which essentially allows KVM to ask the guest memory
backend to allocate pages without having to mmap() them into userspace.

https://lore.kernel.org/kvm/20220310140911.50924-1-chao.p.peng@linux.intel.com/

More specifically, memfd will support a new MFD_INACCESSIBLE flag when it is
created so that all pages associated with this memfd will be TDX-capable memory.
The backend will need to implement a new memfile_notifier_ops to allow KVM to
get and put memory pages.

struct memfile_pfn_ops {
	long (*get_lock_pfn)(struct inode *inode, pgoff_t offset, int *order);
	void (*put_unlock_pfn)(unsigned long pfn);
};

With that, it is the backend's responsibility to implement the get_lock_pfn()
callback, in which the backend needs to ensure a TDX private page is allocated.
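
To illustrate, a sketch of how KVM could consume these ops (written against the
RFC's proposed interface; tdx_map_into_sept() is a placeholder for the real
S-EPT mapping path):

static int td_map_private_page(const struct memfile_pfn_ops *ops,
			       struct inode *inode, pgoff_t offset)
{
	int order, ret;
	long pfn;

	/* Backend allocates and pins a TDX-capable page for this offset. */
	pfn = ops->get_lock_pfn(inode, offset, &order);
	if (pfn < 0)
		return pfn;

	ret = tdx_map_into_sept((unsigned long)pfn, order);

	ops->put_unlock_pfn((unsigned long)pfn);
	return ret;
}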

For TD guests, KVM should be enforced to only use those fd-based backends.  I am
not sure whether anonymous pages should be supported anymore.

Sean, please correct me if I am wrong?

Currently only shmem is extended to support it.  By ensuring pages in the page
allocator are all TDX memory, shmem can be extended easily to support TD guests.
 
If device-dax and hugetlbfs want to support TD guests then they should
implement those callbacks and ensure only TDX memory is allocated.  For
instance, when a future TDX generation supports NVDIMM (i.e. NVDIMM is included
in the CMRs), then device-dax pages can be included as TDX memory when
initializing the TDX module, and device-dax can implement its own callbacks to
support allocating pages for TD guests.

But the TDX architecture can be changed to support memory hotplug in a more
graceful way in the future.  For instance, it could choose to support
dynamically adding any convertible memory as TDX memory *after* TDX module
initialization.  But this is just my brainstorming.

Anyway, for now, since only shmem (or + anonymous pages) can be used to create
TD guests, I think we can just reject any memory hot-add when the platform
supports TDX, as described in the first simple approach.  Eventually we may need
something like the second approach, but the TDX architecture can evolve too.
Kai Huang April 29, 2022, 5:43 a.m. UTC | #14
On Thu, 2022-04-28 at 19:58 -0700, Dan Williams wrote:
> On Wed, Apr 27, 2022 at 6:21 PM Kai Huang <kai.huang@intel.com> wrote:
> > 
> > On Wed, 2022-04-27 at 18:01 -0700, Dan Williams wrote:
> > > On Tue, Apr 26, 2022 at 1:10 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > > [..]
> > > > > 3. Memory hotplug
> > > > > 
> > > > > The first generation of TDX architecturally doesn't support memory
> > > > > hotplug.  And the first generation of TDX-capable platforms don't support
> > > > > physical memory hotplug.  Since it physically cannot happen, this series
> > > > > doesn't add any check in ACPI memory hotplug code path to disable it.
> > > > > 
> > > > > A special case of memory hotplug is adding NVDIMM as system RAM using
> > > 
> > > Saw "NVDIMM" mentioned while browsing this, so stopped to make a comment...
> > > 
> > > > > kmem driver.  However the first generation of TDX-capable platforms
> > > > > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot
> > > > > happen either.
> > > > 
> > > > What prevents this code from today's code being run on tomorrow's
> > > > platforms and breaking these assumptions?
> > > 
> > > The assumption is already broken today with NVDIMM-N. The lack of
> > > DDR-T support on TDX enabled platforms has zero effect on DDR-based
> > > persistent memory solutions. In other words, please describe the
> > > actual software and hardware conflicts at play here, and do not make
> > > the mistake of assuming that "no DDR-T support on TDX platforms" ==
> > > "no NVDIMM support".
> > 
> > Sorry I got this information from planning team or execution team I guess. I was
> > told NVDIMM and TDX cannot "co-exist" on the first generation of TDX capable
> > machine.  "co-exist" means they cannot be turned on simultaneously on the same
> > platform.  I am also not aware NVDIMM-N, nor the difference between DDR based
> > and DDR-T based persistent memory.  Could you give some more background here so
> > I can take a look?
> 
> My rough understanding is that TDX makes use of metadata communicated
> "on the wire" for DDR, but that infrastructure is not there for DDR-T.
> However, there are plenty of DDR based NVDIMMs that use super-caps /
> batteries and flash to save contents. I believe the concern for TDX is
> that the kernel needs to know not use TDX accepted PMEM as PMEM
> because the contents saved by the DIMM's onboard energy source are
> unreadable outside of a TD.
> 
> Here is one of the links that comes up in a search for NVDIMM-N.
> 
> https://www.snia.org/educational-library/what-you-can-do-nvdimm-n-and-nvdimm-p-2019

Thanks for the info.  I need some more time to digest these different types of
DDR and NVDIMMs.  However, I guess they are not quite relevant, since TDX has
the concept of a "Convertible Memory Region".  Please see below.

> 
> > 
> > > 
> > > > > Another case is admin can use 'memmap' kernel command line to create
> > > > > legacy PMEMs and use them as TD guest memory, or theoretically, can use
> > > > > kmem driver to add them as system RAM.  To avoid having to change memory
> > > > > hotplug code to prevent this from happening, this series always include
> > > > > legacy PMEMs when constructing TDMRs so they are also TDX memory.
> > > 
> > > I am not sure what you are trying to say here?
> > 
> > We want to always make sure the memory managed by page allocator is TDX memory.
> 
> That only seems possible if the kernel is given a TDX capable physical
> address map at the beginning of time.

Yes, the TDX architecture has the concept of a "Convertible Memory Region"
(CMR).  The memory used by TDX must be convertible memory.  The BIOS generates
an array of CMR entries during boot and they are verified by MCHECK.  CMRs are
static during the machine's runtime.

> 
> > So if the legacy PMEMs are unconditionally configured as TDX memory, then we
> > don't need to prevent them from being added as system memory via kmem driver.
> 
> I think that is too narrow of a focus.
> 
> Does a memory map exist for the physical address ranges that are TDX
> capable? Please don't say EFI_MEMORY_CPU_CRYPTO, as that single bit is
> ambiguous beyond the point of utility across the industry's entire
> range of confidential computing memory capabilities.
> 
> One strawman would be an ACPI table with contents like:
> 
> struct acpi_protected_memory {
>    struct range range;
>    uuid_t platform_mem_crypto_capability;
> };
> 
> With some way to map those uuids to a set of platform vendor specific
> constraints and specifications. Some would be shared across
> confidential computing vendors, some might be unique. Otherwise, I do
> not see how you enforce the expectation of "all memory in the page
> allocator is TDX capable". 
> 

Please see above.  TDX has CMRs.

> The other alternative is that *none* of the
> memory in the page allocator is TDX capable and a special memory
> allocation device is used to map memory for TDs. In either case a map
> of all possible TDX memory is needed and the discussion above seems
> like an incomplete / "hopeful" proposal about the memory dax_kmem, or
> other sources, might online. 

Yes, we are also developing a new memfd-based approach to support TD guest
memory.  Please see my other reply to you.


> See the CXL CEDT CFWMS (CXL Fixed Memory
> Window Structure) as an example of an ACPI table that sets the
> kernel's expectations about how a physical address range might be
> used.
> 
> https://www.computeexpresslink.org/spec-landing

Thanks for the info. I'll take a look to get some background.

> 
> > 
> > > 
> > > > > 4. CPU hotplug
> > > > > 
> > > > > The first generation of TDX architecturally doesn't support ACPI CPU
> > > > > hotplug.  All logical cpus are enabled by BIOS in MADT table.  Also, the
> > > > > first generation of TDX-capable platforms don't support ACPI CPU hotplug
> > > > > either.  Since this physically cannot happen, this series doesn't add any
> > > > > check in ACPI CPU hotplug code path to disable it.
> > > 
> > > What are the actual challenges posed to TDX with respect to CPU hotplug?
> > 
> > During the TDX module initialization, there is a step to call SEAMCALL on all
> > logical cpus to initialize per-cpu TDX staff.  TDX doesn't support initializing
> > the new hot-added CPUs after the initialization.  There are MCHECK/BIOS changes
> > to enforce this check too I guess but I don't know details about this.
> 
> Is there an ACPI table that indicates CPU-x passed the check? Or since
> the BIOS is invoked in the CPU-online path, is it trusted to suppress
> those events for CPUs outside of the mcheck domain?

No.  The TDX module (and the P-SEAMLDR) internally maintains some data to
record the total number of LPs and packages, which logical cpus have been
initialized, etc.

I asked Intel guys whether BIOS would suppress an ACPI CPU hotplug event but I
never got a concrete answer.  I'll try again.

> 
> > > > > Also, only TDX module initialization requires all BIOS-enabled cpus are
> > > 
> > > Please define "BIOS-enabled" cpus. There is no "BIOS-enabled" line in
> > > /proc/cpuinfo for example.
> > 
> > It means the CPUs with "enable" bit set in the MADT table.
> 
> That just indicates to the present CPUs and then a hot add event
> changes the state of now present CPUs to enabled. Per above is the
> BIOS responsible for rejecting those new CPUs, or is the kernel?

I'll ask the BIOS guys again to see whether the BIOS will suppress the ACPI CPU
hotplug event.  But I think we can have a simple patch to reject ACPI CPU
hotplug if the platform is TDX-capable?

Or do you think we don't need to explicitly reject ACPI CPU hotplug if we can
confirm with the BIOS guys that it will be suppressed on TDX-capable machines?
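
If we do reject it in the kernel, the check could be as small as the sketch
below (platform_tdx_enabled() is again a placeholder for the SEAMRR/KeyID
check), called from the ACPI CPU hot-add path:

static int tdx_cpu_hotadd_precheck(int cpu)
{
	/*
	 * The TDX module cannot initialize cpus that were not around
	 * when the per-cpu init SEAMCALLs were done, so refuse hot-add.
	 */
	if (platform_tdx_enabled()) {
		pr_err("TDX: rejecting hot-add of CPU %d\n", cpu);
		return -EOPNOTSUPP;
	}

	return 0;
}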
Dave Hansen April 29, 2022, 2:39 p.m. UTC | #15
On 4/28/22 19:58, Dan Williams wrote:
> That only seems possible if the kernel is given a TDX capable physical
> address map at the beginning of time.

TDX actually brings along its own memory map.  The "EAS" [1] has a lot
of info on it, if you know where to find it.  Here's the relevant chunk:

CMR - Convertible Memory Range -
	A range of physical memory configured by BIOS and verified by
	MCHECK. MCHECK verification is intended to help ensure that a
	CMR may be used to hold TDX memory pages encrypted with a
	private HKID.

So, the BIOS has the platform knowledge to enumerate this range.  It
stashes the information off somewhere that the TDX module can find it.
Then, during OS boot, the OS makes a SEAMCALL (TDH.SYS.CONFIG) to the
TDX module and gets the list of CMRs.

The OS then has to reconcile this CMR "memory map" against the regular
old BIOS-provided memory map, tossing out any memory regions which are
RAM, but not covered by a CMR, or disabling TDX entirely.

Fun, eh?
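
Roughly, the reconciliation boils down to a check like the sketch below
(struct cmr_info just mirrors the base/size pairs the TDX module reports;
naming is illustrative):

struct cmr_info {
	u64 base;
	u64 size;
};

/*
 * Is a RAM range from the BIOS-provided memory map usable as TDX
 * memory, i.e. fully covered by one of the reported CMRs?
 */
static bool range_covered_by_cmr(struct cmr_info *cmrs, int nr_cmrs,
				 u64 start, u64 end)
{
	int i;

	for (i = 0; i < nr_cmrs; i++) {
		u64 cmr_end = cmrs[i].base + cmrs[i].size;

		if (start >= cmrs[i].base && end <= cmr_end)
			return true;
	}

	/* RAM, but not convertible: toss it out or disable TDX. */
	return false;
}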

I'm still grappling with how this series handles the policy of what
memory to throw away when.

1.
https://www.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1eas.pdf
Dan Williams April 29, 2022, 3:18 p.m. UTC | #16
On Fri, Apr 29, 2022 at 7:39 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/28/22 19:58, Dan Williams wrote:
> > That only seems possible if the kernel is given a TDX capable physical
> > address map at the beginning of time.
>
> TDX actually brings along its own memory map.  The "EAS"[1]. has a lot
> of info on it, if you know where to find it.  Here's the relevant chunk:
>
> CMR - Convertible Memory Range -
>         A range of physical memory configured by BIOS and verified by
>         MCHECK. MCHECK verification is intended to help ensure that a
>         CMR may be used to hold TDX memory pages encrypted with a
>         private HKID.
>
> So, the BIOS has the platform knowledge to enumerate this range.  It
> stashes the information off somewhere that the TDX module can find it.
> Then, during OS boot, the OS makes a SEAMCALL (TDH.SYS.CONFIG) to the
> TDX module and gets the list of CMRs.
>
> The OS then has to reconcile this CMR "memory map" against the regular
> old BIOS-provided memory map, tossing out any memory regions which are
> RAM, but not covered by a CMR, or disabling TDX entirely.
>
> Fun, eh?

Yes, I want to challenge the idea that all core-mm memory must be TDX
capable. Instead, this feels more like something that wants a
hugetlbfs / dax-device like capability to ask the kernel to gather /
set-aside the enumerated TDX memory out of all the general purpose
memory it knows about and then VMs use that ABI to get access to
convertible memory. Trying to ensure that all page allocator memory is
TDX capable feels too restrictive with all the different ways pfns can
get into the allocator.
Dave Hansen April 29, 2022, 5:18 p.m. UTC | #17
On 4/29/22 08:18, Dan Williams wrote:
> Yes, I want to challenge the idea that all core-mm memory must be TDX
> capable. Instead, this feels more like something that wants a
> hugetlbfs / dax-device like capability to ask the kernel to gather /
> set-aside the enumerated TDX memory out of all the general purpose
> memory it knows about and then VMs use that ABI to get access to
> convertible memory. Trying to ensure that all page allocator memory is
> TDX capable feels too restrictive with all the different ways pfns can
> get into the allocator.

The KVM users are the problem here.  They use a variety of ABIs to get
memory and then hand it to KVM.  KVM basically just consumes the
physical addresses from the page tables.

Also, there's no _practical_ problem here today.  I can't actually think
of a case where any memory that ends up in the allocator on today's TDX
systems is not TDX capable.

Tomorrow's systems are going to be the problem.  They'll (presumably)
have a mix of CXL devices that will have varying capabilities.  Some
will surely lack the metadata storage for checksums and TD-owner bits.
TDX use will be *safe* on those systems: if you take this code and run
it on one of tomorrow's systems, it will notice the TDX-incompatible memory
and will disable TDX.

The only way around this that I can see is to introduce ABI today that
anticipates the needs of the future systems.  We could require that all
the KVM memory be "validated" before handing it to TDX.  Maybe a new
syscall that says: "make sure this mapping works for TDX".  It could be
new sysfs ABI which specifies which NUMA nodes contain TDX-capable memory.

But, neither of those really help with, say, a device-DAX mapping of
TDX-*IN*capable memory handed to KVM.  The "new syscall" would just
throw up its hands and leave users with the same result: TDX can't be
used.  The new sysfs ABI for NUMA nodes wouldn't clearly apply to
device-DAX because they don't respect the NUMA policy ABI.

I'm open to ideas here.  If there's a viable ABI we can introduce to
train TDX users today that will work tomorrow too, I'm all for it.
Dan Williams April 29, 2022, 5:48 p.m. UTC | #18
On Fri, Apr 29, 2022 at 10:18 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/29/22 08:18, Dan Williams wrote:
> > Yes, I want to challenge the idea that all core-mm memory must be TDX
> > capable. Instead, this feels more like something that wants a
> > hugetlbfs / dax-device like capability to ask the kernel to gather /
> > set-aside the enumerated TDX memory out of all the general purpose
> > memory it knows about and then VMs use that ABI to get access to
> > convertible memory. Trying to ensure that all page allocator memory is
> > TDX capable feels too restrictive with all the different ways pfns can
> > get into the allocator.
>
> The KVM users are the problem here.  They use a variety of ABIs to get
> memory and then hand it to KVM.  KVM basically just consumes the
> physical addresses from the page tables.
>
> Also, there's no _practical_ problem here today.  I can't actually think
> of a case where any memory that ends up in the allocator on today's TDX
> systems is not TDX capable.
>
> Tomorrow's systems are going to be the problem.  They'll (presumably)
> have a mix of CXL devices that will have varying capabilities.  Some
> will surely lack the metadata storage for checksums and TD-owner bits.
> TDX use will be *safe* on those systems: if you take this code and run
> it on one tomorrow's systems, it will notice the TDX-incompatible memory
> and will disable TDX.
>
> The only way around this that I can see is to introduce ABI today that
> anticipates the needs of the future systems.  We could require that all
> the KVM memory be "validated" before handing it to TDX.  Maybe a new
> syscall that says: "make sure this mapping works for TDX".  It could be
> new sysfs ABI which specifies which NUMA nodes contain TDX-capable memory.

Yes, node-id seems the only reasonable handle that can be used, and it
does not seem too onerous for a KVM user to have to set a node policy
preferring all the TDX / confidential-computing capable nodes.
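
From the VMM side that is just the normal mempolicy dance.  A userspace sketch
(link with -lnuma; the node ids are examples and would come from whatever sysfs
attribute ends up advertising TDX capability):

#include <numaif.h>
#include <stdio.h>

int bind_to_tdx_nodes(void)
{
	/* Example: nodes 0 and 1 advertised as TDX-capable. */
	unsigned long nodemask = (1UL << 0) | (1UL << 1);

	/* Restrict future allocations (guest RAM) to those nodes. */
	if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8)) {
		perror("set_mempolicy");
		return -1;
	}

	return 0;
}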

> But, neither of those really help with, say, a device-DAX mapping of
> TDX-*IN*capable memory handed to KVM.  The "new syscall" would just
> throw up its hands and leave users with the same result: TDX can't be
> used.  The new sysfs ABI for NUMA nodes wouldn't clearly apply to
> device-DAX because they don't respect the NUMA policy ABI.

They do have "target_node" attributes to associate node specific
metadata, and could certainly express target_node capabilities in its
own ABI. Then it's just a matter of making pfn_to_nid() do the right
thing so KVM kernel side can validate the capabilities of all inbound
pfns.

> I'm open to ideas here.  If there's a viable ABI we can introduce to
> train TDX users today that will work tomorrow too, I'm all for it.

In general, expressing NUMA node perf and node capabilities is
something Linux needs to get better at. HMAT data, for example, still
exists as sideband information ignored by numactl, but it feels
inevitable that perf and capability details become more of a
first-class citizen for applications that have these
mem-allocation-policy constraints in the presence of disparate memory
types.
Dave Hansen April 29, 2022, 6:34 p.m. UTC | #19
On 4/29/22 10:48, Dan Williams wrote:
>> But, neither of those really help with, say, a device-DAX mapping of
>> TDX-*IN*capable memory handed to KVM.  The "new syscall" would just
>> throw up its hands and leave users with the same result: TDX can't be
>> used.  The new sysfs ABI for NUMA nodes wouldn't clearly apply to
>> device-DAX because they don't respect the NUMA policy ABI.
> They do have "target_node" attributes to associate node specific
> metadata, and could certainly express target_node capabilities in its
> own ABI. Then it's just a matter of making pfn_to_nid() do the right
> thing so KVM kernel side can validate the capabilities of all inbound
> pfns.

Let's walk through how this would work with today's kernel on tomorrow's
hardware, without KVM validating PFNs:

1. daxaddr mmap("/dev/dax1234")
2. kvmfd = open("/dev/kvm")
3. ioctl(KVM_SET_USER_MEMORY_REGION, { daxaddr };
4. guest starts running
5. guest touches 'daxaddr'
6. Page fault handler maps 'daxaddr'
7. KVM finds new 'daxaddr' PTE
8. TDX code tries to add physical address to Secure-EPT
9. TDX "SEAMCALL" fails because page is not convertible
10. Guest dies

All we can do to improve on that is call something that pledges to only
map convertible memory at 'daxaddr'.  We can't *actually* validate the
physical addresses at mmap() time or even
KVM_SET_USER_MEMORY_REGION-time because the memory might not have been
allocated.

Those pledges are hard for anonymous memory though.  To fulfill the
pledge, we not only have to validate that the NUMA policy is compatible
at KVM_SET_USER_MEMORY_REGION, we also need to decline changes to the
policy that might undermine the pledge.
Dan Williams April 29, 2022, 6:47 p.m. UTC | #20
On Fri, Apr 29, 2022 at 11:34 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/29/22 10:48, Dan Williams wrote:
> >> But, neither of those really help with, say, a device-DAX mapping of
> >> TDX-*IN*capable memory handed to KVM.  The "new syscall" would just
> >> throw up its hands and leave users with the same result: TDX can't be
> >> used.  The new sysfs ABI for NUMA nodes wouldn't clearly apply to
> >> device-DAX because they don't respect the NUMA policy ABI.
> > They do have "target_node" attributes to associate node specific
> > metadata, and could certainly express target_node capabilities in its
> > own ABI. Then it's just a matter of making pfn_to_nid() do the right
> > thing so KVM kernel side can validate the capabilities of all inbound
> > pfns.
>
> Let's walk through how this would work with today's kernel on tomorrow's
> hardware, without KVM validating PFNs:
>
> 1. daxaddr mmap("/dev/dax1234")
> 2. kvmfd = open("/dev/kvm")
> 3. ioctl(KVM_SET_USER_MEMORY_REGION, { daxaddr };

At least for a file backed mapping the capability lookup could be done
here, no need to wait for the fault.

> 4. guest starts running
> 5. guest touches 'daxaddr'
> 6. Page fault handler maps 'daxaddr'
> 7. KVM finds new 'daxaddr' PTE
> 8. TDX code tries to add physical address to Secure-EPT
> 9. TDX "SEAMCALL" fails because page is not convertible
> 10. Guest dies
>
> All we can do to improve on that is call something that pledges to only
> map convertible memory at 'daxaddr'.  We can't *actually* validate the
> physical addresses at mmap() time or even
> KVM_SET_USER_MEMORY_REGION-time because the memory might not have been
> allocated.
>
> Those pledges are hard for anonymous memory though.  To fulfill the
> pledge, we not only have to validate that the NUMA policy is compatible
> at KVM_SET_USER_MEMORY_REGION, we also need to decline changes to the
> policy that might undermine the pledge.

I think it's less that the kernel needs to enforce a pledge and more
that an interface is needed to communicate the guest death reason.
I.e. "here is the impossible thing you asked for, next time set this
policy to avoid this problem".
Dave Hansen April 29, 2022, 7:20 p.m. UTC | #21
On 4/29/22 11:47, Dan Williams wrote:
> On Fri, Apr 29, 2022 at 11:34 AM Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 4/29/22 10:48, Dan Williams wrote:
>>>> But, neither of those really help with, say, a device-DAX mapping of
>>>> TDX-*IN*capable memory handed to KVM.  The "new syscall" would just
>>>> throw up its hands and leave users with the same result: TDX can't be
>>>> used.  The new sysfs ABI for NUMA nodes wouldn't clearly apply to
>>>> device-DAX because they don't respect the NUMA policy ABI.
>>> They do have "target_node" attributes to associate node specific
>>> metadata, and could certainly express target_node capabilities in its
>>> own ABI. Then it's just a matter of making pfn_to_nid() do the right
>>> thing so KVM kernel side can validate the capabilities of all inbound
>>> pfns.
>>
>> Let's walk through how this would work with today's kernel on tomorrow's
>> hardware, without KVM validating PFNs:
>>
>> 1. daxaddr mmap("/dev/dax1234")
>> 2. kvmfd = open("/dev/kvm")
>> 3. ioctl(KVM_SET_USER_MEMORY_REGION, { daxaddr };
> 
> At least for a file backed mapping the capability lookup could be done
> here, no need to wait for the fault.

For DAX mappings, sure.  But, anything that's backed by page cache, you
can't know until the RAM is allocated.

...
>> Those pledges are hard for anonymous memory though.  To fulfill the
>> pledge, we not only have to validate that the NUMA policy is compatible
>> at KVM_SET_USER_MEMORY_REGION, we also need to decline changes to the
>> policy that might undermine the pledge.
> 
> I think it's less that the kernel needs to enforce a pledge and more
> that an interface is needed to communicate the guest death reason.
> I.e. "here is the impossible thing you asked for, next time set this
> policy to avoid this problem".

IF this code is booted on a system where non-TDX-capable memory is
discovered, do we:
1. Disable TDX, printk() some nasty message, then boot as normal
or,
2a. Boot normally with TDX enabled
2b. Add enhanced error messages in case of TDH.MEM.PAGE.AUG/ADD failure
    (the "SEAMCALLs" which are the last line of defense and will reject
     the request to add non-TDX-capable memory to a guest).  Or maybe
    an even earlier message.

For #1, if TDX is on, we are quite sure it will work.  But, it will
probably throw up its hands on tomorrow's hardware.  (This patch set).

For #2, TDX might break (guests get killed) at runtime on tomorrow's
hardware, but it also might be just fine.  Users might be able to work
around things by, for instance, figuring out a NUMA policy which
excludes TDX-incapable memory. (I think what Dan is looking for)

Is that a fair summary?
Dan Williams April 29, 2022, 9:20 p.m. UTC | #22
On Fri, Apr 29, 2022 at 12:20 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/29/22 11:47, Dan Williams wrote:
> > On Fri, Apr 29, 2022 at 11:34 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >>
> >> On 4/29/22 10:48, Dan Williams wrote:
> >>>> But, neither of those really help with, say, a device-DAX mapping of
> >>>> TDX-*IN*capable memory handed to KVM.  The "new syscall" would just
> >>>> throw up its hands and leave users with the same result: TDX can't be
> >>>> used.  The new sysfs ABI for NUMA nodes wouldn't clearly apply to
> >>>> device-DAX because they don't respect the NUMA policy ABI.
> >>> They do have "target_node" attributes to associate node specific
> >>> metadata, and could certainly express target_node capabilities in its
> >>> own ABI. Then it's just a matter of making pfn_to_nid() do the right
> >>> thing so KVM kernel side can validate the capabilities of all inbound
> >>> pfns.
> >>
> >> Let's walk through how this would work with today's kernel on tomorrow's
> >> hardware, without KVM validating PFNs:
> >>
> >> 1. daxaddr mmap("/dev/dax1234")
> >> 2. kvmfd = open("/dev/kvm")
> >> 3. ioctl(KVM_SET_USER_MEMORY_REGION, { daxaddr };
> >
> > At least for a file backed mapping the capability lookup could be done
> > here, no need to wait for the fault.
>
> For DAX mappings, sure.  But, anything that's backed by page cache, you
> can't know until the RAM is allocated.
>
> ...
> >> Those pledges are hard for anonymous memory though.  To fulfill the
> >> pledge, we not only have to validate that the NUMA policy is compatible
> >> at KVM_SET_USER_MEMORY_REGION, we also need to decline changes to the
> >> policy that might undermine the pledge.
> >
> > I think it's less that the kernel needs to enforce a pledge and more
> > that an interface is needed to communicate the guest death reason.
> > I.e. "here is the impossible thing you asked for, next time set this
> > policy to avoid this problem".
>
> IF this code is booted on a system where non-TDX-capable memory is
> discovered, do we:
> 1. Disable TDX, printk() some nasty message, then boot as normal
> or,
> 2a. Boot normally with TDX enabled
> 2b. Add enhanced error messages in case of TDH.MEM.PAGE.AUG/ADD failure
>     (the "SEAMCALLs" which are the last line of defense and will reject
>      the request to add non-TDX-capable memory to a guest).  Or maybe
>     an even earlier message.
>
> For #1, if TDX is on, we are quite sure it will work.  But, it will
> probably throw up its hands on tomorrow's hardware.  (This patch set).
>
> For #2, TDX might break (guests get killed) at runtime on tomorrow's
> hardware, but it also might be just fine.  Users might be able to work
> around things by, for instance, figuring out a NUMA policy which
> excludes TDX-incapable memory. (I think what Dan is looking for)
>
> Is that a fair summary?

Yes, just the option for TDX and non-TDX to live alongside each
other... although in the past I have argued to do option-1 and enforce
it at the lowest level [1]. Like, platform BIOS is responsible for
disabling CXL if CXL support for a given CPU security feature is
missing. However, I think end users will want to have their
confidential computing and their capacity too. As long as adding that
support after the fact is not precluded, option-1 can be a way forward
until a concrete user for mixed mode shows up.

Is there something already like this today for people that, for
example, attempt to use PCI BAR mappings as memory? Or does KVM simply
allow for garbage-in garbage-out?

In the end the patches shouldn't talk about whether or not PMEM is
supported on a platform; that's irrelevant. What matters is that
misconfigurations can happen, should be rare to non-existent on
current platforms, and if it becomes a problem the kernel can grow ABI
to let userspace enumerate the conflicts.

[1]: https://lore.kernel.org/linux-cxl/CAPcyv4jMQbHYQssaDDDQFEbOR1v14VUnejcSwOP9VGUnZSsCKw@mail.gmail.com/
Dave Hansen April 29, 2022, 9:27 p.m. UTC | #23
On 4/29/22 14:20, Dan Williams wrote:
> Is there something already like this today for people that, for
> example, attempt to use PCI BAR mappings as memory? Or does KVM simply
> allow for garbage-in garbage-out?

I'm just guessing, but I _assume_ those garbage PCI BAR mappings are how
KVM does device passthrough.

I know that some KVM users even use mem= to chop down the kernel-owned
'struct page'-backed memory, then have a kind of /dev/mem driver to let
the memory get mapped back into userspace.  KVM is happy to pass through
those mappings.
Kai Huang May 2, 2022, 10:18 a.m. UTC | #24
On Fri, 2022-04-29 at 11:34 -0700, Dave Hansen wrote:
> On 4/29/22 10:48, Dan Williams wrote:
> > > But, neither of those really help with, say, a device-DAX mapping of
> > > TDX-*IN*capable memory handed to KVM.  The "new syscall" would just
> > > throw up its hands and leave users with the same result: TDX can't be
> > > used.  The new sysfs ABI for NUMA nodes wouldn't clearly apply to
> > > device-DAX because they don't respect the NUMA policy ABI.
> > They do have "target_node" attributes to associate node specific
> > metadata, and could certainly express target_node capabilities in its
> > own ABI. Then it's just a matter of making pfn_to_nid() do the right
> > thing so KVM kernel side can validate the capabilities of all inbound
> > pfns.
> 
> Let's walk through how this would work with today's kernel on tomorrow's
> hardware, without KVM validating PFNs:
> 
> 1. daxaddr mmap("/dev/dax1234")
> 2. kvmfd = open("/dev/kvm")
> 3. ioctl(KVM_SET_USER_MEMORY_REGION, { daxaddr };
> 4. guest starts running
> 5. guest touches 'daxaddr'
> 6. Page fault handler maps 'daxaddr'
> 7. KVM finds new 'daxaddr' PTE
> 8. TDX code tries to add physical address to Secure-EPT
> 9. TDX "SEAMCALL" fails because page is not convertible
> 10. Guest dies
> 
> All we can do to improve on that is call something that pledges to only
> map convertible memory at 'daxaddr'.  We can't *actually* validate the
> physical addresses at mmap() time or even
> KVM_SET_USER_MEMORY_REGION-time because the memory might not have been
> allocated.
> 
> Those pledges are hard for anonymous memory though.  To fulfill the
> pledge, we not only have to validate that the NUMA policy is compatible
> at KVM_SET_USER_MEMORY_REGION, we also need to decline changes to the
> policy that might undermine the pledge.

Hi Dave,

There's another series done by Chao, "KVM: mm: fd-based approach for supporting
KVM guest private memory", which essentially allows KVM to ask the guest memory
backend to allocate pages without having to mmap() them into userspace.  Please
see my reply below:

https://lore.kernel.org/lkml/cover.1649219184.git.kai.huang@intel.com/T/#mf9bf10a63eaaf0968c46ab33bdaf06bd2cfdd948

My understanding is that for TDX guests KVM will be enforced to use the new
mechanism.  So when TDX supports NVDIMM in the future, dax can be extended to
support the new mechanism so that it can be used as a TD guest backend.

Sean, Paolo, Isaku, Chao,

Please correct me if I am wrong?
Kai Huang May 3, 2022, 11:59 p.m. UTC | #25
On Fri, 2022-04-29 at 13:40 +1200, Kai Huang wrote:
> On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote:
> > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> > > On 4/27/22 17:37, Kai Huang wrote:
> > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > > > 
> > > > I thought we could document this in the documentation saying that this code can
> > > > only work on TDX machines that don't have above capabilities (SPR for now).  We
> > > > can change the code and the documentation  when we add the support of those
> > > > features in the future, and update the documentation.
> > > > 
> > > > If 5 years later someone takes this code, he/she should take a look at the
> > > > documentation and figure out that he/she should choose a newer kernel if the
> > > > machine support those features.
> > > > 
> > > > I'll think about design solutions if above doesn't look good for you.
> > > 
> > > No, it doesn't look good to me.
> > > 
> > > You can't just say:
> > > 
> > > 	/*
> > > 	 * This code will eat puppies if used on systems with hotplug.
> > > 	 */
> > > 
> > > and merrily await the puppy bloodbath.
> > > 
> > > If it's not compatible, then you have to *MAKE* it not compatible in a
> > > safe, controlled way.
> > > 
> > > > > You can't just ignore the problems because they're not present on one
> > > > > version of the hardware.
> > > 
> > > Please, please read this again ^^
> > 
> > OK.  I'll think about solutions and come back later.
> > > 
> 
> Hi Dave,
> 
> I think we have two approaches to handle memory hotplug interaction with the TDX
> module initialization.  
> 
> The first approach is simple.  We just block memory from being added as system
> RAM managed by page allocator when the platform supports TDX [1]. It seems we
> can add some arch-specific-check to __add_memory_resource() and reject the new
> memory resource if platform supports TDX.  __add_memory_resource() is called by
> both __add_memory() and add_memory_driver_managed() so it prevents from adding
> NVDIMM as system RAM and normal ACPI memory hotplug [2].

Hi Dave,

Trying to close on how to handle memory hotplug.  Any comments on the below?

For the first approach, I forgot to consider the memory hot-remove case.  If we
just reject adding a new memory resource when the platform is TDX-capable, then
once memory is hot-removed, we won't be able to add it back.  My thinking is
we still want to support memory online/offline because it is purely a software
operation and has nothing to do with TDX.  But if a memory resource can be put
offline, it seems we don't have any way to prevent it from being removed.

So if we do the above, on future platforms where memory hotplug can co-exist
with TDX, both ACPI hot-add and kmem hot-add of memory will be prevented.
However if some memory is hot-removed, it won't be able to be added back (even
if it is included in the CMRs, or in the TDMRs after the TDX module is
initialized).

Is this behavior acceptable?  Or perhaps I have a misunderstanding?

The second approach will behave more nicely, but I don't know whether it is
worth doing now.

Btw, the logic below for adding a new memory resource has a minor problem,
please see below...

> 
> The second approach is relatively more complicated.  Instead of directly
> rejecting the new memory resource in __add_memory_resource(), we check whether
> the memory resource can be added based on CMR and the TDX module initialization
> status.   This is feasible as with the latest public P-SEAMLDR spec, we can get
> CMR from P-SEAMLDR SEAMCALL[3].  So we can detect P-SEAMLDR and get CMR info
> during kernel boots.  And in __add_memory_resource() we do below check:
> 
> 	tdx_init_disable();	/*similar to cpu_hotplug_disable() */
> 	if (tdx_module_initialized())
> 		// reject memory hotplug
> 	else if (new_memory_resource NOT in CMRs)
> 		// reject memory hotplug
> 	else
> 		allow memory hotplug
> 	tdx_init_enable();	/*similar to cpu_hotplug_enable() */

...

Should be:

	/* prevent racing with TDX module initialization */
	tdx_init_disable();

	if (tdx_module_initialized()) {
		if (new_memory_resource in TDMRs)
			// allow memory hot-add
		else
			// reject memory hot-add
	} else if (new_memory_resource in CMRs) {
		// add new memory to TDX memory so it can be
		// included into TDMRs

		// allow memory hot-add
	} else {
		// reject memory hot-add
	}

	tdx_init_enable();

And when the platform doesn't support TDX, always allow memory hot-add.
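
Purely as illustration, a minimal C sketch of how such an arch check could hang
off the hot-add path (all helper names below are hypothetical; it is just the
pseudocode above written out):

	/* Hypothetical arch hook called from the memory hot-add path */
	static int arch_check_new_memory_resource(u64 start, u64 size)
	{
		int ret = 0;

		if (!platform_has_tdx())
			return 0;	/* no TDX: always allow hot-add */

		/* prevent racing with TDX module initialization */
		tdx_init_disable();

		if (tdx_module_initialized()) {
			/* only memory already covered by TDMRs can be added */
			if (!tdx_is_tdmr_range(start, start + size))
				ret = -EINVAL;
		} else if (tdx_is_cmr_range(start, start + size)) {
			/* record it so it gets included into TDMRs later */
			ret = tdx_add_tdx_memory(start, start + size);
		} else {
			/* not convertible: reject the hot-add */
			ret = -EINVAL;
		}

		tdx_init_enable();
		return ret;
	}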


> 
> tdx_init_disable() temporarily disables TDX module initialization by trying to
> grab the mutex.  If the TDX module initialization is already on going, then it
> waits until it completes.
> 
> This should work better for future platforms, but would requires non-trivial
> more code as we need to add VMXON/VMXOFF support to the core-kernel to detect
> CMR using  SEAMCALL.  A side advantage is with VMXON in core-kernel we can
> shutdown the TDX module in kexec().
> 
> But for this series I think the second approach is overkill and we can choose to
> use the first simple approach?
> 
> Any suggestions?
> 
> [1] Platform supports TDX means SEAMRR is enabled, and there are at least 2 TDX
> keyIDs.  Or we can just check SEAMRR is enabled, as in practice a SEAMRR is
> enabled means the machine is TDX-capable, and for now a TDX-capable machine
> doesn't support ACPI memory hotplug.
> 
> [2] It prevents adding legacy PMEM as system RAM too but I think it's fine.  If
> user wants legacy PMEM then it is unlikely user will add it back and use as
> system RAM.  User is unlikely to use legacy PMEM as TD guest memory directly as
> TD guests is likely to use a new memfd backend which allows private page not
> accessible from usrspace, so in this way we can exclude legacy PMEM from TDMRs.
> 
> [3] Please refer to SEAMLDR.SEAMINFO SEAMCALL in latest P-SEAMLDR spec:
> https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
> > > >
Dave Hansen May 4, 2022, 12:25 a.m. UTC | #26
On 5/3/22 16:59, Kai Huang wrote:
> Should be:
> 
> 	// prevent racing with TDX module initialization */
> 	tdx_init_disable();
> 
> 	if (tdx_module_initialized()) {
> 		if (new_memory_resource in TDMRs)
> 			// allow memory hot-add
> 		else
> 			// reject memory hot-add
> 	} else if (new_memory_resource in CMR) {
> 		// add new memory to TDX memory so it can be
> 		// included into TDMRs
> 
> 		// allow memory hot-add
> 	}
> 	else
> 		// reject memory hot-add
> 	
> 	tdx_module_enable();
> 
> And when platform doesn't TDX, always allow memory hot-add.

I don't think it even needs to be *that* complicated.

It could just be winner take all: if TDX is initialized first, don't
allow memory hotplug.  If memory hotplug happens first, don't allow TDX
to be initialized.

That's fine at least for a minimal patch set.

What you have up above is probably where you want to go eventually, but
it means doing things like augmenting the e820 since it's the single
source of truth for creating the TDMRs right now.
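
As a rough sketch of what "winner take all" could look like (all names below
are hypothetical, just to show the shape of the mutual exclusion):

	static DEFINE_MUTEX(tdx_winner_lock);
	static bool tdx_won;		/* TDX module was initialized first */
	static bool hotplug_won;	/* memory was hot-added first */

	/* called from the memory hot-add path */
	int tdx_hotplug_memory_allowed(void)
	{
		int ret = 0;

		mutex_lock(&tdx_winner_lock);
		if (tdx_won)
			ret = -EBUSY;	/* TDX won: refuse hot-add */
		else
			hotplug_won = true;
		mutex_unlock(&tdx_winner_lock);
		return ret;
	}

	/* called at the beginning of TDX module initialization */
	int tdx_claim_winner(void)
	{
		int ret = 0;

		mutex_lock(&tdx_winner_lock);
		if (hotplug_won)
			ret = -EBUSY;	/* hotplug won: refuse TDX init */
		else
			tdx_won = true;
		mutex_unlock(&tdx_winner_lock);
		return ret;
	}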
Kai Huang May 4, 2022, 1:15 a.m. UTC | #27
On Tue, 2022-05-03 at 17:25 -0700, Dave Hansen wrote:
> On 5/3/22 16:59, Kai Huang wrote:
> > Should be:
> > 
> > 	// prevent racing with TDX module initialization */
> > 	tdx_init_disable();
> > 
> > 	if (tdx_module_initialized()) {
> > 		if (new_memory_resource in TDMRs)
> > 			// allow memory hot-add
> > 		else
> > 			// reject memory hot-add
> > 	} else if (new_memory_resource in CMR) {
> > 		// add new memory to TDX memory so it can be
> > 		// included into TDMRs
> > 
> > 		// allow memory hot-add
> > 	}
> > 	else
> > 		// reject memory hot-add
> > 	
> > 	tdx_module_enable();
> > 
> > And when platform doesn't TDX, always allow memory hot-add.
> 
> I don't think it even needs to be *that* complicated.
> 
> It could just be winner take all: if TDX is initialized first, don't
> allow memory hotplug.  If memory hotplug happens first, don't allow TDX
> to be initialized.
> 
> That's fine at least for a minimal patch set.

OK. This should also work.

We will need tdx_init_disable() which grabs the mutex to prevent TDX module
initialization from running concurrently, and to disable TDX module
initialization once for all.
 

> 
> What you have up above is probably where you want to go eventually, but
> it means doing things like augmenting the e820 since it's the single
> source of truth for creating the TMDRs right now.
> 

Yes.  But in this case, I am thinking we should probably switch from
consulting e820 to consulting memblock.  The advantage of using e820 is that
it's easy to include legacy PMEM as TDX memory, but the disadvantage is that
(as you can see in the e820_for_each_mem() loop) I am actually merging
contiguous RAM entries of different types in order to be consistent with the
behavior of e820_memblock_setup().  This is not nice.

If memory hot-add and TDX can only have one winner, legacy PMEM actually won't
be usable as TDX memory anyway now.  The reason is a TDX guest will very likely
need to use the new fd-based backend (see my reply in other emails), not just
some random backend.  To me it's totally fine not to support using legacy PMEM
directly as TD guest backend (and if we create a TD with real NVDIMM as backend
using dax, the TD cannot be created anyway).  Given we cannot kmem hot-add
legacy PMEM back as system RAM, to me it's pointless to include legacy PMEM in
TDMRs.

In this case, we can just create TDMRs based on memblock directly.  One problem
is that memblock is gone after the kernel boots, but this can be solved either
by keeping the memblock around, or by building the TDX memory list early while
memblock is still alive.

Btw, as it's likely we will eventually need to support different sources of TDX
memory (CXL memory, etc), I think we will eventually need some data structures
to represent a TDX memory block, and APIs to add those blocks to the whole TDX
memory, so that TDX memory ranges from different sources can be added before
initializing the TDX module.

	struct tdx_memblock {
		struct list_head list;
		phys_addr_t start;
		phys_addr_t end;
		int nid;
		...
	};

	struct tdx_memory {
		struct list_head tmb_list;
		...
	};

	int tdx_memory_add_memblock(start, end, nid, ...);

And the TDMRs can be created based on 'struct tdx_memory'.

For now, we only need to add memblock to TDX memory.
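
A minimal sketch of what the add helper could look like (illustrative only;
allocation details would depend on how early in boot this runs):

	/* Illustrative only: record one memory block in the TDX memory list */
	int tdx_memory_add_memblock(struct tdx_memory *tmem, phys_addr_t start,
				    phys_addr_t end, int nid)
	{
		struct tdx_memblock *tmb;

		tmb = kzalloc(sizeof(*tmb), GFP_KERNEL);
		if (!tmb)
			return -ENOMEM;

		tmb->start = start;
		tmb->end = end;
		tmb->nid = nid;

		/* real code would keep the list sorted and merge neighbours */
		list_add_tail(&tmb->list, &tmem->tmb_list);
		return 0;
	}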

Any comments?
Dan Williams May 4, 2022, 2:31 p.m. UTC | #28
On Tue, May 3, 2022 at 4:59 PM Kai Huang <kai.huang@intel.com> wrote:
>
> On Fri, 2022-04-29 at 13:40 +1200, Kai Huang wrote:
> > On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote:
> > > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> > > > On 4/27/22 17:37, Kai Huang wrote:
> > > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > > > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > > > >
> > > > > I thought we could document this in the documentation saying that this code can
> > > > > only work on TDX machines that don't have above capabilities (SPR for now).  We
> > > > > can change the code and the documentation  when we add the support of those
> > > > > features in the future, and update the documentation.
> > > > >
> > > > > If 5 years later someone takes this code, he/she should take a look at the
> > > > > documentation and figure out that he/she should choose a newer kernel if the
> > > > > machine support those features.
> > > > >
> > > > > I'll think about design solutions if above doesn't look good for you.
> > > >
> > > > No, it doesn't look good to me.
> > > >
> > > > You can't just say:
> > > >
> > > >   /*
> > > >    * This code will eat puppies if used on systems with hotplug.
> > > >    */
> > > >
> > > > and merrily await the puppy bloodbath.
> > > >
> > > > If it's not compatible, then you have to *MAKE* it not compatible in a
> > > > safe, controlled way.
> > > >
> > > > > > You can't just ignore the problems because they're not present on one
> > > > > > version of the hardware.
> > > >
> > > > Please, please read this again ^^
> > >
> > > OK.  I'll think about solutions and come back later.
> > > >
> >
> > Hi Dave,
> >
> > I think we have two approaches to handle memory hotplug interaction with the TDX
> > module initialization.
> >
> > The first approach is simple.  We just block memory from being added as system
> > RAM managed by page allocator when the platform supports TDX [1]. It seems we
> > can add some arch-specific-check to __add_memory_resource() and reject the new
> > memory resource if platform supports TDX.  __add_memory_resource() is called by
> > both __add_memory() and add_memory_driver_managed() so it prevents from adding
> > NVDIMM as system RAM and normal ACPI memory hotplug [2].
>
> Hi Dave,
>
> Try to close how to handle memory hotplug.  Any comments to below?
>
> For the first approach, I forgot to think about memory hot-remove case.  If we
> just reject adding new memory resource when TDX is capable on the platform, then
> if the memory is hot-removed, we won't be able to add it back.  My thinking is
> we still want to support memory online/offline because it is purely in software
> but has nothing to do with TDX.  But if one memory resource can be put to
> offline, it seems we don't have any way to prevent it from being removed.
>
> So if we do above, on the future platforms when memory hotplug can co-exist with
> TDX, ACPI hot-add and kmem-hot-add memory will be prevented.  However if some
> memory is hot-removed, it won't be able to be added back (even it is included in
> CMR, or TDMRs after TDX module is initialized).
>
> Is this behavior acceptable?  Or perhaps I have misunderstanding?

Memory online at boot uses similar kernel paths as memory-online at
run time, so it sounds like your question is confusing physical vs
logical remove. Consider the case of logical offline then re-online
where the proposed TDX sanity check blocks the memory online, but then
a new kernel is kexec'd and that kernel trusts the memory as TD
convertible again just because it onlines the memory in the boot path.
For physical memory remove it seems up to the platform to block that
if it conflicts with TDX, not for the kernel to add extra assumptions
that logical offline / online is incompatible with TDX.
Kai Huang May 4, 2022, 10:50 p.m. UTC | #29
On Wed, 2022-05-04 at 07:31 -0700, Dan Williams wrote:
> On Tue, May 3, 2022 at 4:59 PM Kai Huang <kai.huang@intel.com> wrote:
> > 
> > On Fri, 2022-04-29 at 13:40 +1200, Kai Huang wrote:
> > > On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote:
> > > > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote:
> > > > > On 4/27/22 17:37, Kai Huang wrote:
> > > > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote:
> > > > > > > In 5 years, if someone takes this code and runs it on Intel hardware
> > > > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens?
> > > > > > 
> > > > > > I thought we could document this in the documentation saying that this code can
> > > > > > only work on TDX machines that don't have above capabilities (SPR for now).  We
> > > > > > can change the code and the documentation  when we add the support of those
> > > > > > features in the future, and update the documentation.
> > > > > > 
> > > > > > If 5 years later someone takes this code, he/she should take a look at the
> > > > > > documentation and figure out that he/she should choose a newer kernel if the
> > > > > > machine support those features.
> > > > > > 
> > > > > > I'll think about design solutions if above doesn't look good for you.
> > > > > 
> > > > > No, it doesn't look good to me.
> > > > > 
> > > > > You can't just say:
> > > > > 
> > > > >   /*
> > > > >    * This code will eat puppies if used on systems with hotplug.
> > > > >    */
> > > > > 
> > > > > and merrily await the puppy bloodbath.
> > > > > 
> > > > > If it's not compatible, then you have to *MAKE* it not compatible in a
> > > > > safe, controlled way.
> > > > > 
> > > > > > > You can't just ignore the problems because they're not present on one
> > > > > > > version of the hardware.
> > > > > 
> > > > > Please, please read this again ^^
> > > > 
> > > > OK.  I'll think about solutions and come back later.
> > > > > 
> > > 
> > > Hi Dave,
> > > 
> > > I think we have two approaches to handle memory hotplug interaction with the TDX
> > > module initialization.
> > > 
> > > The first approach is simple.  We just block memory from being added as system
> > > RAM managed by page allocator when the platform supports TDX [1]. It seems we
> > > can add some arch-specific-check to __add_memory_resource() and reject the new
> > > memory resource if platform supports TDX.  __add_memory_resource() is called by
> > > both __add_memory() and add_memory_driver_managed() so it prevents from adding
> > > NVDIMM as system RAM and normal ACPI memory hotplug [2].
> > 
> > Hi Dave,
> > 
> > Try to close how to handle memory hotplug.  Any comments to below?
> > 
> > For the first approach, I forgot to think about memory hot-remove case.  If we
> > just reject adding new memory resource when TDX is capable on the platform, then
> > if the memory is hot-removed, we won't be able to add it back.  My thinking is
> > we still want to support memory online/offline because it is purely in software
> > but has nothing to do with TDX.  But if one memory resource can be put to
> > offline, it seems we don't have any way to prevent it from being removed.
> > 
> > So if we do above, on the future platforms when memory hotplug can co-exist with
> > TDX, ACPI hot-add and kmem-hot-add memory will be prevented.  However if some
> > memory is hot-removed, it won't be able to be added back (even it is included in
> > CMR, or TDMRs after TDX module is initialized).
> > 
> > Is this behavior acceptable?  Or perhaps I have misunderstanding?
> 
> Memory online at boot uses similar kernel paths as memory-online at
> run time, so it sounds like your question is confusing physical vs
> logical remove. Consider the case of logical offline then re-online
> where the proposed TDX sanity check blocks the memory online, but then
> a new kernel is kexec'd and that kernel again trusts the memory as TD
> convertible again just because it onlines the memory in the boot path.
> For physical memory remove it seems up to the platform to block that
> if it conflicts with TDX, not for the kernel to add extra assumptions
> that logical offline / online is incompatible with TDX.

Hi Dan,

No, we don't block memory online; we block memory add.  The code I mentioned
is add_memory_resource(), while the memory online code path is
memory_block_online().  Or am I wrong?
Kai Huang May 5, 2022, 9:54 a.m. UTC | #30
On Wed, 2022-05-04 at 13:15 +1200, Kai Huang wrote:
> On Tue, 2022-05-03 at 17:25 -0700, Dave Hansen wrote:
> > On 5/3/22 16:59, Kai Huang wrote:
> > > Should be:
> > > 
> > > 	// prevent racing with TDX module initialization */
> > > 	tdx_init_disable();
> > > 
> > > 	if (tdx_module_initialized()) {
> > > 		if (new_memory_resource in TDMRs)
> > > 			// allow memory hot-add
> > > 		else
> > > 			// reject memory hot-add
> > > 	} else if (new_memory_resource in CMR) {
> > > 		// add new memory to TDX memory so it can be
> > > 		// included into TDMRs
> > > 
> > > 		// allow memory hot-add
> > > 	}
> > > 	else
> > > 		// reject memory hot-add
> > > 	
> > > 	tdx_module_enable();
> > > 
> > > And when platform doesn't TDX, always allow memory hot-add.
> > 
> > I don't think it even needs to be *that* complicated.
> > 
> > It could just be winner take all: if TDX is initialized first, don't
> > allow memory hotplug.  If memory hotplug happens first, don't allow TDX
> > to be initialized.
> > 
> > That's fine at least for a minimal patch set.
> 
> OK. This should also work.
> 
> We will need tdx_init_disable() which grabs the mutex to prevent TDX module
> initialization from running concurrently, and to disable TDX module
> initialization once for all.
>  
> 
> > 
> > What you have up above is probably where you want to go eventually, but
> > it means doing things like augmenting the e820 since it's the single
> > source of truth for creating the TMDRs right now.
> > 
> 
> Yes.  But in this case, I am thinking about probably we should switch from
> consulting e820 to consulting memblock.  The advantage of using e820 is it's
> easy to include legacy PMEM as TDX memory, but the disadvantage is (as you can
> see in e820_for_each_mem() loop) I am actually merging contiguous different
> types of RAM entries in order to be consistent with the behavior of
> e820_memblock_setup().  This is not nice.
> 
> If memory hot-add and TDX can only be one winner, legacy PMEM actually won't be
> used as TDX memory anyway now.  The reason is TDX guest will very likely needing
> to use the new fd-based backend (see my reply in other emails), but not just
> some random backend.  To me it's totally fine to not support using legacy PMEM
> directly as TD guest backend (and if we create a TD with real NVDIMM as backend
> using dax, the TD cannot be created anyway).  Given we cannot kmem-hot-add
> legacy PMEM back as system RAM, to me it's pointless to include legacy PMEM into
> TDMRs.
> 
> In this case, we can just create TDMRs based on memblock directly.  One problem
> is memblock will be gone after kernel boots, but this can be solved either by
> keeping the memblock, or build the TDX memory early when memblock is still
> alive.
> 
> Btw, eventually, as it's likely we need to support different source of TDX
> memory (CLX memory, etc), I think eventually we will need some data structures
> to represent TDX memory block and APIs to add those blocks to the whole TDX
> memory so those TDX memory ranges from different source can be added before
> initializing the TDX module.
> 
> 	struct tdx_memblock {
> 		struct list_head list;
> 		phys_addr_t start;
> 		phys_addr_t end;
> 		int nid;
> 		...
> 	};
> 
> 	struct tdx_memory {
> 		struct list_head tmb_list;
> 		...
> 	};
> 
> 	int tdx_memory_add_memblock(start, end, nid, ...);
> 
> And the TDMRs can be created based on 'struct tdx_memory'.
> 
> For now, we only need to add memblock to TDX memory.
> 
> Any comments?
> 

Hi Dave,

Sorry to ping (trying to close this).

Given we don't need to consider kmem hot-add of legacy PMEM after TDX module
initialization, I think for now it's totally fine to exclude legacy PMEMs from
TDMRs.  The worst case is that when someone tries to use them directly as TD
guest backend, the TD will fail to be created.  IMO that's acceptable, as
supposedly no one should just use some random backend to run a TD.

I think, w/o needing to include legacy PMEM, it's better to get all TDX memory
blocks from memblock rather than e820.  The pages managed by the page allocator
come from memblock anyway (excluding those from memory hotplug).

And I also think it makes more sense to introduce 'tdx_memblock' and
'tdx_memory' data structures to gather all TDX memory blocks during boot, when
memblock is still alive.  When the TDX module is initialized at runtime, TDMRs
can be created based on the 'struct tdx_memory' which contains all the TDX
memory blocks we gathered from memblock during boot.  This is also more
flexible for supporting TDX memory from other sources, such as CXL memory, in
the future.
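
As a hedged sketch (assuming the tdx_memblock/tdx_memory structures from my
earlier mail; the function names are placeholders), the boot-time gathering
could be as simple as:

	/*
	 * Sketch only: walk memblock while it is still alive and record every
	 * RAM range as a TDX memory block for later TDMR construction.
	 */
	static int __init tdx_memory_build_from_memblock(struct tdx_memory *tmem)
	{
		unsigned long start_pfn, end_pfn;
		int i, nid, ret;

		for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
			ret = tdx_memory_add_memblock(tmem, PFN_PHYS(start_pfn),
						      PFN_PHYS(end_pfn), nid);
			if (ret)
				return ret;
		}
		return 0;
	}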

Please let me know if you have any objections.  Thanks!
Dan Williams May 5, 2022, 1:51 p.m. UTC | #31
[ add Mike ]


On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
[..]
>
> Hi Dave,
>
> Sorry to ping (trying to close this).
>
> Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> initialization, I think for now it's totally fine to exclude legacy PMEMs from
> TDMRs.  The worst case is when someone tries to use them as TD guest backend
> directly, the TD will fail to create.  IMO it's acceptable, as it is supposedly
> that no one should just use some random backend to run TD.

The platform will already do this, right? I don't understand why this
is trying to take proactive action versus documenting the error
conditions and steps someone needs to take to avoid unconvertible
memory. There is already the CONFIG_HMEM_REPORTING that describes
relative performance properties between initiators and targets, it
seems fitting to also add security properties between initiators and
targets so someone can enumerate the numa-mempolicy that avoids
unconvertible memory.

No special casing in hotplug code paths is needed.
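
For example (purely illustrative; how per-node TDX capability would be
enumerated is left open here, and the nodemask below is assumed to come from
that enumeration), userspace could pin guest memory to convertible nodes with
the existing mempolicy interface:

	#include <numaif.h>
	#include <stddef.h>

	/*
	 * Bind a guest-memory mapping to nodes previously enumerated as
	 * TDX-capable, so the allocator never hands back unconvertible memory.
	 */
	static long bind_to_tdx_nodes(void *addr, size_t len,
				      const unsigned long *tdx_nodemask,
				      unsigned long maxnode)
	{
		return mbind(addr, len, MPOL_BIND, tdx_nodemask, maxnode, 0);
	}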

>
> I think w/o needing to include legacy PMEM, it's better to get all TDX memory
> blocks based on memblock, but not e820.  The pages managed by page allocator are
> from memblock anyway (w/o those from memory hotplug).
>
> And I also think it makes more sense to introduce 'tdx_memblock' and
> 'tdx_memory' data structures to gather all TDX memory blocks during boot when
> memblock is still alive.  When TDX module is initialized during runtime, TDMRs
> can be created based on the 'struct tdx_memory' which contains all TDX memory
> blocks we gathered based on memblock during boot.  This is also more flexible to
> support other TDX memory from other sources such as CLX memory in the future.
>
> Please let me know if you have any objection?  Thanks!

It's already the case that x86 maintains sideband structures to
preserve memory after exiting the early memblock code. Mike, correct
me if I am wrong, but adding more is less desirable than just keeping
the memblock around?
Kai Huang May 5, 2022, 10:14 p.m. UTC | #32
Thanks for feedback!

On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> [ add Mike ]
> 
> 
> On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
> [..]
> > 
> > Hi Dave,
> > 
> > Sorry to ping (trying to close this).
> > 
> > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > TDMRs.  The worst case is when someone tries to use them as TD guest backend
> > directly, the TD will fail to create.  IMO it's acceptable, as it is supposedly
> > that no one should just use some random backend to run TD.
> 
> The platform will already do this, right? 
> 

In the current v3 implementation, we don't have any code to handle memory
hotplug, therefore nothing prevents people from adding legacy PMEMs as system
RAM using the kmem driver.  In order to guarantee that all pages managed by the
page allocator are TDX memory, the v3 implementation needs to always include
legacy PMEMs as TDX memory, so that even if people do add legacy PMEMs as
system RAM, we can still guarantee all pages in the page allocator are TDX
memory.

Of course, a side benefit of always including legacy PMEMs is that people can
theoretically use them directly as TD guest backend, but this is just a bonus,
not something we need to guarantee.


> I don't understand why this
> is trying to take proactive action versus documenting the error
> conditions and steps someone needs to take to avoid unconvertible
> memory. There is already the CONFIG_HMEM_REPORTING that describes
> relative performance properties between initiators and targets, it
> seems fitting to also add security properties between initiators and
> targets so someone can enumerate the numa-mempolicy that avoids
> unconvertible memory.

I don't think there's anything related to performance properties here.  The only
goal here is to make sure all pages in the page allocator are TDX memory pages.

> 
> No, special casing in hotplug code paths needed.
> 
> > 
> > I think w/o needing to include legacy PMEM, it's better to get all TDX memory
> > blocks based on memblock, but not e820.  The pages managed by page allocator are
> > from memblock anyway (w/o those from memory hotplug).
> > 
> > And I also think it makes more sense to introduce 'tdx_memblock' and
> > 'tdx_memory' data structures to gather all TDX memory blocks during boot when
> > memblock is still alive.  When TDX module is initialized during runtime, TDMRs
> > can be created based on the 'struct tdx_memory' which contains all TDX memory
> > blocks we gathered based on memblock during boot.  This is also more flexible to
> > support other TDX memory from other sources such as CLX memory in the future.
> > 
> > Please let me know if you have any objection?  Thanks!
> 
> It's already the case that x86 maintains sideband structures to
> preserve memory after exiting the early memblock code. 
> 

May I ask what data structures you are referring to?

Btw, the purpose of 'tdx_memblock' and 'tdx_memory' is not just to preserve
memblock info during boot.  It also provides a common data structure that the
"constructing TDMRs" code can work on.  If you look at patches 11-14, the logic
(creating TDMRs, allocating PAMTs, setting up reserved areas) around how to
construct TDMRs doesn't have a hard dependency on e820.  If we construct TDMRs
based on a common 'tdx_memory' like below:

	int construct_tdmrs(struct tdx_memory *tmem, ...);

it would be much easier to support other TDX memory resources in the future.
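
i.e. something like the following skeleton (entirely illustrative; the helper
names are just placeholders for the steps patches 11-14 perform today):

	/*
	 * Illustrative skeleton: construct the TDMR array from a generic
	 * 'struct tdx_memory' rather than directly from e820.
	 */
	static int construct_tdmrs(struct tdx_memory *tmem,
				   struct tdmr_info *tdmr_array, int *tdmr_num)
	{
		int ret;

		ret = create_tdmrs(tmem, tdmr_array, tdmr_num);
		if (ret)
			return ret;

		ret = tdmrs_allocate_pamts(tdmr_array, *tdmr_num);
		if (ret)
			return ret;

		return tdmrs_setup_reserved_areas(tmem, tdmr_array, *tdmr_num);
	}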

The thing I am not sure about is that Dave wants to keep the code minimal (as
this series is already very big in terms of LoC) to get TDX running, and for
now in practice only system RAM present at boot is TDX capable, so I am not
sure we should introduce those structures now.

> Mike, correct
> me if I am wrong, but adding more is less desirable than just keeping
> the memblock around?
Dan Williams May 6, 2022, 12:22 a.m. UTC | #33
On Thu, May 5, 2022 at 3:14 PM Kai Huang <kai.huang@intel.com> wrote:
>
> Thanks for feedback!
>
> On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> > [ add Mike ]
> >
> >
> > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
> > [..]
> > >
> > > Hi Dave,
> > >
> > > Sorry to ping (trying to close this).
> > >
> > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > TDMRs.  The worst case is when someone tries to use them as TD guest backend
> > > directly, the TD will fail to create.  IMO it's acceptable, as it is supposedly
> > > that no one should just use some random backend to run TD.
> >
> > The platform will already do this, right?
> >
>
> In the current v3 implementation, we don't have any code to handle memory
> hotplug, therefore nothing prevents people from adding legacy PMEMs as system
> RAM using kmem driver.  In order to guarantee all pages managed by page

That's the fundamental question I am asking: why "guarantee all
pages managed by the page allocator are TDX memory"? That seems overkill
compared to indicating the incompatibility after the fact.

> allocator are all TDX memory, the v3 implementation needs to always include
> legacy PMEMs as TDX memory so that even people truly add  legacy PMEMs as system
> RAM, we can still guarantee all pages in page allocator are TDX memory.

Why?

> Of course, a side benefit of always including legacy PMEMs is people
> theoretically can use them directly as TD guest backend, but this is just a
> bonus but not something that we need to guarantee.
>
>
> > I don't understand why this
> > is trying to take proactive action versus documenting the error
> > conditions and steps someone needs to take to avoid unconvertible
> > memory. There is already the CONFIG_HMEM_REPORTING that describes
> > relative performance properties between initiators and targets, it
> > seems fitting to also add security properties between initiators and
> > targets so someone can enumerate the numa-mempolicy that avoids
> > unconvertible memory.
>
> I don't think there's anything related to performance properties here.  The only
> goal here is to make sure all pages in page allocator are TDX memory pages.

Please reconsider or re-clarify that goal.

>
> >
> > No, special casing in hotplug code paths needed.
> >
> > >
> > > I think w/o needing to include legacy PMEM, it's better to get all TDX memory
> > > blocks based on memblock, but not e820.  The pages managed by page allocator are
> > > from memblock anyway (w/o those from memory hotplug).
> > >
> > > And I also think it makes more sense to introduce 'tdx_memblock' and
> > > 'tdx_memory' data structures to gather all TDX memory blocks during boot when
> > > memblock is still alive.  When TDX module is initialized during runtime, TDMRs
> > > can be created based on the 'struct tdx_memory' which contains all TDX memory
> > > blocks we gathered based on memblock during boot.  This is also more flexible to
> > > support other TDX memory from other sources such as CLX memory in the future.
> > >
> > > Please let me know if you have any objection?  Thanks!
> >
> > It's already the case that x86 maintains sideband structures to
> > preserve memory after exiting the early memblock code.
> >
>
> May I ask what data structures are you referring to?

struct numa_meminfo.

> Btw, the purpose of 'tdx_memblock' and 'tdx_memory' is not only just to preserve
> memblock info during boot.  It is also used to provide a common data structure
> that the "constructing TDMRs" code can work on.  If you look at patch 11-14, the
> logic (create TDMRs, allocate PAMTs, sets up reserved areas) around how to
> construct TDMRs doesn't have hard dependency on e820.  If we construct TDMRs
> based on a common 'tdx_memory' like below:
>
>         int construct_tdmrs(struct tdx_memory *tmem, ...);
>
> It would be much easier to support other TDX memory resources in the future.

"in the future" is a prompt to ask "Why not wait until that future /
need arrives before adding new infrastructure?"

> The thing I am not sure is Dave wants to keep the code minimal (as this series
> is already very big in terms of LoC) to make TDX running, and for now in
> practice there's only system RAM during boot is TDX capable, so I am not sure we
> should introduce those structures now.
>
> > Mike, correct
> > me if I am wrong, but adding more is less desirable than just keeping
> > the memblock around?
Kai Huang May 6, 2022, 12:45 a.m. UTC | #34
On Thu, 2022-05-05 at 17:22 -0700, Dan Williams wrote:
> On Thu, May 5, 2022 at 3:14 PM Kai Huang <kai.huang@intel.com> wrote:
> > 
> > Thanks for feedback!
> > 
> > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> > > [ add Mike ]
> > > 
> > > 
> > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
> > > [..]
> > > > 
> > > > Hi Dave,
> > > > 
> > > > Sorry to ping (trying to close this).
> > > > 
> > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > > TDMRs.  The worst case is when someone tries to use them as TD guest backend
> > > > directly, the TD will fail to create.  IMO it's acceptable, as it is supposedly
> > > > that no one should just use some random backend to run TD.
> > > 
> > > The platform will already do this, right?
> > > 
> > 
> > In the current v3 implementation, we don't have any code to handle memory
> > hotplug, therefore nothing prevents people from adding legacy PMEMs as system
> > RAM using kmem driver.  In order to guarantee all pages managed by page
> 
> That's the fundamental question I am asking why is "guarantee all
> pages managed by page allocator are TDX memory". That seems overkill
> compared to indicating the incompatibility after the fact.

As I explained, the reason is I don't want to modify the page allocator to
distinguish TDX and non-TDX allocations, for instance, by having to add a
ZONE_TDX and a GFP_TDX.

KVM depends on the host's page fault handler to allocate the page.  In fact KVM
only consumes PFNs from the host's page tables.  For now only RAM is TDX
memory.  By guaranteeing all pages in the page allocator are TDX memory, we can
easily use anonymous pages as TD guest memory.  This also allows us to easily
extend shmem to support a new fd-based backend which doesn't require mmap()'ing
TD guest memory into host userspace:

https://lore.kernel.org/kvm/20220310140911.50924-1-chao.p.peng@linux.intel.com/

Also, besides TD guest memory, there are some per-TD control data structures
(which must be TDX memory too) that need to be allocated for each TD.  Normal
memory allocation APIs can be used for such allocations if we guarantee all
pages in the page allocator are TDX memory.
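
For instance, under that guarantee a per-TD control page could simply come from
the normal allocator (a minimal sketch; the SEAMCALL named in the comment is
just an example of where such a page would end up):

	/*
	 * Sketch: allocate a page for a per-TD control structure.  No
	 * ZONE_TDX/GFP_TDX is needed because, under the guarantee, any page
	 * the allocator returns is TDX memory and can be handed to the TDX
	 * module (e.g. via a TDH.MNG.ADDCX SEAMCALL).
	 */
	static struct page *tdx_alloc_control_page(void)
	{
		return alloc_page(GFP_KERNEL | __GFP_ZERO);
	}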

> 
> > allocator are all TDX memory, the v3 implementation needs to always include
> > legacy PMEMs as TDX memory so that even people truly add  legacy PMEMs as system
> > RAM, we can still guarantee all pages in page allocator are TDX memory.
> 
> Why?

If we don't include legacy PMEMs as TDX memory, then after they are hot-added
as system RAM using the kmem driver, the assumption that "all pages in the page
allocator are TDX memory" is broken.  A TD can be killed during runtime.

> 
> > Of course, a side benefit of always including legacy PMEMs is people
> > theoretically can use them directly as TD guest backend, but this is just a
> > bonus but not something that we need to guarantee.
> > 
> > 
> > > I don't understand why this
> > > is trying to take proactive action versus documenting the error
> > > conditions and steps someone needs to take to avoid unconvertible
> > > memory. There is already the CONFIG_HMEM_REPORTING that describes
> > > relative performance properties between initiators and targets, it
> > > seems fitting to also add security properties between initiators and
> > > targets so someone can enumerate the numa-mempolicy that avoids
> > > unconvertible memory.
> > 
> > I don't think there's anything related to performance properties here.  The only
> > goal here is to make sure all pages in page allocator are TDX memory pages.
> 
> Please reconsider or re-clarify that goal.
> 
> > 
> > > 
> > > No, special casing in hotplug code paths needed.
> > > 
> > > > 
> > > > I think w/o needing to include legacy PMEM, it's better to get all TDX memory
> > > > blocks based on memblock, but not e820.  The pages managed by page allocator are
> > > > from memblock anyway (w/o those from memory hotplug).
> > > > 
> > > > And I also think it makes more sense to introduce 'tdx_memblock' and
> > > > 'tdx_memory' data structures to gather all TDX memory blocks during boot when
> > > > memblock is still alive.  When TDX module is initialized during runtime, TDMRs
> > > > can be created based on the 'struct tdx_memory' which contains all TDX memory
> > > > blocks we gathered based on memblock during boot.  This is also more flexible to
> > > > support other TDX memory from other sources such as CLX memory in the future.
> > > > 
> > > > Please let me know if you have any objection?  Thanks!
> > > 
> > > It's already the case that x86 maintains sideband structures to
> > > preserve memory after exiting the early memblock code.
> > > 
> > 
> > May I ask what data structures are you referring to?
> 
> struct numa_meminfo.
> 
> > Btw, the purpose of 'tdx_memblock' and 'tdx_memory' is not only just to preserve
> > memblock info during boot.  It is also used to provide a common data structure
> > that the "constructing TDMRs" code can work on.  If you look at patch 11-14, the
> > logic (create TDMRs, allocate PAMTs, sets up reserved areas) around how to
> > construct TDMRs doesn't have hard dependency on e820.  If we construct TDMRs
> > based on a common 'tdx_memory' like below:
> > 
> >         int construct_tdmrs(struct tdx_memory *tmem, ...);
> > 
> > It would be much easier to support other TDX memory resources in the future.
> 
> "in the future" is a prompt to ask "Why not wait until that future /
> need arrives before adding new infrastructure?"

Fine to me.
Dan Williams May 6, 2022, 1:15 a.m. UTC | #35
On Thu, May 5, 2022 at 5:46 PM Kai Huang <kai.huang@intel.com> wrote:
>
> On Thu, 2022-05-05 at 17:22 -0700, Dan Williams wrote:
> > On Thu, May 5, 2022 at 3:14 PM Kai Huang <kai.huang@intel.com> wrote:
> > >
> > > Thanks for feedback!
> > >
> > > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> > > > [ add Mike ]
> > > >
> > > >
> > > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
> > > > [..]
> > > > >
> > > > > Hi Dave,
> > > > >
> > > > > Sorry to ping (trying to close this).
> > > > >
> > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > > > TDMRs.  The worst case is when someone tries to use them as TD guest backend
> > > > > directly, the TD will fail to create.  IMO it's acceptable, as it is supposedly
> > > > > that no one should just use some random backend to run TD.
> > > >
> > > > The platform will already do this, right?
> > > >
> > >
> > > In the current v3 implementation, we don't have any code to handle memory
> > > hotplug, therefore nothing prevents people from adding legacy PMEMs as system
> > > RAM using kmem driver.  In order to guarantee all pages managed by page
> >
> > That's the fundamental question I am asking why is "guarantee all
> > pages managed by page allocator are TDX memory". That seems overkill
> > compared to indicating the incompatibility after the fact.
>
> As I explained, the reason is I don't want to modify page allocator to
> distinguish TDX and non-TDX allocation, for instance, having to have a ZONE_TDX
> and GFP_TDX.

Right, TDX details do not belong at that level, but it will work
almost all the time if you do nothing to "guarantee" all TDX capable
pages all the time.

> KVM depends on host's page fault handler to allocate the page.  In fact KVM only
> consumes PFN from host's page tables.  For now only RAM is TDX memory.  By
> guaranteeing all pages in page allocator is TDX memory, we can easily use
> anonymous pages as TD guest memory.

Again, TDX capable pages will be the overwhelming default, so why are you
worried about cluttering the memory hotplug path for niche corner
cases?

Consider the fact that end users can break the kernel by specifying
invalid memmap= command line options. The memory hotplug code does not
take any steps to add safety in those cases because there are already
too many ways it can go wrong. TDX is just one more corner case where
the memmap= user needs to be careful. Otherwise, it is up to the
platform firmware to make sure everything in the base memory map is
TDX capable, and then all you need is documentation about the failure
mode when extending "System RAM" beyond that baseline.

> shmem to support a new fd-based backend which doesn't require having to mmap()
> TD guest memory to host userspace:
>
> https://lore.kernel.org/kvm/20220310140911.50924-1-chao.p.peng@linux.intel.com/
>
> Also, besides TD guest memory, there are some per-TD control data structures
> (which must be TDX memory too) need to be allocated for each TD.  Normal memory
> allocation APIs can be used for such allocation if we guarantee all pages in
> page allocator is TDX memory.

You don't need that guarantee, just check it after the fact and fail
if that assertion fails. It should almost always be the case that it
succeeds and if it doesn't then something special is happening with
that system and the end user has effectively opted out of TDX
operation.
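
Something like the following would be enough (sketch only;
tdx_is_convertible_pfn() is a hypothetical helper that consults the CMRs/TDMRs):

	/*
	 * Hypothetical "after the fact" check: instead of restricting hotplug,
	 * fail TD creation when a page turns out not to be convertible.
	 */
	static int tdx_check_guest_page(struct page *page)
	{
		if (!tdx_is_convertible_pfn(page_to_pfn(page)))
			return -EINVAL;	/* caller fails the TD gracefully */
		return 0;
	}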

> > > allocator are all TDX memory, the v3 implementation needs to always include
> > > legacy PMEMs as TDX memory so that even people truly add  legacy PMEMs as system
> > > RAM, we can still guarantee all pages in page allocator are TDX memory.
> >
> > Why?
>
> If we don't include legacy PMEMs as TDX memory, then after they are hot-added as
> system RAM using kmem driver, the assumption of "all pages in page allocator are
> TDX memory" is broken.  A TD can be killed during runtime.

Yes, that is what the end user asked for. If they don't want that to
happen then the policy decision about using kmem needs to be updated
in userspace, not hard code that policy decision towards TDX inside
the kernel.
Kai Huang May 6, 2022, 1:46 a.m. UTC | #36
On Thu, 2022-05-05 at 18:15 -0700, Dan Williams wrote:
> On Thu, May 5, 2022 at 5:46 PM Kai Huang <kai.huang@intel.com> wrote:
> > 
> > On Thu, 2022-05-05 at 17:22 -0700, Dan Williams wrote:
> > > On Thu, May 5, 2022 at 3:14 PM Kai Huang <kai.huang@intel.com> wrote:
> > > > 
> > > > Thanks for feedback!
> > > > 
> > > > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> > > > > [ add Mike ]
> > > > > 
> > > > > 
> > > > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
> > > > > [..]
> > > > > > 
> > > > > > Hi Dave,
> > > > > > 
> > > > > > Sorry to ping (trying to close this).
> > > > > > 
> > > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > > > > TDMRs.  The worst case is when someone tries to use them as TD guest backend
> > > > > > directly, the TD will fail to create.  IMO it's acceptable, as it is supposedly
> > > > > > that no one should just use some random backend to run TD.
> > > > > 
> > > > > The platform will already do this, right?
> > > > > 
> > > > 
> > > > In the current v3 implementation, we don't have any code to handle memory
> > > > hotplug, therefore nothing prevents people from adding legacy PMEMs as system
> > > > RAM using kmem driver.  In order to guarantee all pages managed by page
> > > 
> > > That's the fundamental question I am asking why is "guarantee all
> > > pages managed by page allocator are TDX memory". That seems overkill
> > > compared to indicating the incompatibility after the fact.
> > 
> > As I explained, the reason is I don't want to modify page allocator to
> > distinguish TDX and non-TDX allocation, for instance, having to have a ZONE_TDX
> > and GFP_TDX.
> 
> Right, TDX details do not belong at that level, but it will work
> almost all the time if you do nothing to "guarantee" all TDX capable
> pages all the time.

"almost all the time" do you mean?

> 
> > KVM depends on host's page fault handler to allocate the page.  In fact KVM only
> > consumes PFN from host's page tables.  For now only RAM is TDX memory.  By
> > guaranteeing all pages in page allocator is TDX memory, we can easily use
> > anonymous pages as TD guest memory.
> 
> Again, TDX capable pages will be the overwhelming default, why are you
> worried about cluttering the memory hotplug path for nice corner
> cases.

Firstly, perhaps I forgot to mention that there are two concepts about TDX
memory, so let me clarify first (a simplified sketch follows the list below):

1) Convertible Memory Regions (CMRs).  These are reported by the BIOS (thus
static) to indicate which memory regions *can* be used as TDX memory.  For now
this basically means all RAM present at boot.

2) TD Memory Regions (TDMRs).  Memory pages in CMRs are not automatically
TDX-usable memory.  The TDX module needs to be told which (convertible) memory
regions can be used as TDX memory.  The kernel is responsible for choosing the
ranges and configuring them into the TDX module.  If a convertible memory page
is not included in the TDMRs, the TDX module will report an error when it is
assigned to a TD.
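
A simplified sketch of the two concepts (these are not the spec-defined
layouts, just illustrations of what the kernel has to track):

	/* Sketch: a convertible region as enumerated by the BIOS/TDX module */
	struct cmr_entry {
		u64 base;
		u64 size;
	};

	/*
	 * Sketch: a TD Memory Region the kernel configures into the module,
	 * together with the PAMT metadata that backs it.
	 */
	struct tdmr_entry {
		u64 base;
		u64 size;
		u64 pamt_base;
		u64 pamt_size;
	};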

> 
> Consider the fact that end users can break the kernel by specifying
> invalid memmap= command line options. The memory hotplug code does not
> take any steps to add safety in those cases because there are already
> too many ways it can go wrong. TDX is just one more corner case where
> the memmap= user needs to be careful. Otherwise, it is up to the
> platform firmware to make sure everything in the base memory map is
> TDX capable, and then all you need is documentation about the failure
> mode when extending "System RAM" beyond that baseline.

So the fact is, if we don't include legacy PMEMs in TDMRs, and don't do
anything in memory hotplug, then if a user does kmem hot-add of legacy PMEMs as
system RAM, a live TD may eventually be killed.

If that is a corner case that we don't need to guarantee, then even better.
And then we have an additional reason that those legacy PMEMs don't need to be
in TDMRs.  As you suggested, we can add some documentation to point this out.

But the reason we want to add some code checks and prevent memory hotplug is,
as Dave said, we want this piece of code to work on *ANY* TDX-capable machine,
including future machines which may, for example, support NVDIMM/CXL memory as
TDX memory.  If we don't do any check in memory hotplug in this series, then
when this code runs on future platforms, a user can plug NVDIMM or CXL memory
as system RAM and break the assumption that "all pages in the page allocator
are TDX memory", which can eventually lead to live TDs being killed.

Dave said we need to guarantee this code can work on *ANY* TDX machine.
Documentation saying it only works on some platforms and that you shouldn't do
certain things on other platforms is not good enough:

https://lore.kernel.org/lkml/cover.1649219184.git.kai.huang@intel.com/T/#m6df45b6e1702bb03dcb027044a0dabf30a86e471

> 
> 
> > shmem to support a new fd-based backend which doesn't require having to mmap()
> > TD guest memory to host userspace:
> > 
> > https://lore.kernel.org/kvm/20220310140911.50924-1-chao.p.peng@linux.intel.com/
> > 
> > Also, besides TD guest memory, there are some per-TD control data structures
> > (which must be TDX memory too) need to be allocated for each TD.  Normal memory
> > allocation APIs can be used for such allocation if we guarantee all pages in
> > page allocator is TDX memory.
> 
> You don't need that guarantee, just check it after the fact and fail
> if that assertion fails. It should almost always be the case that it
> succeeds and if it doesn't then something special is happening with
> that system and the end user has effectively opt-ed out of TDX
> operation.

This doesn't guarantee consistent behaviour.  For instance, one TD may be
created successfully, while the second may fail.  We should provide a
consistent service.

The thing is, we need to configure some memory regions into the TDX module
anyway.  To me there's no reason not to guarantee all pages in the page
allocator are TDX memory.

> 
> > > > allocator are all TDX memory, the v3 implementation needs to always include
> > > > legacy PMEMs as TDX memory so that even people truly add  legacy PMEMs as system
> > > > RAM, we can still guarantee all pages in page allocator are TDX memory.
> > > 
> > > Why?
> > 
> > If we don't include legacy PMEMs as TDX memory, then after they are hot-added as
> > system RAM using kmem driver, the assumption of "all pages in page allocator are
> > TDX memory" is broken.  A TD can be killed during runtime.
> 
> Yes, that is what the end user asked for. If they don't want that to
> happen then the policy decision about using kmem needs to be updated
> in userspace, not hard code that policy decision towards TDX inside
> the kernel.

This is also fine with me.  But please also see Dave's comment above.

Thanks for the valuable feedback!
Dan Williams May 6, 2022, 3:57 p.m. UTC | #37
On Thu, May 5, 2022 at 6:47 PM Kai Huang <kai.huang@intel.com> wrote:
>
> On Thu, 2022-05-05 at 18:15 -0700, Dan Williams wrote:
> > On Thu, May 5, 2022 at 5:46 PM Kai Huang <kai.huang@intel.com> wrote:
> > >
> > > On Thu, 2022-05-05 at 17:22 -0700, Dan Williams wrote:
> > > > On Thu, May 5, 2022 at 3:14 PM Kai Huang <kai.huang@intel.com> wrote:
> > > > >
> > > > > Thanks for feedback!
> > > > >
> > > > > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> > > > > > [ add Mike ]
> > > > > >
> > > > > >
> > > > > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
> > > > > > [..]
> > > > > > >
> > > > > > > Hi Dave,
> > > > > > >
> > > > > > > Sorry to ping (trying to close this).
> > > > > > >
> > > > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > > > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > > > > > TDMRs.  The worst case is when someone tries to use them as TD guest backend
> > > > > > > directly, the TD will fail to create.  IMO it's acceptable, as it is supposedly
> > > > > > > that no one should just use some random backend to run TD.
> > > > > >
> > > > > > The platform will already do this, right?
> > > > > >
> > > > >
> > > > > In the current v3 implementation, we don't have any code to handle memory
> > > > > hotplug, therefore nothing prevents people from adding legacy PMEMs as system
> > > > > RAM using kmem driver.  In order to guarantee all pages managed by page
> > > >
> > > > That's the fundamental question I am asking why is "guarantee all
> > > > pages managed by page allocator are TDX memory". That seems overkill
> > > > compared to indicating the incompatibility after the fact.
> > >
> > > As I explained, the reason is I don't want to modify page allocator to
> > > distinguish TDX and non-TDX allocation, for instance, having to have a ZONE_TDX
> > > and GFP_TDX.
> >
> > Right, TDX details do not belong at that level, but it will work
> > almost all the time if you do nothing to "guarantee" all TDX capable
> > pages all the time.
>
> "almost all the time" do you mean?
>
> >
> > > KVM depends on host's page fault handler to allocate the page.  In fact KVM only
> > > consumes PFN from host's page tables.  For now only RAM is TDX memory.  By
> > > guaranteeing all pages in page allocator is TDX memory, we can easily use
> > > anonymous pages as TD guest memory.
> >
> > Again, TDX capable pages will be the overwhelming default, why are you
> > worried about cluttering the memory hotplug path for nice corner
> > cases.
>
> Firstly perhaps I forgot to mention there are two concepts about TDX memory, so
> let me clarify first:
>
> 1) Convertible Memory Regions (CMRs).  This is reported by BIOS (thus static) to
> indicate which memory regions *can* be used as TDX memory.  This basically means
> all RAM during boot for now.
>
> 2) TD Memory Regions (TDMRs).  Memory pages in CMRs are not automatically TDX
> usable memory.  The TDX module needs to be configured which (convertible) memory
> regions can be used as TDX memory.  Kernel is responsible for choosing the
> ranges, and configure to the TDX module.  If a convertible memory page is not
> included into TDMRs, the TDX module will report error when it is assigned to  a
> TD.
>
> >
> > Consider the fact that end users can break the kernel by specifying
> > invalid memmap= command line options. The memory hotplug code does not
> > take any steps to add safety in those cases because there are already
> > too many ways it can go wrong. TDX is just one more corner case where
> > the memmap= user needs to be careful. Otherwise, it is up to the
> > platform firmware to make sure everything in the base memory map is
> > TDX capable, and then all you need is documentation about the failure
> > mode when extending "System RAM" beyond that baseline.
>
> So the fact is, if we don't include legacy PMEMs into TDMRs, and don't do
> anything in memory hotplug, then if user does kmem-hot-add legacy PMEMs as
> system RAM, a live TD may eventually be killed.
>
> If such case is a corner case that we don't need to guarantee, then even better.
> And we have an additional reason that those legacy PMEMs don't need to be in
> TDMRs.  As you suggested,  we can add some documentation to point out.
>
> But the point we want to do some code check and prevent memory hotplug is, as
> Dave said, we want this piece of code to work on *ANY* TDX capable machines,
> including future machines which may, i.e. supports NVDIMM/CLX memory as TDX
> memory.  If we don't do any code check in  memory hotplug in this series, then
> when this code runs in future platforms, user can plug NVDIMM or CLX memory as
> system RAM thus break the assumption "all pages in page allocator are TDX
> memory", which eventually leads to live TDs being killed potentially.
>
> Dave said we need to guarantee this code can work on *ANY* TDX machines.  Some
> documentation saying it only works one some platforms and you shouldn't do
> things on other platforms are not good enough:
>
> https://lore.kernel.org/lkml/cover.1649219184.git.kai.huang@intel.com/T/#m6df45b6e1702bb03dcb027044a0dabf30a86e471

Yes, the incompatible cases cannot be ignored, but I disagree that
they actively need to be prevented. One way to achieve that is to
explicitly enumerate TDX capable memory and document how mempolicy can
be used to avoid killing TDs.

> > > shmem to support a new fd-based backend which doesn't require having to mmap()
> > > TD guest memory to host userspace:
> > >
> > > https://lore.kernel.org/kvm/20220310140911.50924-1-chao.p.peng@linux.intel.com/
> > >
> > > Also, besides TD guest memory, there are some per-TD control data structures
> > > (which must be TDX memory too) that need to be allocated for each TD.  Normal
> > > memory allocation APIs can be used for such allocations if we guarantee all
> > > pages in the page allocator are TDX memory.
> >
> > You don't need that guarantee, just check it after the fact and fail
> > if that assertion fails. It should almost always be the case that it
> > succeeds and if it doesn't then something special is happening with
> > that system and the end user has effectively opted out of TDX
> > operation.
>
> This doesn't guarantee consistent behaviour.  For instance, one TD may be
> created successfully, while a second may fail.  We should provide a consistent
> service.

Yes, there needs to be enumeration and policy knobs to avoid failures;
hard-coded "no memory hotplug" hacks do not seem like the right enumeration
and policy knobs to me.

> The thing is, we need to configure some memory regions in the TDX module
> anyway.  To me there's no reason not to guarantee that all pages in the page
> allocator are TDX memory.
>
> >
> > > > > allocator are all TDX memory, the v3 implementation needs to always include
> > > > > legacy PMEMs as TDX memory so that even if people really add legacy PMEMs as
> > > > > system RAM, we can still guarantee all pages in the page allocator are TDX memory.
> > > >
> > > > Why?
> > >
> > > If we don't include legacy PMEMs as TDX memory, then after they are hot-added as
> > > system RAM using kmem driver, the assumption of "all pages in page allocator are
> > > TDX memory" is broken.  A TD can be killed during runtime.
> >
> > Yes, that is what the end user asked for. If they don't want that to
> > happen then the policy decision about using kmem needs to be updated
> > in userspace, not hard code that policy decision towards TDX inside
> > the kernel.
>
> This is also fine to me.  But please also see above Dave's comment.

Dave is right, the implementation cannot just ignore the conflict. To
me, enumeration plus error reporting allows for flexibility without
hard coding policy in the kernel.
Mike Rapoport May 7, 2022, 12:09 a.m. UTC | #38
On Thu, May 05, 2022 at 06:51:20AM -0700, Dan Williams wrote:
> [ add Mike ]
> 
> On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
> [..]
> >
> > Hi Dave,
> >
> > Sorry to ping (trying to close this).
> >
> > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > TDMRs.  The worst case is that when someone tries to use them as TD guest backend
> > directly, the TD will fail to be created.  IMO that's acceptable, as supposedly
> > no one should just use some random backend to run a TD.
> 
> The platform will already do this, right? I don't understand why this
> is trying to take proactive action versus documenting the error
> conditions and steps someone needs to take to avoid unconvertible
> memory. There is already the CONFIG_HMEM_REPORTING that describes
> relative performance properties between initiators and targets, it
> seems fitting to also add security properties between initiators and
> targets so someone can enumerate the numa-mempolicy that avoids
> unconvertible memory.
> 
> No special casing in hotplug code paths is needed.
> 
> >
> > I think w/o needing to include legacy PMEM, it's better to get all TDX memory
> > blocks based on memblock, but not e820.  The pages managed by page allocator are
> > from memblock anyway (w/o those from memory hotplug).
> >
> > And I also think it makes more sense to introduce 'tdx_memblock' and
> > 'tdx_memory' data structures to gather all TDX memory blocks during boot when
> > memblock is still alive.  When TDX module is initialized during runtime, TDMRs
> > can be created based on the 'struct tdx_memory' which contains all TDX memory
> > blocks we gathered based on memblock during boot.  This is also more flexible to
> > support other TDX memory from other sources such as CXL memory in the future.
> >
> > Please let me know if you have any objection?  Thanks!
> 
> It's already the case that x86 maintains sideband structures to
> preserve memory after exiting the early memblock code. Mike, correct
> me if I am wrong, but adding more is less desirable than just keeping
> the memblock around?

TBH, I didn't read the entire thread yet, but at first glance, keeping
memblock around is much more preferable than adding yet another { .start,
.end, .flags } data structure. To keep memblock after boot, all that is needed
is something like

	select ARCH_KEEP_MEMBLOCK if INTEL_TDX_HOST

I'll take a closer look next week on the entire series, maybe I'm missing
some details.
Kai Huang May 8, 2022, 10 a.m. UTC | #39
On Fri, 2022-05-06 at 20:09 -0400, Mike Rapoport wrote:
> On Thu, May 05, 2022 at 06:51:20AM -0700, Dan Williams wrote:
> > [ add Mike ]
> > 
> > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
> > [..]
> > > 
> > > Hi Dave,
> > > 
> > > Sorry to ping (trying to close this).
> > > 
> > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > TDMRs.  The worst case is that when someone tries to use them as TD guest backend
> > > directly, the TD will fail to be created.  IMO that's acceptable, as supposedly
> > > no one should just use some random backend to run a TD.
> > 
> > The platform will already do this, right? I don't understand why this
> > is trying to take proactive action versus documenting the error
> > conditions and steps someone needs to take to avoid unconvertible
> > memory. There is already the CONFIG_HMEM_REPORTING that describes
> > relative performance properties between initiators and targets, it
> > seems fitting to also add security properties between initiators and
> > targets so someone can enumerate the numa-mempolicy that avoids
> > unconvertible memory.
> > 
> > No special casing in hotplug code paths is needed.
> > 
> > > 
> > > I think w/o needing to include legacy PMEM, it's better to get all TDX memory
> > > blocks based on memblock, but not e820.  The pages managed by page allocator are
> > > from memblock anyway (w/o those from memory hotplug).
> > > 
> > > And I also think it makes more sense to introduce 'tdx_memblock' and
> > > 'tdx_memory' data structures to gather all TDX memory blocks during boot when
> > > memblock is still alive.  When TDX module is initialized during runtime, TDMRs
> > > can be created based on the 'struct tdx_memory' which contains all TDX memory
> > > blocks we gathered based on memblock during boot.  This is also more flexible to
> > > support other TDX memory from other sources such as CXL memory in the future.
> > > 
> > > Please let me know if you have any objection?  Thanks!
> > 
> > It's already the case that x86 maintains sideband structures to
> > preserve memory after exiting the early memblock code. Mike, correct
> > me if I am wrong, but adding more is less desirable than just keeping
> > the memblock around?
> 
> TBH, I didn't read the entire thread yet, but at the first glance, keeping
> memblock around is much more preferable that adding yet another { .start,
> .end, .flags } data structure. To keep memblock after boot all is needed is
> something like
> 
> 	select ARCH_KEEP_MEMBLOCK if INTEL_TDX_HOST
> 
> I'll take a closer look next week on the entire series, maybe I'm missing
> some details.
> 

Hi Mike,

Thanks for feedback.

Perhaps I haven't given a lot of details about the new TDX data structures, so
let me point out that the two new data structures 'struct tdx_memblock' and
'struct tdx_memory' that I am proposing are mostly supposed to be used by TDX
code only, which is pretty standalone.  They are not supposed to be some basic
infrastructure that can be widely used by other random kernel components.

In fact, currently the only operation we need is to allow memblock to register
all memory regions as TDX memory blocks when the memblock is still alive. 
Therefore, in fact, the new data structures can even be completely invisible to
other kernel components.  For instance, TDX code can provide below API w/o
exposing any data structures to other kernel components:

int tdx_add_memory_block(phys_addr_t start, phys_addr_t end, int nid);

And we call above API for each memory region in memblock when it is alive.

TDX code internally manages those memory regions via the new data structures
that I mentioned above, so we don't need to keep memblock after boot.  The
advantage of this approach is that it is more flexible to support other potential
TDX memory resources (such as CXL memory) in the future.

Otherwise, we can do as you suggested to select ARCH_KEEP_MEMBLOCK when
INTEL_TDX_HOST is on and TDX code internally uses memblock API directly.
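
Just to illustrate the shape of the first option, below is a rough sketch of
what the internal-only data structures and registration helper could look like.
All names, fields and details are tentative, and the sketch assumes it runs
late enough in boot that slab is available:

/* Needs <linux/list.h>, <linux/slab.h>, <linux/types.h>. */

/*
 * Rough sketch only -- internal to TDX code, never exposed to other
 * kernel components.
 */
struct tdx_memblock {
	struct list_head list;
	phys_addr_t start;
	phys_addr_t end;
	int nid;
};

struct tdx_memory {
	struct list_head mb_list;	/* list of struct tdx_memblock */
};

static struct tdx_memory tdx_mem = {
	.mb_list = LIST_HEAD_INIT(tdx_mem.mb_list),
};

/*
 * Record one memblock memory region as a TDX memory block, to be used
 * later when constructing TDMRs.  Called once per memblock region while
 * memblock is still alive.
 */
int tdx_add_memory_block(phys_addr_t start, phys_addr_t end, int nid)
{
	struct tdx_memblock *tmb;

	tmb = kzalloc(sizeof(*tmb), GFP_KERNEL);
	if (!tmb)
		return -ENOMEM;

	tmb->start = start;
	tmb->end = end;
	tmb->nid = nid;

	list_add_tail(&tmb->list, &tdx_mem.mb_list);
	return 0;
}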
Kai Huang May 9, 2022, 2:46 a.m. UTC | #40
On Fri, 2022-05-06 at 08:57 -0700, Dan Williams wrote:
> On Thu, May 5, 2022 at 6:47 PM Kai Huang <kai.huang@intel.com> wrote:
> > 
> > On Thu, 2022-05-05 at 18:15 -0700, Dan Williams wrote:
> > > On Thu, May 5, 2022 at 5:46 PM Kai Huang <kai.huang@intel.com> wrote:
> > > > 
> > > > On Thu, 2022-05-05 at 17:22 -0700, Dan Williams wrote:
> > > > > On Thu, May 5, 2022 at 3:14 PM Kai Huang <kai.huang@intel.com> wrote:
> > > > > > 
> > > > > > Thanks for feedback!
> > > > > > 
> > > > > > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote:
> > > > > > > [ add Mike ]
> > > > > > > 
> > > > > > > 
> > > > > > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
> > > > > > > [..]
> > > > > > > > 
> > > > > > > > Hi Dave,
> > > > > > > > 
> > > > > > > > Sorry to ping (trying to close this).
> > > > > > > > 
> > > > > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > > > > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > > > > > > TDMRs.  The worst case is that when someone tries to use them as TD guest backend
> > > > > > > > directly, the TD will fail to be created.  IMO that's acceptable, as supposedly
> > > > > > > > no one should just use some random backend to run a TD.
> > > > > > > 
> > > > > > > The platform will already do this, right?
> > > > > > > 
> > > > > > 
> > > > > > In the current v3 implementation, we don't have any code to handle memory
> > > > > > hotplug, therefore nothing prevents people from adding legacy PMEMs as system
> > > > > > RAM using kmem driver.  In order to guarantee all pages managed by page
> > > > > 
> > > > > That's the fundamental question I am asking why is "guarantee all
> > > > > pages managed by page allocator are TDX memory". That seems overkill
> > > > > compared to indicating the incompatibility after the fact.
> > > > 
> > > > As I explained, the reason is I don't want to modify page allocator to
> > > > distinguish TDX and non-TDX allocation, for instance, having to have a ZONE_TDX
> > > > and GFP_TDX.
> > > 
> > > Right, TDX details do not belong at that level, but it will work
> > > almost all the time if you do nothing to "guarantee" all TDX capable
> > > pages all the time.
> > 
> > "almost all the time" do you mean?
> > 
> > > 
> > > > KVM depends on host's page fault handler to allocate the page.  In fact KVM only
> > > > consumes PFNs from the host's page tables.  For now only RAM is TDX memory.  By
> > > > guaranteeing all pages in the page allocator are TDX memory, we can easily use
> > > > anonymous pages as TD guest memory.
> > > 
> > > Again, TDX capable pages will be the overwhelming default, so why are you
> > > worried about cluttering the memory hotplug path for niche corner
> > > cases?
> > 
> > Firstly perhaps I forgot to mention there are two concepts about TDX memory, so
> > let me clarify first:
> > 
> > 1) Convertible Memory Regions (CMRs).  This is reported by BIOS (thus static) to
> > indicate which memory regions *can* be used as TDX memory.  This basically means
> > all RAM during boot for now.
> > 
> > 2) TD Memory Regions (TDMRs).  Memory pages in CMRs are not automatically TDX
> > usable memory.  The TDX module needs to be told which (convertible) memory
> > regions can be used as TDX memory.  The kernel is responsible for choosing the
> > ranges and configuring them in the TDX module.  If a convertible memory page is
> > not included in any TDMR, the TDX module will report an error when the page is
> > assigned to a TD.
> > 
> > > 
> > > Consider the fact that end users can break the kernel by specifying
> > > invalid memmap= command line options. The memory hotplug code does not
> > > take any steps to add safety in those cases because there are already
> > > too many ways it can go wrong. TDX is just one more corner case where
> > > the memmap= user needs to be careful. Otherwise, it is up to the
> > > platform firmware to make sure everything in the base memory map is
> > > TDX capable, and then all you need is documentation about the failure
> > > mode when extending "System RAM" beyond that baseline.
> > 
> > So the fact is, if we don't include legacy PMEMs in TDMRs, and don't do
> > anything in memory hotplug, then if a user does kmem-hot-add legacy PMEMs as
> > system RAM, a live TD may eventually be killed.
> >
> > If such a case is a corner case that we don't need to guard against, then even
> > better.  And we have an additional reason that those legacy PMEMs don't need to
> > be in TDMRs.  As you suggested, we can add some documentation to point this out.
> >
> > But the reason we want to do some code check and prevent memory hotplug is, as
> > Dave said, we want this piece of code to work on *ANY* TDX capable machine,
> > including future machines which may, e.g., support NVDIMM/CXL memory as TDX
> > memory.  If we don't do any code check in memory hotplug in this series, then
> > when this code runs on future platforms, a user can plug NVDIMM or CXL memory as
> > system RAM and thus break the assumption "all pages in the page allocator are
> > TDX memory", which can eventually lead to live TDs being killed.
> >
> > Dave said we need to guarantee this code can work on *ANY* TDX machine.  Some
> > documentation saying it only works on some platforms and that you shouldn't do
> > things on other platforms is not good enough:
> > 
> > https://lore.kernel.org/lkml/cover.1649219184.git.kai.huang@intel.com/T/#m6df45b6e1702bb03dcb027044a0dabf30a86e471
> 
> Yes, the incompatible cases cannot be ignored, but I disagree that
> they actively need to be prevented. One way to achieve that is to
> explicitly enumerate TDX capable memory and document how mempolicy can
> be used to avoid killing TDs.

Hi Dan,

Thanks for feedback.

Could you elaborate on what "explicitly enumerate TDX capable memory" means?
How would the enumeration work exactly?

And for "document how mempolicy can be used to avoid killing TDs", what
mempolicy (and what error reporting, which you mentioned below) are you
referring to?

I skipped replying to your two replies below as I think they refer to the same
"enumerate" and "mempolicy" that I am asking about above.

> 
> > > > shmem to support a new fd-based backend which doesn't require having to mmap()
> > > > TD guest memory to host userspace:
> > > > 
> > > > https://lore.kernel.org/kvm/20220310140911.50924-1-chao.p.peng@linux.intel.com/
> > > > 
> > > > Also, besides TD guest memory, there are some per-TD control data structures
> > > > (which must be TDX memory too) that need to be allocated for each TD.  Normal
> > > > memory allocation APIs can be used for such allocations if we guarantee all
> > > > pages in the page allocator are TDX memory.
> > > 
> > > You don't need that guarantee, just check it after the fact and fail
> > > if that assertion fails. It should almost always be the case that it
> > > succeeds and if it doesn't then something special is happening with
> > > that system and the end user has effectively opted out of TDX
> > > operation.
> > 
> > This doesn't guarantee consistent behaviour.  For instance, one TD may be
> > created successfully, while a second may fail.  We should provide a consistent
> > service.
> 
> Yes, there needs to be enumeration and policy knobs to avoid failures;
> hard-coded "no memory hotplug" hacks do not seem like the right enumeration
> and policy knobs to me.
> 
> > The thing is, we need to configure some memory regions in the TDX module
> > anyway.  To me there's no reason not to guarantee that all pages in the page
> > allocator are TDX memory.
> > 
> > > 
> > > > > > allocator are all TDX memory, the v3 implementation needs to always include
> > > > > > legacy PMEMs as TDX memory so that even if people really add legacy PMEMs as
> > > > > > system RAM, we can still guarantee all pages in the page allocator are TDX memory.
> > > > > 
> > > > > Why?
> > > > 
> > > > If we don't include legacy PMEMs as TDX memory, then after they are hot-added as
> > > > system RAM using kmem driver, the assumption of "all pages in page allocator are
> > > > TDX memory" is broken.  A TD can be killed during runtime.
> > > 
> > > Yes, that is what the end user asked for. If they don't want that to
> > > happen then the policy decision about using kmem needs to be updated
> > > in userspace, not hard code that policy decision towards TDX inside
> > > the kernel.
> > 
> > This is also fine to me.  But please also see above Dave's comment.
> 
> Dave is right, the implementation cannot just ignore the conflict. To
> me, enumeration plus error reporting allows for flexibility without
> hard coding policy in the kernel.
Mike Rapoport May 9, 2022, 10:33 a.m. UTC | #41
On Sun, May 08, 2022 at 10:00:39PM +1200, Kai Huang wrote:
> On Fri, 2022-05-06 at 20:09 -0400, Mike Rapoport wrote:
> > On Thu, May 05, 2022 at 06:51:20AM -0700, Dan Williams wrote:
> > > [ add Mike ]
> > > 
> > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote:
> > > [..]
> > > > 
> > > > Hi Dave,
> > > > 
> > > > Sorry to ping (trying to close this).
> > > > 
> > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module
> > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from
> > > > TDMRs.  The worst case is that when someone tries to use them as TD guest backend
> > > > directly, the TD will fail to be created.  IMO that's acceptable, as supposedly
> > > > no one should just use some random backend to run a TD.
> > > 
> > > The platform will already do this, right? I don't understand why this
> > > is trying to take proactive action versus documenting the error
> > > conditions and steps someone needs to take to avoid unconvertible
> > > memory. There is already the CONFIG_HMEM_REPORTING that describes
> > > relative performance properties between initiators and targets, it
> > > seems fitting to also add security properties between initiators and
> > > targets so someone can enumerate the numa-mempolicy that avoids
> > > unconvertible memory.
> > > 
> > > No special casing in hotplug code paths is needed.
> > > 
> > > > 
> > > > I think w/o needing to include legacy PMEM, it's better to get all TDX memory
> > > > blocks based on memblock, but not e820.  The pages managed by page allocator are
> > > > from memblock anyway (w/o those from memory hotplug).
> > > > 
> > > > And I also think it makes more sense to introduce 'tdx_memblock' and
> > > > 'tdx_memory' data structures to gather all TDX memory blocks during boot when
> > > > memblock is still alive.  When TDX module is initialized during runtime, TDMRs
> > > > can be created based on the 'struct tdx_memory' which contains all TDX memory
> > > > blocks we gathered based on memblock during boot.  This is also more flexible to
> > > > support other TDX memory from other sources such as CXL memory in the future.
> > > > 
> > > > Please let me know if you have any objection?  Thanks!
> > > 
> > > It's already the case that x86 maintains sideband structures to
> > > preserve memory after exiting the early memblock code. Mike, correct
> > > me if I am wrong, but adding more is less desirable than just keeping
> > > the memblock around?
> > 
> TBH, I didn't read the entire thread yet, but at first glance, keeping
> memblock around is much more preferable than adding yet another { .start,
> .end, .flags } data structure. To keep memblock after boot, all that is needed
> is something like
> > 
> > 	select ARCH_KEEP_MEMBLOCK if INTEL_TDX_HOST
> > 
> > I'll take a closer look next week on the entire series, maybe I'm missing
> > some details.
> > 
> 
> Hi Mike,
> 
> Thanks for feedback.
> 
> Perhaps I haven't given a lot of details about the new TDX data structures, so
> let me point out that the two new data structures 'struct tdx_memblock' and
> 'struct tdx_memory' that I am proposing are mostly supposed to be used by TDX
> code only, which is pretty standalone.  They are not supposed to be some basic
> infrastructure that can be widely used by other random kernel components.

We already have "pretty standalone" numa_meminfo that originally was used
to setup NUMA memory topology, but now it's used by other code as well.
And e820 tables also contain similar data and they are supposedly should be
used only at boot time, but in reality there are too much callbacks into
e820 way after the system is booted.

So any additional memory representation will only add to the overall
complexity and well have even more "eventually consistent" collections of 
{ .start, .end, .flags } structures.
 
> In fact, currently the only operation we need is to allow memblock to register
> all memory regions as TDX memory blocks when the memblock is still alive. 
> Therefore, in fact, the new data structures can even be completely invisible to
> other kernel components.  For instance, TDX code can provide below API w/o
> exposing any data structures to other kernel components:
> 
> int tdx_add_memory_block(phys_addr_t start, phys_addr_t end, int nid);
> 
> And we call above API for each memory region in memblock when it is alive.
> 
> TDX code internally manages those memory regions via the new data structures
> that I mentioned above, so we don't need to keep memblock after boot.  The
> advantage of this approach is that it is more flexible to support other potential
> TDX memory resources (such as CXL memory) in the future.

Please let's keep things simple. If other TDX memory resources need different
handling, it can be implemented then. For now, just enable ARCH_KEEP_MEMBLOCK
and use memblock to track TDX memory.
 
> Otherwise, we can do as you suggested to select ARCH_KEEP_MEMBLOCK when
> INTEL_TDX_HOST is on and TDX code internally uses memblock API directly.
> 
> -- 
> Thanks,
> -Kai
> 
>
Kai Huang May 9, 2022, 11:27 p.m. UTC | #42
> > 
> > Hi Mike,
> > 
> > Thanks for feedback.
> > 
> > Perhaps I haven't given a lot of details about the new TDX data structures, so
> > let me point out that the two new data structures 'struct tdx_memblock' and
> > 'struct tdx_memory' that I am proposing are mostly supposed to be used by TDX
> > code only, which is pretty standalone.  They are not supposed to be some basic
> > infrastructure that can be widely used by other random kernel components.
> 
> We already have the "pretty standalone" numa_meminfo that was originally used
> to set up the NUMA memory topology, but now it's used by other code as well.
> And e820 tables also contain similar data; they supposedly should be used only
> at boot time, but in reality there are too many callbacks into e820 well after
> the system is booted.
>
> So any additional memory representation will only add to the overall
> complexity, and we'll have even more "eventually consistent" collections of
> { .start, .end, .flags } structures.
>  
> > In fact, currently the only operation we need is to allow memblock to register
> > all memory regions as TDX memory blocks when the memblock is still alive. 
> > Therefore, in fact, the new data structures can even be completely invisible to
> > other kernel components.  For instance, TDX code can provide below API w/o
> > exposing any data structures to other kernel components:
> > 
> > int tdx_add_memory_block(phys_addr_t start, phys_addr_t end, int nid);
> > 
> > And we call above API for each memory region in memblock when it is alive.
> > 
> > TDX code internally manages those memory regions via the new data structures
> > that I mentioned above, so we don't need to keep memblock after boot.  The
> > advantage of this approach is that it is more flexible to support other potential
> > TDX memory resources (such as CXL memory) in the future.
> 
> Please let's keep things simple. If other TDX memory resources need different
> handling, it can be implemented then. For now, just enable ARCH_KEEP_MEMBLOCK
> and use memblock to track TDX memory.
>  

Looks good to me.  Thanks for the feedback.
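
E.g. with ARCH_KEEP_MEMBLOCK selected, at TDX module initialization time the
TDX code could walk memblock directly, roughly like below (just a sketch;
tdx_add_tdmr_range() is a made-up placeholder for the TDMR construction logic):

#include <linux/memblock.h>

/* Sketch only: feed every memblock memory region into TDMR construction. */
static int build_tdx_memory(void)
{
	phys_addr_t start, end;
	u64 i;
	int ret;

	for_each_mem_range(i, &start, &end) {
		/* tdx_add_tdmr_range() is a made-up placeholder. */
		ret = tdx_add_tdmr_range(start, end);
		if (ret)
			return ret;
	}

	return 0;
}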
Kai Huang May 10, 2022, 10:25 a.m. UTC | #43
> > > 
> > > > 
> > > > Consider the fact that end users can break the kernel by specifying
> > > > invalid memmap= command line options. The memory hotplug code does not
> > > > take any steps to add safety in those cases because there are already
> > > > too many ways it can go wrong. TDX is just one more corner case where
> > > > the memmap= user needs to be careful. Otherwise, it is up to the
> > > > platform firmware to make sure everything in the base memory map is
> > > > TDX capable, and then all you need is documentation about the failure
> > > > mode when extending "System RAM" beyond that baseline.
> > > 
> > > So the fact is, if we don't include legacy PMEMs in TDMRs, and don't do
> > > anything in memory hotplug, then if a user does kmem-hot-add legacy PMEMs as
> > > system RAM, a live TD may eventually be killed.
> > >
> > > If such a case is a corner case that we don't need to guard against, then even
> > > better.  And we have an additional reason that those legacy PMEMs don't need to
> > > be in TDMRs.  As you suggested, we can add some documentation to point this out.
> > >
> > > But the reason we want to do some code check and prevent memory hotplug is, as
> > > Dave said, we want this piece of code to work on *ANY* TDX capable machine,
> > > including future machines which may, e.g., support NVDIMM/CXL memory as TDX
> > > memory.  If we don't do any code check in memory hotplug in this series, then
> > > when this code runs on future platforms, a user can plug NVDIMM or CXL memory as
> > > system RAM and thus break the assumption "all pages in the page allocator are
> > > TDX memory", which can eventually lead to live TDs being killed.
> > >
> > > Dave said we need to guarantee this code can work on *ANY* TDX machine.  Some
> > > documentation saying it only works on some platforms and that you shouldn't do
> > > things on other platforms is not good enough:
> > > 
> > > https://lore.kernel.org/lkml/cover.1649219184.git.kai.huang@intel.com/T/#m6df45b6e1702bb03dcb027044a0dabf30a86e471
> > 
> > Yes, the incompatible cases cannot be ignored, but I disagree that
> > they actively need to be prevented. One way to achieve that is to
> > explicitly enumerate TDX capable memory and document how mempolicy can
> > be used to avoid killing TDs.
> 
> Hi Dan,
> 
> Thanks for feedback.
> 
> Could you elaborate on what "explicitly enumerate TDX capable memory" means?
> How would the enumeration work exactly?
>
> And for "document how mempolicy can be used to avoid killing TDs", what
> mempolicy (and what error reporting, which you mentioned below) are you
> referring to?
>
> I skipped replying to your two replies below as I think they refer to the same
> "enumerate" and "mempolicy" that I am asking about above.
> 
> 

Hi Dan,

I guess "explicitly enumerate TDX capable memory" means getting the Convertible
Memory Regions (CMR).  And "document how mempolicy can be used to avoid killing
TDs" means we say something like below in the documentation?

	Any non TDX capable memory hot-add will result in non TDX capable pages
	being potentially allocated to a TD, in which case a TD may fail to be
	created or a live TD may be killed at runtime.

And "error reporting" do you mean in memory hot-add code path, we check whether
the new memory resource is TDX capable, if not we print some error similar to
above message in documentation, but still allow the memory hot-add to happen?

Something like below in add_memory_resource()?

	/* range_covered_by_cmrs() is a made-up helper for illustration. */
	if (platform_has_tdx() &&
	    !range_covered_by_cmrs(res->start, resource_size(res)))
		pr_err("Hot-adding non-TDX memory on a TDX capable system. A TD may fail to be created, or a live TD may be killed at runtime.\n");

	/* allow memory hot-add anyway */


I have the below concerns about this approach:

1) I think we should provide a consistent service to the user, that is, we
either guarantee that a TD won't randomly fail to be created and a running TD
won't be killed at runtime, or we don't provide any TDX functionality at all.
So I am not sure that only "documenting how mempolicy can be used to avoid
killing TDs" is good enough.

2) The above check of whether a new memory resource is in the CMRs requires the
kernel to get the CMRs during boot.  However, getting the CMRs requires calling
SEAMCALL, which in turn requires the kernel to support VMXON/VMXOFF.
VMXON/VMXOFF is currently only handled by KVM.  We'd like to avoid adding
VMXON/VMXOFF to the core kernel now unless it is mandatory, as eventually we
will very likely need a reference-based approach to calling VMXON/VMXOFF.  This
part is explained in the cover letter of this series.

Dave suggested that, to keep things simple for now, we can use a "winner take
all" approach: if TDX is initialized first, don't allow memory hotplug; if
memory hotplug happens first, don't allow TDX to be initialized.

https://lore.kernel.org/lkml/cover.1649219184.git.kai.huang@intel.com/T/#mfa6b5dcc536d8a7b78522f46ccd1230f84d52ae0

I think this is perhaps more reasonable as we are at least providing a
consistent service to the user.  And with this approach we don't need to handle
VMXON/VMXOFF in the core kernel.
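
To be concrete, I think the "winner take all" interlock could be as simple as
something like below (just a sketch, all names are made up for illustration;
needs <linux/mutex.h>):

/* Sketch only -- names are made up. */
static DEFINE_MUTEX(tdx_winner_mutex);
static bool tdx_module_initialized;
static bool memory_hotplug_done;

/* Called from the memory hot-add path. */
int tdx_arch_memory_hotplug_allowed(void)
{
	int ret = 0;

	mutex_lock(&tdx_winner_mutex);
	if (tdx_module_initialized)
		ret = -EBUSY;		/* TDX won: reject hot-add */
	else
		memory_hotplug_done = true;
	mutex_unlock(&tdx_winner_mutex);

	return ret;
}

/* Called at the beginning of TDX module initialization. */
int tdx_init_begin(void)
{
	int ret = 0;

	mutex_lock(&tdx_winner_mutex);
	if (memory_hotplug_done)
		ret = -EBUSY;		/* hotplug won: refuse to init TDX */
	else
		tdx_module_initialized = true;
	mutex_unlock(&tdx_winner_mutex);

	return ret;
}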

Comments?