Message ID | cover.1649219184.git.kai.huang@intel.com |
---|---|
Series | TDX host kernel support |
On Wed, 2022-04-06 at 16:49 +1200, Kai Huang wrote: > Intel Trust Domain Extensions (TDX) protects guest VMs from a malicious > host and certain physical attacks. This series provides support for > initializing the TDX module in the host kernel. KVM support for TDX is > being developed separately[1]. > > The code has been tested on a couple of TDX-capable machines. I would > consider it ready for review. I would highly appreciate it if anyone could > help review this series (from the high-level design to the detailed implementation). > For Intel reviewers (CC'ed), please help review, and I would > appreciate Reviewed-by or Acked-by tags if the patches look good to you. Hi Intel reviewers, Kindly ping. Could you help review this series? -- Thanks, -Kai
On 4/5/22 21:49, Kai Huang wrote: > SEAM VMX root operation is designed to host a CPU-attested, software > module called the 'TDX module' which implements functions to manage > crypto protected VMs called Trust Domains (TD). SEAM VMX root is also "crypto protected"? What the heck is that? > designed to host a CPU-attested, software module called the 'Intel > Persistent SEAMLDR (Intel P-SEAMLDR)' to load and update the TDX module. > > Host kernel transits to either the P-SEAMLDR or the TDX module via a new ^ The > SEAMCALL instruction. SEAMCALLs are host-side interface functions > defined by the P-SEAMLDR and the TDX module around the new SEAMCALL > instruction. They are similar to a hypercall, except they are made by > host kernel to the SEAM software modules. This is still missing some important high-level things, like that the TDX module is protected from the untrusted VMM. Heck, it forgets to mention that the VMM itself is untrusted and the TDX module replaces things that the VMM usually does. It would also be nice to mention here how this compares with SEV-SNP. Where is the TDX module in that design? Why doesn't SEV need all this code? > TDX leverages Intel Multi-Key Total Memory Encryption (MKTME) to crypto > protect TD guests. TDX reserves part of MKTME KeyID space as TDX private > KeyIDs, which can only be used by software runs in SEAM. The physical ^ which > address bits for encoding TDX private KeyID are treated as reserved bits > when not in SEAM operation. The partitioning of MKTME KeyIDs and TDX > private KeyIDs is configured by BIOS. > > Before being able to manage TD guests, the TDX module must be loaded > and properly initialized using SEAMCALLs defined by TDX architecture. > This series assumes both the P-SEAMLDR and the TDX module are loaded by > BIOS before the kernel boots. > > There's no CPUID or MSR to detect either the P-SEAMLDR or the TDX module. > Instead, detecting them can be done by using P-SEAMLDR's SEAMLDR.INFO > SEAMCALL to detect P-SEAMLDR. The success of this SEAMCALL means the > P-SEAMLDR is loaded. The P-SEAMLDR information returned by this > SEAMCALL further tells whether TDX module is loaded. There's a bit of information missing here. The kernel might not know the state of things being loaded. A previous kernel might have loaded it and left it in an unknown state. > The TDX module is initialized in multiple steps: > > 1) Global initialization; > 2) Logical-CPU scope initialization; > 3) Enumerate the TDX module capabilities; > 4) Configure the TDX module about usable memory ranges and > global KeyID information; > 5) Package-scope configuration for the global KeyID; > 6) Initialize TDX metadata for usable memory ranges based on 4). > > Step 2) requires calling some SEAMCALL on all "BIOS-enabled" (in MADT > table) logical cpus, otherwise step 4) will fail. Step 5) requires > calling SEAMCALL on at least one cpu on all packages. > > TDX module can also be shut down at any time during module's lifetime, by > calling SEAMCALL on all "BIOS-enabled" logical cpus. > > == Design Considerations == > > 1. Lazy TDX module initialization on-demand by caller This doesn't really tell us what "lazy" is or what the alternatives are. There are basically two ways the TDX module could be loaded. Either: * In early boot or * At runtime just before the first TDX guest is run This series implements the runtime loading. > None of the steps in the TDX module initialization process must be done > during kernel boot. 
This series doesn't initialize TDX at boot time, but > instead, provides two functions to allow caller to detect and initialize > TDX on demand: > > if (tdx_detect()) > goto no_tdx; > if (tdx_init()) > goto no_tdx; > > This approach has below pros: > > 1) Initializing the TDX module requires to reserve ~1/256th system RAM as > metadata. Enabling TDX on demand allows only to consume this memory when > TDX is truly needed (i.e. when KVM wants to create TD guests). > > 2) Both detecting and initializing the TDX module require calling > SEAMCALL. However, SEAMCALL requires CPU being already in VMX operation > (VMXON has been done). So far, KVM is the only user of TDX, and it > already handles VMXON/VMXOFF. Therefore, letting KVM to initialize TDX > on-demand avoids handling VMXON/VMXOFF (which is not that trivial) in > core-kernel. Also, in long term, likely a reference based VMXON/VMXOFF > approach is needed since more kernel components will need to handle > VMXON/VMXONFF. > > 3) It is more flexible to support "TDX module runtime update" (not in > this series). After updating to the new module at runtime, kernel needs > to go through the initialization process again. For the new module, > it's possible the metadata allocated for the old module cannot be reused > for the new module, and needs to be re-allocated again. > > 2. Kernel policy on TDX memory > > Host kernel is responsible for choosing which memory regions can be used > as TDX memory, and configuring those memory regions to the TDX module by > using an array of "TD Memory Regions" (TDMR), which is a data structure > defined by TDX architecture. This is putting the cart before the horse. Don't define the details up front. The TDX architecture allows the VMM to designate specific memory as usable for TDX private memory. This series chooses to designate _all_ system RAM as TDX to avoid having to modify the page allocator to distinguish TDX and non-TDX-capable memory ... then go on to explain the details. > The first generation of TDX essentially guarantees that all system RAM > memory regions (excluding the memory below 1MB) can be used as TDX > memory. To avoid having to modify the page allocator to distinguish TDX > and non-TDX allocation, this series chooses to use all system RAM as TDX > memory. > > E820 table is used to find all system RAM entries. Following > e820__memblock_setup(), both E820_TYPE_RAM and E820_TYPE_RESERVED_KERN > types are treated as TDX memory, and contiguous ranges in the same NUMA > node are merged together (similar to memblock_add()) before trimming the > non-page-aligned part. This e820 cruft is too much detail for a cover letter. In general, once you start talking about individual functions, you've gone too far in the cover letter. > 3. Memory hotplug > > The first generation of TDX architecturally doesn't support memory > hotplug. And the first generation of TDX-capable platforms don't support > physical memory hotplug. Since it physically cannot happen, this series > doesn't add any check in ACPI memory hotplug code path to disable it. > > A special case of memory hotplug is adding NVDIMM as system RAM using > kmem driver. However the first generation of TDX-capable platforms > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot > happen either. What prevents this code from today's code being run on tomorrow's platforms and breaking these assumptions? 
> Another case is admin can use 'memmap' kernel command line to create > legacy PMEMs and use them as TD guest memory, or theoretically, can use > kmem driver to add them as system RAM. To avoid having to change memory > hotplug code to prevent this from happening, this series always include > legacy PMEMs when constructing TDMRs so they are also TDX memory. > > 4. CPU hotplug > > The first generation of TDX architecturally doesn't support ACPI CPU > hotplug. All logical cpus are enabled by BIOS in MADT table. Also, the > first generation of TDX-capable platforms don't support ACPI CPU hotplug > either. Since this physically cannot happen, this series doesn't add any > check in ACPI CPU hotplug code path to disable it. > > Also, only TDX module initialization requires all BIOS-enabled cpus are > online. After the initialization, any logical cpu can be brought down > and brought up to online again later. Therefore this series doesn't > change logical CPU hotplug either. > > 5. TDX interaction with kexec() > > If TDX is ever enabled and/or used to run any TD guests, the cachelines > of TDX private memory, including PAMTs, used by TDX module need to be > flushed before transiting to the new kernel otherwise they may silently > corrupt the new kernel. Similar to SME, this series flushes cache in > stop_this_cpu(). What does this have to do with kexec()? What's a PAMT? > The TDX module can be initialized only once during its lifetime. The > first generation of TDX doesn't have interface to reset TDX module to ^ an > uninitialized state so it can be initialized again. > > This implies: > > - If the old kernel fails to initialize TDX, the new kernel cannot > use TDX too unless the new kernel fixes the bug which leads to > initialization failure in the old kernel and can resume from where > the old kernel stops. This requires certain coordination between > the two kernels. OK, but what does this *MEAN*? > - If the old kernel has initialized TDX successfully, the new kernel > may be able to use TDX if the two kernels have the exactly same > configurations on the TDX module. It further requires the new kernel > to reserve the TDX metadata pages (allocated by the old kernel) in > its page allocator. It also requires coordination between the two > kernels. Furthermore, if kexec() is done when there are active TD > guests running, the new kernel cannot use TDX because it's extremely > hard for the old kernel to pass all TDX private pages to the new > kernel. > > Given that, this series doesn't support TDX after kexec() (except the > old kernel doesn't attempt to initialize TDX at all). > > And this series doesn't shut down TDX module but leaves it open during > kexec(). It is because shutting down TDX module requires CPU being in > VMX operation but there's no guarantee of this during kexec(). Leaving > the TDX module open is not the best case, but it is OK since the new > kernel won't be able to use TDX anyway (therefore TDX module won't run > at all). tl;dr: kexec() doesn't work with this code. Right? That doesn't seem good.
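For reference, a minimal caller-side sketch of the on-demand flow described in the quoted cover letter, assuming the tdx_detect()/tdx_init() interfaces named there and a caller such as KVM that has already done VMXON; the function name and the enable_tdx flag below are made up for illustration and are not part of the series:

	static bool enable_tdx;

	/* Called on demand, e.g. when KVM loads, rather than at kernel boot. */
	static int maybe_enable_tdx(void)
	{
		/*
		 * Both calls issue SEAMCALLs, so the CPU must already be in
		 * VMX operation (VMXON done by the caller).
		 */
		if (tdx_detect())	/* is the P-SEAMLDR/TDX module present? */
			return -ENODEV;
		if (tdx_init())		/* one-shot, multi-step module initialization */
			return -ENODEV;

		enable_tdx = true;	/* TD guests can be created from now on */
		return 0;
	}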
On Tue, 2022-04-26 at 13:13 -0700, Dave Hansen wrote: > On 4/5/22 21:49, Kai Huang wrote: > > SEAM VMX root operation is designed to host a CPU-attested, software > > module called the 'TDX module' which implements functions to manage > > crypto protected VMs called Trust Domains (TD). SEAM VMX root is also > > "crypto protected"? What the heck is that? How about "crypto-protected"? I googled and it seems it is used by someone else. > > > designed to host a CPU-attested, software module called the 'Intel > > Persistent SEAMLDR (Intel P-SEAMLDR)' to load and update the TDX module. > > > > Host kernel transits to either the P-SEAMLDR or the TDX module via a new > > ^ The Thanks. > > > SEAMCALL instruction. SEAMCALLs are host-side interface functions > > defined by the P-SEAMLDR and the TDX module around the new SEAMCALL > > instruction. They are similar to a hypercall, except they are made by > > host kernel to the SEAM software modules. > > This is still missing some important high-level things, like that the > TDX module is protected from the untrusted VMM. Heck, it forgets to > mention that the VMM itself is untrusted and the TDX module replaces > things that the VMM usually does. > > It would also be nice to mention here how this compares with SEV-SNP. > Where is the TDX module in that design? Why doesn't SEV need all this code? > > > TDX leverages Intel Multi-Key Total Memory Encryption (MKTME) to crypto > > protect TD guests. TDX reserves part of MKTME KeyID space as TDX private > > KeyIDs, which can only be used by software runs in SEAM. The physical > > ^ which Thanks. > > > address bits for encoding TDX private KeyID are treated as reserved bits > > when not in SEAM operation. The partitioning of MKTME KeyIDs and TDX > > private KeyIDs is configured by BIOS. > > > > Before being able to manage TD guests, the TDX module must be loaded > > and properly initialized using SEAMCALLs defined by TDX architecture. > > This series assumes both the P-SEAMLDR and the TDX module are loaded by > > BIOS before the kernel boots. > > > > There's no CPUID or MSR to detect either the P-SEAMLDR or the TDX module. > > Instead, detecting them can be done by using P-SEAMLDR's SEAMLDR.INFO > > SEAMCALL to detect P-SEAMLDR. The success of this SEAMCALL means the > > P-SEAMLDR is loaded. The P-SEAMLDR information returned by this > > SEAMCALL further tells whether TDX module is loaded. > > There's a bit of information missing here. The kernel might not know > the state of things being loaded. A previous kernel might have loaded > it and left it in an unknown state. > > > The TDX module is initialized in multiple steps: > > > > 1) Global initialization; > > 2) Logical-CPU scope initialization; > > 3) Enumerate the TDX module capabilities; > > 4) Configure the TDX module about usable memory ranges and > > global KeyID information; > > 5) Package-scope configuration for the global KeyID; > > 6) Initialize TDX metadata for usable memory ranges based on 4). > > > > Step 2) requires calling some SEAMCALL on all "BIOS-enabled" (in MADT > > table) logical cpus, otherwise step 4) will fail. Step 5) requires > > calling SEAMCALL on at least one cpu on all packages. > > > > TDX module can also be shut down at any time during module's lifetime, by > > calling SEAMCALL on all "BIOS-enabled" logical cpus. > > > > == Design Considerations == > > > > 1. Lazy TDX module initialization on-demand by caller > > This doesn't really tell us what "lazy" is or what the alternatives are. 
> > There are basically two ways the TDX module could be loaded. Either: > * In early boot > or > * At runtime just before the first TDX guest is run > > This series implements the runtime loading. OK will do. > > > None of the steps in the TDX module initialization process must be done > > during kernel boot. This series doesn't initialize TDX at boot time, but > > instead, provides two functions to allow caller to detect and initialize > > TDX on demand: > > > > if (tdx_detect()) > > goto no_tdx; > > if (tdx_init()) > > goto no_tdx; > > > > This approach has below pros: > > > > 1) Initializing the TDX module requires to reserve ~1/256th system RAM as > > metadata. Enabling TDX on demand allows only to consume this memory when > > TDX is truly needed (i.e. when KVM wants to create TD guests). > > > > 2) Both detecting and initializing the TDX module require calling > > SEAMCALL. However, SEAMCALL requires CPU being already in VMX operation > > (VMXON has been done). So far, KVM is the only user of TDX, and it > > already handles VMXON/VMXOFF. Therefore, letting KVM to initialize TDX > > on-demand avoids handling VMXON/VMXOFF (which is not that trivial) in > > core-kernel. Also, in long term, likely a reference based VMXON/VMXOFF > > approach is needed since more kernel components will need to handle > > VMXON/VMXONFF. > > > > 3) It is more flexible to support "TDX module runtime update" (not in > > this series). After updating to the new module at runtime, kernel needs > > to go through the initialization process again. For the new module, > > it's possible the metadata allocated for the old module cannot be reused > > for the new module, and needs to be re-allocated again. > > > > 2. Kernel policy on TDX memory > > > > Host kernel is responsible for choosing which memory regions can be used > > as TDX memory, and configuring those memory regions to the TDX module by > > using an array of "TD Memory Regions" (TDMR), which is a data structure > > defined by TDX architecture. > > > This is putting the cart before the horse. Don't define the details up > front. > > The TDX architecture allows the VMM to designate specific memory > as usable for TDX private memory. This series chooses to > designate _all_ system RAM as TDX to avoid having to modify the > page allocator to distinguish TDX and non-TDX-capable memory > > ... then go on to explain the details. Thanks. Will update. > > > The first generation of TDX essentially guarantees that all system RAM > > memory regions (excluding the memory below 1MB) can be used as TDX > > memory. To avoid having to modify the page allocator to distinguish TDX > > and non-TDX allocation, this series chooses to use all system RAM as TDX > > memory. > > > > E820 table is used to find all system RAM entries. Following > > e820__memblock_setup(), both E820_TYPE_RAM and E820_TYPE_RESERVED_KERN > > types are treated as TDX memory, and contiguous ranges in the same NUMA > > node are merged together (similar to memblock_add()) before trimming the > > non-page-aligned part. > > This e820 cruft is too much detail for a cover letter. In general, once > you start talking about individual functions, you've gone too far in the > cover letter. Will remove. > > > 3. Memory hotplug > > > > The first generation of TDX architecturally doesn't support memory > > hotplug. And the first generation of TDX-capable platforms don't support > > physical memory hotplug. 
Since it physically cannot happen, this series > > doesn't add any check in ACPI memory hotplug code path to disable it. > > > > A special case of memory hotplug is adding NVDIMM as system RAM using > > kmem driver. However the first generation of TDX-capable platforms > > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot > > happen either. > > What prevents this code from today's code being run on tomorrow's > platforms and breaking these assumptions? I forgot to add below (which is in the documentation patch): "This can be enhanced when future generation of TDX starts to support ACPI memory hotplug, or NVDIMM and TDX can be enabled simultaneously on the same platform." Is this acceptable? > > > Another case is admin can use 'memmap' kernel command line to create > > legacy PMEMs and use them as TD guest memory, or theoretically, can use > > kmem driver to add them as system RAM. To avoid having to change memory > > hotplug code to prevent this from happening, this series always include > > legacy PMEMs when constructing TDMRs so they are also TDX memory. > > > > 4. CPU hotplug > > > > The first generation of TDX architecturally doesn't support ACPI CPU > > hotplug. All logical cpus are enabled by BIOS in MADT table. Also, the > > first generation of TDX-capable platforms don't support ACPI CPU hotplug > > either. Since this physically cannot happen, this series doesn't add any > > check in ACPI CPU hotplug code path to disable it. > > > > Also, only TDX module initialization requires all BIOS-enabled cpus are > > online. After the initialization, any logical cpu can be brought down > > and brought up to online again later. Therefore this series doesn't > > change logical CPU hotplug either. > > > > 5. TDX interaction with kexec() > > > > If TDX is ever enabled and/or used to run any TD guests, the cachelines > > of TDX private memory, including PAMTs, used by TDX module need to be > > flushed before transiting to the new kernel otherwise they may silently > > corrupt the new kernel. Similar to SME, this series flushes cache in > > stop_this_cpu(). > > What does this have to do with kexec()? What's a PAMT? The point is the dirty cachelines of TDX private memory must be flushed otherwise they may slightly corrupt the new kexec()-ed kernel. Will use "TDX metadata" instead of "PAMT". The former has already been mentioned above. > > > The TDX module can be initialized only once during its lifetime. The > > first generation of TDX doesn't have interface to reset TDX module to > > ^ an Thanks. > > > uninitialized state so it can be initialized again. > > > > This implies: > > > > - If the old kernel fails to initialize TDX, the new kernel cannot > > use TDX too unless the new kernel fixes the bug which leads to > > initialization failure in the old kernel and can resume from where > > the old kernel stops. This requires certain coordination between > > the two kernels. > > OK, but what does this *MEAN*? This means we need to extend the information which the old kernel passes to the new kernel. But I don't think it's feasible. I'll refine this kexec() section to make it more concise next version. > > > - If the old kernel has initialized TDX successfully, the new kernel > > may be able to use TDX if the two kernels have the exactly same > > configurations on the TDX module. It further requires the new kernel > > to reserve the TDX metadata pages (allocated by the old kernel) in > > its page allocator. It also requires coordination between the two > > kernels. 
Furthermore, if kexec() is done when there are active TD > > guests running, the new kernel cannot use TDX because it's extremely > > hard for the old kernel to pass all TDX private pages to the new > > kernel. > > > > Given that, this series doesn't support TDX after kexec() (except the > > old kernel doesn't attempt to initialize TDX at all). > > > > And this series doesn't shut down TDX module but leaves it open during > > kexec(). It is because shutting down TDX module requires CPU being in > > VMX operation but there's no guarantee of this during kexec(). Leaving > > the TDX module open is not the best case, but it is OK since the new > > kernel won't be able to use TDX anyway (therefore TDX module won't run > > at all). > > tl;dr: kexec() doesn't work with this code. > > Right? > > That doesn't seem good. It can work in my understanding. We just need to flush cache before booting to the new kernel.
On 4/26/22 18:15, Kai Huang wrote: > On Tue, 2022-04-26 at 13:13 -0700, Dave Hansen wrote: >> On 4/5/22 21:49, Kai Huang wrote: >>> SEAM VMX root operation is designed to host a CPU-attested, software >>> module called the 'TDX module' which implements functions to manage >>> crypto protected VMs called Trust Domains (TD). SEAM VMX root is also >> >> "crypto protected"? What the heck is that? > > How about "crypto-protected"? I googled and it seems it is used by someone > else. Cryptography itself doesn't provide (much) protection in the TDX architecture. TDX guests are isolated from the VMM in ways that traditional guests are not, but that has almost nothing to do with cryptography. Is it cryptography that keeps the host from reading guest private data in the clear? Is it cryptography that keeps the host from reading guest ciphertext? Does cryptography enforce the extra rules of Secure-EPT? >>> 3. Memory hotplug >>> >>> The first generation of TDX architecturally doesn't support memory >>> hotplug. And the first generation of TDX-capable platforms don't support >>> physical memory hotplug. Since it physically cannot happen, this series >>> doesn't add any check in ACPI memory hotplug code path to disable it. >>> >>> A special case of memory hotplug is adding NVDIMM as system RAM using >>> kmem driver. However the first generation of TDX-capable platforms >>> cannot enable TDX and NVDIMM simultaneously, so in practice this cannot >>> happen either. >> >> What prevents this code from today's code being run on tomorrow's >> platforms and breaking these assumptions? > > I forgot to add below (which is in the documentation patch): > > "This can be enhanced when future generation of TDX starts to support ACPI > memory hotplug, or NVDIMM and TDX can be enabled simultaneously on the > same platform." > > Is this acceptable? No, Kai. You're basically saying: *this* code doesn't work with feature A, B and C. Then, you're pivoting to say that it doesn't matter because one version of Intel's hardware doesn't support A, B, or C. I don't care about this *ONE* version of the hardware. I care about *ALL* the hardware that this code will ever support. *ALL* the hardware on which this code will run. In 5 years, if someone takes this code and runs it on Intel hardware with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens? You can't just ignore the problems because they're not present on one version of the hardware. >>> Another case is admin can use 'memmap' kernel command line to create >>> legacy PMEMs and use them as TD guest memory, or theoretically, can use >>> kmem driver to add them as system RAM. To avoid having to change memory >>> hotplug code to prevent this from happening, this series always include >>> legacy PMEMs when constructing TDMRs so they are also TDX memory. >>> >>> 4. CPU hotplug >>> >>> The first generation of TDX architecturally doesn't support ACPI CPU >>> hotplug. All logical cpus are enabled by BIOS in MADT table. Also, the >>> first generation of TDX-capable platforms don't support ACPI CPU hotplug >>> either. Since this physically cannot happen, this series doesn't add any >>> check in ACPI CPU hotplug code path to disable it. >>> >>> Also, only TDX module initialization requires all BIOS-enabled cpus are >>> online. After the initialization, any logical cpu can be brought down >>> and brought up to online again later. Therefore this series doesn't >>> change logical CPU hotplug either. >>> >>> 5. 
TDX interaction with kexec() >>> >>> If TDX is ever enabled and/or used to run any TD guests, the cachelines >>> of TDX private memory, including PAMTs, used by TDX module need to be >>> flushed before transiting to the new kernel otherwise they may silently >>> corrupt the new kernel. Similar to SME, this series flushes cache in >>> stop_this_cpu(). >> >> What does this have to do with kexec()? What's a PAMT? > > The point is the dirty cachelines of TDX private memory must be flushed > otherwise they may slightly corrupt the new kexec()-ed kernel. > > Will use "TDX metadata" instead of "PAMT". The former has already been > mentioned above. Longer description for the patch itself: TDX memory encryption is built on top of MKTME which uses physical address aliases to designate encryption keys. This architecture is not cache coherent. Software is responsible for flushing the CPU caches when memory changes keys. When kexec()'ing, memory can be repurposed from TDX use to non-TDX use, changing the effective encryption key. Cover-letter-level description: Just like SME, TDX hosts require special cache flushing before kexec(). >>> uninitialized state so it can be initialized again. >>> >>> This implies: >>> >>> - If the old kernel fails to initialize TDX, the new kernel cannot >>> use TDX too unless the new kernel fixes the bug which leads to >>> initialization failure in the old kernel and can resume from where >>> the old kernel stops. This requires certain coordination between >>> the two kernels. >> >> OK, but what does this *MEAN*? > > This means we need to extend the information which the old kernel passes to the > new kernel. But I don't think it's feasible. I'll refine this kexec() section > to make it more concise next version. > >> >>> - If the old kernel has initialized TDX successfully, the new kernel >>> may be able to use TDX if the two kernels have the exactly same >>> configurations on the TDX module. It further requires the new kernel >>> to reserve the TDX metadata pages (allocated by the old kernel) in >>> its page allocator. It also requires coordination between the two >>> kernels. Furthermore, if kexec() is done when there are active TD >>> guests running, the new kernel cannot use TDX because it's extremely >>> hard for the old kernel to pass all TDX private pages to the new >>> kernel. >>> >>> Given that, this series doesn't support TDX after kexec() (except the >>> old kernel doesn't attempt to initialize TDX at all). >>> >>> And this series doesn't shut down TDX module but leaves it open during >>> kexec(). It is because shutting down TDX module requires CPU being in >>> VMX operation but there's no guarantee of this during kexec(). Leaving >>> the TDX module open is not the best case, but it is OK since the new >>> kernel won't be able to use TDX anyway (therefore TDX module won't run >>> at all). >> >> tl;dr: kexec() doesn't work with this code. >> >> Right? >> >> That doesn't seem good. > > It can work in my understanding. We just need to flush cache before booting to > the new kernel. What about all the concerns about TDX module configuration changing?
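The "just like SME" flushing referred to above is the wbinvd that stop_this_cpu() already performs on SME-capable systems; a TDX host would presumably need the analogous flush before handing over to the kexec()'ed kernel. A rough sketch, where platform_tdx_enabled() is a hypothetical helper and not an existing kernel API:

	/* Fragment of stop_this_cpu(), sketch only. */
	if (boot_cpu_has(X86_FEATURE_SME) || platform_tdx_enabled())
		native_wbinvd();	/* flush dirty cachelines, incl. TDX private memory */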
On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote: > On 4/26/22 18:15, Kai Huang wrote: > > On Tue, 2022-04-26 at 13:13 -0700, Dave Hansen wrote: > > > On 4/5/22 21:49, Kai Huang wrote: > > > > SEAM VMX root operation is designed to host a CPU-attested, software > > > > module called the 'TDX module' which implements functions to manage > > > > crypto protected VMs called Trust Domains (TD). SEAM VMX root is also > > > > > > "crypto protected"? What the heck is that? > > > > How about "crypto-protected"? I googled and it seems it is used by someone > > else. > > Cryptography itself doesn't provide (much) protection in the TDX > architecture. TDX guests are isolated from the VMM in ways that > traditional guests are not, but that has almost nothing to do with > cryptography. > > Is it cryptography that keeps the host from reading guest private data > in the clear? Is it cryptography that keeps the host from reading guest > ciphertext? Does cryptography enforce the extra rules of Secure-EPT? OK will change to "protected VMs" in this entire series. > > > > > 3. Memory hotplug > > > > > > > > The first generation of TDX architecturally doesn't support memory > > > > hotplug. And the first generation of TDX-capable platforms don't support > > > > physical memory hotplug. Since it physically cannot happen, this series > > > > doesn't add any check in ACPI memory hotplug code path to disable it. > > > > > > > > A special case of memory hotplug is adding NVDIMM as system RAM using > > > > kmem driver. However the first generation of TDX-capable platforms > > > > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot > > > > happen either. > > > > > > What prevents this code from today's code being run on tomorrow's > > > platforms and breaking these assumptions? > > > > I forgot to add below (which is in the documentation patch): > > > > "This can be enhanced when future generation of TDX starts to support ACPI > > memory hotplug, or NVDIMM and TDX can be enabled simultaneously on the > > same platform." > > > > Is this acceptable? > > No, Kai. > > You're basically saying: *this* code doesn't work with feature A, B and > C. Then, you're pivoting to say that it doesn't matter because one > version of Intel's hardware doesn't support A, B, or C. > > I don't care about this *ONE* version of the hardware. I care about > *ALL* the hardware that this code will ever support. *ALL* the hardware > on which this code will run. > > In 5 years, if someone takes this code and runs it on Intel hardware > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens? I thought we could document this in the documentation saying that this code can only work on TDX machines that don't have above capabilities (SPR for now). We can change the code and the documentation when we add the support of those features in the future, and update the documentation. If 5 years later someone takes this code, he/she should take a look at the documentation and figure out that he/she should choose a newer kernel if the machine support those features. I'll think about design solutions if above doesn't look good for you. > > You can't just ignore the problems because they're not present on one > version of the hardware. > > > > > Another case is admin can use 'memmap' kernel command line to create > > > > legacy PMEMs and use them as TD guest memory, or theoretically, can use > > > > kmem driver to add them as system RAM. 
To avoid having to change memory > > > > hotplug code to prevent this from happening, this series always include > > > > legacy PMEMs when constructing TDMRs so they are also TDX memory. > > > > > > > > 4. CPU hotplug > > > > > > > > The first generation of TDX architecturally doesn't support ACPI CPU > > > > hotplug. All logical cpus are enabled by BIOS in MADT table. Also, the > > > > first generation of TDX-capable platforms don't support ACPI CPU hotplug > > > > either. Since this physically cannot happen, this series doesn't add any > > > > check in ACPI CPU hotplug code path to disable it. > > > > > > > > Also, only TDX module initialization requires all BIOS-enabled cpus are > > > > online. After the initialization, any logical cpu can be brought down > > > > and brought up to online again later. Therefore this series doesn't > > > > change logical CPU hotplug either. > > > > > > > > 5. TDX interaction with kexec() > > > > > > > > If TDX is ever enabled and/or used to run any TD guests, the cachelines > > > > of TDX private memory, including PAMTs, used by TDX module need to be > > > > flushed before transiting to the new kernel otherwise they may silently > > > > corrupt the new kernel. Similar to SME, this series flushes cache in > > > > stop_this_cpu(). > > > > > > What does this have to do with kexec()? What's a PAMT? > > > > The point is the dirty cachelines of TDX private memory must be flushed > > otherwise they may slightly corrupt the new kexec()-ed kernel. > > > > Will use "TDX metadata" instead of "PAMT". The former has already been > > mentioned above. > > Longer description for the patch itself: > > TDX memory encryption is built on top of MKTME which uses physical > address aliases to designate encryption keys. This architecture is not > cache coherent. Software is responsible for flushing the CPU caches > when memory changes keys. When kexec()'ing, memory can be repurposed > from TDX use to non-TDX use, changing the effective encryption key. > > Cover-letter-level description: > > Just like SME, TDX hosts require special cache flushing before kexec(). Thanks. > > > > > uninitialized state so it can be initialized again. > > > > > > > > This implies: > > > > > > > > - If the old kernel fails to initialize TDX, the new kernel cannot > > > > use TDX too unless the new kernel fixes the bug which leads to > > > > initialization failure in the old kernel and can resume from where > > > > the old kernel stops. This requires certain coordination between > > > > the two kernels. > > > > > > OK, but what does this *MEAN*? > > > > This means we need to extend the information which the old kernel passes to the > > new kernel. But I don't think it's feasible. I'll refine this kexec() section > > to make it more concise next version. > > > > > > > > > - If the old kernel has initialized TDX successfully, the new kernel > > > > may be able to use TDX if the two kernels have the exactly same > > > > configurations on the TDX module. It further requires the new kernel > > > > to reserve the TDX metadata pages (allocated by the old kernel) in > > > > its page allocator. It also requires coordination between the two > > > > kernels. Furthermore, if kexec() is done when there are active TD > > > > guests running, the new kernel cannot use TDX because it's extremely > > > > hard for the old kernel to pass all TDX private pages to the new > > > > kernel. 
> > > > > > > > Given that, this series doesn't support TDX after kexec() (except the > > > > old kernel doesn't attempt to initialize TDX at all). > > > > > > > > And this series doesn't shut down TDX module but leaves it open during > > > > kexec(). It is because shutting down TDX module requires CPU being in > > > > VMX operation but there's no guarantee of this during kexec(). Leaving > > > > the TDX module open is not the best case, but it is OK since the new > > > > kernel won't be able to use TDX anyway (therefore TDX module won't run > > > > at all). > > > > > > tl;dr: kexec() doesn't work with this code. > > > > > > Right? > > > > > > That doesn't seem good. > > > > It can work in my understanding. We just need to flush cache before booting to > > the new kernel. > > What about all the concerns about TDX module configuration changing? > Leaving the TDX module in fully initialized state or shutdown state (in case of error during it's initialization) to the new kernel is fine. If the new kernel doesn't use TDX at all, then the TDX module won't access memory using it's global TDX KeyID. If the new kernel wants to use TDX, it will fail on the very first SEAMCALL when it tries to initialize the TDX module, and won't use SEAMCALL to call the TDX module again. If the new kernel doesn't follow this, then it is a bug in the new kernel, or the new kernel is malicious, in which case it can potentially corrupt the data. But I don't think we need to consider this as if the new kernel is malicious, then it can corrupt data anyway. Does this make sense? Is there any other concerns that I missed?
On 4/27/22 17:37, Kai Huang wrote: > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote: >> In 5 years, if someone takes this code and runs it on Intel hardware >> with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens? > > I thought we could document this in the documentation saying that this code can > only work on TDX machines that don't have above capabilities (SPR for now). We > can change the code and the documentation when we add the support of those > features in the future, and update the documentation. > > If 5 years later someone takes this code, he/she should take a look at the > documentation and figure out that he/she should choose a newer kernel if the > machine support those features. > > I'll think about design solutions if above doesn't look good for you. No, it doesn't look good to me. You can't just say: /* * This code will eat puppies if used on systems with hotplug. */ and merrily await the puppy bloodbath. If it's not compatible, then you have to *MAKE* it not compatible in a safe, controlled way. >> You can't just ignore the problems because they're not present on one >> version of the hardware. Please, please read this again ^^ >> What about all the concerns about TDX module configuration changing? > > Leaving the TDX module in fully initialized state or shutdown state (in case of > error during it's initialization) to the new kernel is fine. If the new kernel > doesn't use TDX at all, then the TDX module won't access memory using it's > global TDX KeyID. If the new kernel wants to use TDX, it will fail on the very > first SEAMCALL when it tries to initialize the TDX module, and won't use > SEAMCALL to call the TDX module again. If the new kernel doesn't follow this, > then it is a bug in the new kernel, or the new kernel is malicious, in which > case it can potentially corrupt the data. But I don't think we need to consider > this as if the new kernel is malicious, then it can corrupt data anyway. > > Does this make sense? No, I'm pretty lost. But, I'll look at the next version of this with fresh eyes and hopefully you'll have had time to streamline the text by then.
On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote: > On 4/27/22 17:37, Kai Huang wrote: > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote: > > > In 5 years, if someone takes this code and runs it on Intel hardware > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens? > > > > I thought we could document this in the documentation saying that this code can > > only work on TDX machines that don't have above capabilities (SPR for now). We > > can change the code and the documentation when we add the support of those > > features in the future, and update the documentation. > > > > If 5 years later someone takes this code, he/she should take a look at the > > documentation and figure out that he/she should choose a newer kernel if the > > machine support those features. > > > > I'll think about design solutions if above doesn't look good for you. > > No, it doesn't look good to me. > > You can't just say: > > /* > * This code will eat puppies if used on systems with hotplug. > */ > > and merrily await the puppy bloodbath. > > If it's not compatible, then you have to *MAKE* it not compatible in a > safe, controlled way. > > > > You can't just ignore the problems because they're not present on one > > > version of the hardware. > > Please, please read this again ^^ OK. I'll think about solutions and come back later. > > > > What about all the concerns about TDX module configuration changing? > > > > Leaving the TDX module in fully initialized state or shutdown state (in case of > > error during it's initialization) to the new kernel is fine. If the new kernel > > doesn't use TDX at all, then the TDX module won't access memory using it's > > global TDX KeyID. If the new kernel wants to use TDX, it will fail on the very > > first SEAMCALL when it tries to initialize the TDX module, and won't use > > SEAMCALL to call the TDX module again. If the new kernel doesn't follow this, > > then it is a bug in the new kernel, or the new kernel is malicious, in which > > case it can potentially corrupt the data. But I don't think we need to consider > > this as if the new kernel is malicious, then it can corrupt data anyway. > > > > Does this make sense? > > No, I'm pretty lost. But, I'll look at the next version of this with > fresh eyes and hopefully you'll have had time to streamline the text by > then. OK thanks.
On Tue, Apr 26, 2022 at 1:10 PM Dave Hansen <dave.hansen@intel.com> wrote: [..] > > 3. Memory hotplug > > > > The first generation of TDX architecturally doesn't support memory > > hotplug. And the first generation of TDX-capable platforms don't support > > physical memory hotplug. Since it physically cannot happen, this series > > doesn't add any check in ACPI memory hotplug code path to disable it. > > > > A special case of memory hotplug is adding NVDIMM as system RAM using Saw "NVDIMM" mentioned while browsing this, so stopped to make a comment... > > kmem driver. However the first generation of TDX-capable platforms > > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot > > happen either. > > What prevents this code from today's code being run on tomorrow's > platforms and breaking these assumptions? The assumption is already broken today with NVDIMM-N. The lack of DDR-T support on TDX enabled platforms has zero effect on DDR-based persistent memory solutions. In other words, please describe the actual software and hardware conflicts at play here, and do not make the mistake of assuming that "no DDR-T support on TDX platforms" == "no NVDIMM support". > > Another case is admin can use 'memmap' kernel command line to create > > legacy PMEMs and use them as TD guest memory, or theoretically, can use > > kmem driver to add them as system RAM. To avoid having to change memory > > hotplug code to prevent this from happening, this series always include > > legacy PMEMs when constructing TDMRs so they are also TDX memory. I am not sure what you are trying to say here? > > 4. CPU hotplug > > > > The first generation of TDX architecturally doesn't support ACPI CPU > > hotplug. All logical cpus are enabled by BIOS in MADT table. Also, the > > first generation of TDX-capable platforms don't support ACPI CPU hotplug > > either. Since this physically cannot happen, this series doesn't add any > > check in ACPI CPU hotplug code path to disable it. What are the actual challenges posed to TDX with respect to CPU hotplug? > > Also, only TDX module initialization requires all BIOS-enabled cpus are Please define "BIOS-enabled" cpus. There is no "BIOS-enabled" line in /proc/cpuinfo for example.
On Wed, 2022-04-27 at 18:01 -0700, Dan Williams wrote: > On Tue, Apr 26, 2022 at 1:10 PM Dave Hansen <dave.hansen@intel.com> wrote: > [..] > > > 3. Memory hotplug > > > > > > The first generation of TDX architecturally doesn't support memory > > > hotplug. And the first generation of TDX-capable platforms don't support > > > physical memory hotplug. Since it physically cannot happen, this series > > > doesn't add any check in ACPI memory hotplug code path to disable it. > > > > > > A special case of memory hotplug is adding NVDIMM as system RAM using > > Saw "NVDIMM" mentioned while browsing this, so stopped to make a comment... > > > > kmem driver. However the first generation of TDX-capable platforms > > > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot > > > happen either. > > > > What prevents this code from today's code being run on tomorrow's > > platforms and breaking these assumptions? > > The assumption is already broken today with NVDIMM-N. The lack of > DDR-T support on TDX enabled platforms has zero effect on DDR-based > persistent memory solutions. In other words, please describe the > actual software and hardware conflicts at play here, and do not make > the mistake of assuming that "no DDR-T support on TDX platforms" == > "no NVDIMM support". Sorry I got this information from planning team or execution team I guess. I was told NVDIMM and TDX cannot "co-exist" on the first generation of TDX capable machine. "co-exist" means they cannot be turned on simultaneously on the same platform. I am also not aware NVDIMM-N, nor the difference between DDR based and DDR-T based persistent memory. Could you give some more background here so I can take a look? > > > > Another case is admin can use 'memmap' kernel command line to create > > > legacy PMEMs and use them as TD guest memory, or theoretically, can use > > > kmem driver to add them as system RAM. To avoid having to change memory > > > hotplug code to prevent this from happening, this series always include > > > legacy PMEMs when constructing TDMRs so they are also TDX memory. > > I am not sure what you are trying to say here? We want to always make sure the memory managed by page allocator is TDX memory. So if the legacy PMEMs are unconditionally configured as TDX memory, then we don't need to prevent them from being added as system memory via kmem driver. > > > > 4. CPU hotplug > > > > > > The first generation of TDX architecturally doesn't support ACPI CPU > > > hotplug. All logical cpus are enabled by BIOS in MADT table. Also, the > > > first generation of TDX-capable platforms don't support ACPI CPU hotplug > > > either. Since this physically cannot happen, this series doesn't add any > > > check in ACPI CPU hotplug code path to disable it. > > What are the actual challenges posed to TDX with respect to CPU hotplug? During the TDX module initialization, there is a step to call SEAMCALL on all logical cpus to initialize per-cpu TDX staff. TDX doesn't support initializing the new hot-added CPUs after the initialization. There are MCHECK/BIOS changes to enforce this check too I guess but I don't know details about this. > > > > Also, only TDX module initialization requires all BIOS-enabled cpus are > > Please define "BIOS-enabled" cpus. There is no "BIOS-enabled" line in > /proc/cpuinfo for example. It means the CPUs with "enable" bit set in the MADT table.
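Concretely, "the 'enable' bit set in the MADT table" refers to the enabled flag of the MADT local APIC entries; roughly, using the existing ACPI structures (the helper name is made up for illustration):

	#include <linux/acpi.h>

	/* A cpu is "BIOS-enabled" if its MADT LAPIC entry has the enabled flag set. */
	static bool madt_cpu_enabled(struct acpi_madt_local_apic *lapic)
	{
		return lapic->lapic_flags & ACPI_MADT_ENABLED;
	}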
On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote: > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote: > > On 4/27/22 17:37, Kai Huang wrote: > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote: > > > > In 5 years, if someone takes this code and runs it on Intel hardware > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens? > > > > > > I thought we could document this in the documentation saying that this code can > > > only work on TDX machines that don't have above capabilities (SPR for now). We > > > can change the code and the documentation when we add the support of those > > > features in the future, and update the documentation. > > > > > > If 5 years later someone takes this code, he/she should take a look at the > > > documentation and figure out that he/she should choose a newer kernel if the > > > machine support those features. > > > > > > I'll think about design solutions if above doesn't look good for you. > > > > No, it doesn't look good to me. > > > > You can't just say: > > > > /* > > * This code will eat puppies if used on systems with hotplug. > > */ > > > > and merrily await the puppy bloodbath. > > > > If it's not compatible, then you have to *MAKE* it not compatible in a > > safe, controlled way. > > > > > > You can't just ignore the problems because they're not present on one > > > > version of the hardware. > > > > Please, please read this again ^^ > > OK. I'll think about solutions and come back later. > > Hi Dave, I think we have two approaches to handle memory hotplug interaction with the TDX module initialization. The first approach is simple. We just block memory from being added as system RAM managed by page allocator when the platform supports TDX [1]. It seems we can add some arch-specific-check to __add_memory_resource() and reject the new memory resource if platform supports TDX. __add_memory_resource() is called by both __add_memory() and add_memory_driver_managed() so it prevents from adding NVDIMM as system RAM and normal ACPI memory hotplug [2]. The second approach is relatively more complicated. Instead of directly rejecting the new memory resource in __add_memory_resource(), we check whether the memory resource can be added based on CMR and the TDX module initialization status. This is feasible as with the latest public P-SEAMLDR spec, we can get CMR from P-SEAMLDR SEAMCALL[3]. So we can detect P-SEAMLDR and get CMR info during kernel boots. And in __add_memory_resource() we do below check: tdx_init_disable(); /*similar to cpu_hotplug_disable() */ if (tdx_module_initialized()) // reject memory hotplug else if (new_memory_resource NOT in CMRs) // reject memory hotplug else allow memory hotplug tdx_init_enable(); /*similar to cpu_hotplug_enable() */ tdx_init_disable() temporarily disables TDX module initialization by trying to grab the mutex. If the TDX module initialization is already on going, then it waits until it completes. This should work better for future platforms, but would requires non-trivial more code as we need to add VMXON/VMXOFF support to the core-kernel to detect CMR using SEAMCALL. A side advantage is with VMXON in core-kernel we can shutdown the TDX module in kexec(). But for this series I think the second approach is overkill and we can choose to use the first simple approach? Any suggestions? [1] Platform supports TDX means SEAMRR is enabled, and there are at least 2 TDX keyIDs. 
Or we can just check that SEAMRR is enabled, as in practice SEAMRR being enabled means the machine is TDX-capable, and for now a TDX-capable machine doesn't support ACPI memory hotplug. [2] It prevents adding legacy PMEM as system RAM too, but I think that's fine. If the user wants legacy PMEM, it is unlikely they will add it back and use it as system RAM. The user is also unlikely to use legacy PMEM as TD guest memory directly, as TD guests are likely to use a new memfd backend which allows private pages not accessible from userspace, so in this way we can exclude legacy PMEM from TDMRs. [3] Please refer to the SEAMLDR.SEAMINFO SEAMCALL in the latest P-SEAMLDR spec: https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
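Rendered as code, the "approach 2" check sequence proposed above might look roughly like the following; all of the tdx_* helpers are interfaces proposed in this mail, not existing kernel functions, and the hook point into the memory hotplug path is likewise only assumed:

	/* Sketch of the proposed check, called from the memory hotplug path. */
	static int tdx_check_memory_add(u64 start, u64 end)
	{
		int ret = 0;

		tdx_init_disable();	/* like cpu_hotplug_disable(): block tdx_init() */

		if (tdx_module_initialized())
			ret = -EBUSY;	/* module already configured; reject hotplug */
		else if (!tdx_cmr_contains(start, end))
			ret = -EINVAL;	/* range outside the CMRs; reject hotplug */
		/* else: range is within the CMRs, allow the hotplug */

		tdx_init_enable();	/* like cpu_hotplug_enable() */
		return ret;
	}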
On Wed, Apr 27, 2022 at 6:21 PM Kai Huang <kai.huang@intel.com> wrote: > > On Wed, 2022-04-27 at 18:01 -0700, Dan Williams wrote: > > On Tue, Apr 26, 2022 at 1:10 PM Dave Hansen <dave.hansen@intel.com> wrote: > > [..] > > > > 3. Memory hotplug > > > > > > > > The first generation of TDX architecturally doesn't support memory > > > > hotplug. And the first generation of TDX-capable platforms don't support > > > > physical memory hotplug. Since it physically cannot happen, this series > > > > doesn't add any check in ACPI memory hotplug code path to disable it. > > > > > > > > A special case of memory hotplug is adding NVDIMM as system RAM using > > > > Saw "NVDIMM" mentioned while browsing this, so stopped to make a comment... > > > > > > kmem driver. However the first generation of TDX-capable platforms > > > > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot > > > > happen either. > > > > > > What prevents this code from today's code being run on tomorrow's > > > platforms and breaking these assumptions? > > > > The assumption is already broken today with NVDIMM-N. The lack of > > DDR-T support on TDX enabled platforms has zero effect on DDR-based > > persistent memory solutions. In other words, please describe the > > actual software and hardware conflicts at play here, and do not make > > the mistake of assuming that "no DDR-T support on TDX platforms" == > > "no NVDIMM support". > > Sorry I got this information from planning team or execution team I guess. I was > told NVDIMM and TDX cannot "co-exist" on the first generation of TDX capable > machine. "co-exist" means they cannot be turned on simultaneously on the same > platform. I am also not aware NVDIMM-N, nor the difference between DDR based > and DDR-T based persistent memory. Could you give some more background here so > I can take a look? My rough understanding is that TDX makes use of metadata communicated "on the wire" for DDR, but that infrastructure is not there for DDR-T. However, there are plenty of DDR based NVDIMMs that use super-caps / batteries and flash to save contents. I believe the concern for TDX is that the kernel needs to know not use TDX accepted PMEM as PMEM because the contents saved by the DIMM's onboard energy source are unreadable outside of a TD. Here is one of the links that comes up in a search for NVDIMM-N. https://www.snia.org/educational-library/what-you-can-do-nvdimm-n-and-nvdimm-p-2019 > > > > > > > Another case is admin can use 'memmap' kernel command line to create > > > > legacy PMEMs and use them as TD guest memory, or theoretically, can use > > > > kmem driver to add them as system RAM. To avoid having to change memory > > > > hotplug code to prevent this from happening, this series always include > > > > legacy PMEMs when constructing TDMRs so they are also TDX memory. > > > > I am not sure what you are trying to say here? > > We want to always make sure the memory managed by page allocator is TDX memory. That only seems possible if the kernel is given a TDX capable physical address map at the beginning of time. > So if the legacy PMEMs are unconditionally configured as TDX memory, then we > don't need to prevent them from being added as system memory via kmem driver. I think that is too narrow of a focus. Does a memory map exist for the physical address ranges that are TDX capable? Please don't say EFI_MEMORY_CPU_CRYPTO, as that single bit is ambiguous beyond the point of utility across the industry's entire range of confidential computing memory capabilities. 
One strawman would be an ACPI table with contents like: struct acpi_protected_memory { struct range range; uuid_t platform_mem_crypto_capability; }; With some way to map those uuids to a set of platform vendor specific constraints and specifications. Some would be shared across confidential computing vendors, some might be unique. Otherwise, I do not see how you enforce the expectation of "all memory in the page allocator is TDX capable". The other alternative is that *none* of the memory in the page allocator is TDX capable and a special memory allocation device is used to map memory for TDs. In either case a map of all possible TDX memory is needed and the discussion above seems like an incomplete / "hopeful" proposal about the memory dax_kmem, or other sources, might online. See the CXL CEDT CFWMS (CXL Fixed Memory Window Structure) as an example of an ACPI table that sets the kernel's expectations about how a physical address range might be used. https://www.computeexpresslink.org/spec-landing > > > > > > > 4. CPU hotplug > > > > > > > > The first generation of TDX architecturally doesn't support ACPI CPU > > > > hotplug. All logical cpus are enabled by BIOS in MADT table. Also, the > > > > first generation of TDX-capable platforms don't support ACPI CPU hotplug > > > > either. Since this physically cannot happen, this series doesn't add any > > > > check in ACPI CPU hotplug code path to disable it. > > > > What are the actual challenges posed to TDX with respect to CPU hotplug? > > During the TDX module initialization, there is a step to call SEAMCALL on all > logical cpus to initialize per-cpu TDX staff. TDX doesn't support initializing > the new hot-added CPUs after the initialization. There are MCHECK/BIOS changes > to enforce this check too I guess but I don't know details about this. Is there an ACPI table that indicates CPU-x passed the check? Or since the BIOS is invoked in the CPU-online path, is it trusted to suppress those events for CPUs outside of the mcheck domain? > > > > Also, only TDX module initialization requires all BIOS-enabled cpus are > > > > Please define "BIOS-enabled" cpus. There is no "BIOS-enabled" line in > > /proc/cpuinfo for example. > > It means the CPUs with "enable" bit set in the MADT table. That just indicates to the present CPUs and then a hot add event changes the state of now present CPUs to enabled. Per above is the BIOS responsible for rejecting those new CPUs, or is the kernel?
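To make the strawman above concrete, the consumer side could be as simple as requiring any hot-added range to sit entirely inside an entry whose UUID matches the platform's TDX capability; the table layout is the strawman quoted above, and the helper and UUID handling are purely illustrative:

	struct acpi_protected_memory {
		struct range range;			/* physical address range */
		uuid_t platform_mem_crypto_capability;	/* e.g. "TDX convertible" */
	};

	static bool range_is_tdx_convertible(const struct acpi_protected_memory *tbl,
					     int nr_entries, u64 start, u64 end,
					     const uuid_t *tdx_uuid)
	{
		int i;

		for (i = 0; i < nr_entries; i++) {
			if (start >= tbl[i].range.start && end <= tbl[i].range.end &&
			    uuid_equal(&tbl[i].platform_mem_crypto_capability, tdx_uuid))
				return true;
		}
		return false;
	}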
On Thu, Apr 28, 2022 at 6:40 PM Kai Huang <kai.huang@intel.com> wrote: > > On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote: > > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote: > > > On 4/27/22 17:37, Kai Huang wrote: > > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote: > > > > > In 5 years, if someone takes this code and runs it on Intel hardware > > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens? > > > > > > > > I thought we could document this in the documentation saying that this code can > > > > only work on TDX machines that don't have above capabilities (SPR for now). We > > > > can change the code and the documentation when we add the support of those > > > > features in the future, and update the documentation. > > > > > > > > If 5 years later someone takes this code, he/she should take a look at the > > > > documentation and figure out that he/she should choose a newer kernel if the > > > > machine support those features. > > > > > > > > I'll think about design solutions if above doesn't look good for you. > > > > > > No, it doesn't look good to me. > > > > > > You can't just say: > > > > > > /* > > > * This code will eat puppies if used on systems with hotplug. > > > */ > > > > > > and merrily await the puppy bloodbath. > > > > > > If it's not compatible, then you have to *MAKE* it not compatible in a > > > safe, controlled way. > > > > > > > > You can't just ignore the problems because they're not present on one > > > > > version of the hardware. > > > > > > Please, please read this again ^^ > > > > OK. I'll think about solutions and come back later. > > > > > Hi Dave, > > I think we have two approaches to handle memory hotplug interaction with the TDX > module initialization. > > The first approach is simple. We just block memory from being added as system > RAM managed by page allocator when the platform supports TDX [1]. It seems we > can add some arch-specific-check to __add_memory_resource() and reject the new > memory resource if platform supports TDX. __add_memory_resource() is called by > both __add_memory() and add_memory_driver_managed() so it prevents from adding > NVDIMM as system RAM and normal ACPI memory hotplug [2]. What if the memory being added *is* TDX capable? What if someone wanted to manage a memory range as soft-reserved and move it back and forth from the core-mm to device access. That should be perfectly acceptable as long as the memory is TDX capable. > The second approach is relatively more complicated. Instead of directly > rejecting the new memory resource in __add_memory_resource(), we check whether > the memory resource can be added based on CMR and the TDX module initialization > status. This is feasible as with the latest public P-SEAMLDR spec, we can get > CMR from P-SEAMLDR SEAMCALL[3]. So we can detect P-SEAMLDR and get CMR info > during kernel boots. And in __add_memory_resource() we do below check: > > tdx_init_disable(); /*similar to cpu_hotplug_disable() */ > if (tdx_module_initialized()) > // reject memory hotplug > else if (new_memory_resource NOT in CMRs) > // reject memory hotplug > else > allow memory hotplug > tdx_init_enable(); /*similar to cpu_hotplug_enable() */ > > tdx_init_disable() temporarily disables TDX module initialization by trying to > grab the mutex. If the TDX module initialization is already on going, then it > waits until it completes. 
> > This should work better for future platforms, but would requires non-trivial > more code as we need to add VMXON/VMXOFF support to the core-kernel to detect > CMR using SEAMCALL. A side advantage is with VMXON in core-kernel we can > shutdown the TDX module in kexec(). > > But for this series I think the second approach is overkill and we can choose to > use the first simple approach? This still sounds like it is trying to solve symptoms and not the root problem. Why must the core-mm never have non-TDX memory when VMs are fine to operate with either core-mm pages or memory from other sources like hugetlbfs and device-dax?
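For reference, the "first approach" quoted above boils down to a single policy check in the hot-add path. A minimal sketch, assuming a platform_tdx_enabled() helper and a hook called from add_memory_resource() (both assumptions, not code from the posted series):

    #include <linux/errno.h>
    #include <linux/printk.h>
    #include <linux/types.h>

    /* Assumed helper: true when SEAMRR is enabled / TDX KeyIDs are configured. */
    bool platform_tdx_enabled(void);

    /*
     * Sketch of a check called from add_memory_resource(), the common path
     * shared by __add_memory() and add_memory_driver_managed().  Rejecting
     * hot-add here covers both ACPI memory hotplug and dax/kmem.
     */
    static int tdx_memory_add_precheck(u64 start, u64 size)
    {
            if (!platform_tdx_enabled())
                    return 0;

            pr_warn("TDX: rejecting memory hot-add [%#llx-%#llx]\n",
                    (unsigned long long)start,
                    (unsigned long long)(start + size - 1));
            return -EINVAL;
    }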
On Thu, 2022-04-28 at 20:04 -0700, Dan Williams wrote: > On Thu, Apr 28, 2022 at 6:40 PM Kai Huang <kai.huang@intel.com> wrote: > > > > On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote: > > > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote: > > > > On 4/27/22 17:37, Kai Huang wrote: > > > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote: > > > > > > In 5 years, if someone takes this code and runs it on Intel hardware > > > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens? > > > > > > > > > > I thought we could document this in the documentation saying that this code can > > > > > only work on TDX machines that don't have above capabilities (SPR for now). We > > > > > can change the code and the documentation when we add the support of those > > > > > features in the future, and update the documentation. > > > > > > > > > > If 5 years later someone takes this code, he/she should take a look at the > > > > > documentation and figure out that he/she should choose a newer kernel if the > > > > > machine support those features. > > > > > > > > > > I'll think about design solutions if above doesn't look good for you. > > > > > > > > No, it doesn't look good to me. > > > > > > > > You can't just say: > > > > > > > > /* > > > > * This code will eat puppies if used on systems with hotplug. > > > > */ > > > > > > > > and merrily await the puppy bloodbath. > > > > > > > > If it's not compatible, then you have to *MAKE* it not compatible in a > > > > safe, controlled way. > > > > > > > > > > You can't just ignore the problems because they're not present on one > > > > > > version of the hardware. > > > > > > > > Please, please read this again ^^ > > > > > > OK. I'll think about solutions and come back later. > > > > > > > > Hi Dave, > > > > I think we have two approaches to handle memory hotplug interaction with the TDX > > module initialization. > > > > The first approach is simple. We just block memory from being added as system > > RAM managed by page allocator when the platform supports TDX [1]. It seems we > > can add some arch-specific-check to __add_memory_resource() and reject the new > > memory resource if platform supports TDX. __add_memory_resource() is called by > > both __add_memory() and add_memory_driver_managed() so it prevents from adding > > NVDIMM as system RAM and normal ACPI memory hotplug [2]. > > What if the memory being added *is* TDX capable? What if someone > wanted to manage a memory range as soft-reserved and move it back and > forth from the core-mm to device access. That should be perfectly > acceptable as long as the memory is TDX capable. Please see below. > > > The second approach is relatively more complicated. Instead of directly > > rejecting the new memory resource in __add_memory_resource(), we check whether > > the memory resource can be added based on CMR and the TDX module initialization > > status. This is feasible as with the latest public P-SEAMLDR spec, we can get > > CMR from P-SEAMLDR SEAMCALL[3]. So we can detect P-SEAMLDR and get CMR info > > during kernel boots. 
And in __add_memory_resource() we do below check: > > tdx_init_disable(); /*similar to cpu_hotplug_disable() */ > > if (tdx_module_initialized()) > > // reject memory hotplug > > else if (new_memory_resource NOT in CMRs) > > // reject memory hotplug > > else > > allow memory hotplug > > tdx_init_enable(); /*similar to cpu_hotplug_enable() */ > > > > tdx_init_disable() temporarily disables TDX module initialization by trying to > > grab the mutex. If the TDX module initialization is already on going, then it > > waits until it completes. > > > > This should work better for future platforms, but would requires non-trivial > > more code as we need to add VMXON/VMXOFF support to the core-kernel to detect > > CMR using SEAMCALL. A side advantage is with VMXON in core-kernel we can > > shutdown the TDX module in kexec(). > > > > But for this series I think the second approach is overkill and we can choose to > > use the first simple approach? > > This still sounds like it is trying to solve symptoms and not the root > problem. Why must the core-mm never have non-TDX memory when VMs are > fine to operate with either core-mm pages or memory from other sources > like hugetlbfs and device-dax? Basically we don't want to modify the page allocator API to distinguish TDX and non-TDX allocations. For instance, we don't want a new GFP_TDX. There's another series done by Chao, "KVM: mm: fd-based approach for supporting KVM guest private memory", which essentially allows KVM to ask the guest memory backend to allocate pages w/o having to mmap() them to userspace. https://lore.kernel.org/kvm/20220310140911.50924-1-chao.p.peng@linux.intel.com/ More specifically, memfd will support a new MFD_INACCESSIBLE flag when it is created so all pages associated with this memfd will be TDX capable memory. The backend will need to implement a new memfile_notifier_ops to allow KVM to get and put the memory page. struct memfile_pfn_ops { long (*get_lock_pfn)(struct inode *inode, pgoff_t offset, int *order); void (*put_unlock_pfn)(unsigned long pfn); }; With that, it is the backend's responsibility to implement the get_lock_pfn() callback, in which the backend needs to ensure a TDX private page is allocated. For TD guests, KVM should be enforced to only use those fd-based backends. I am not sure whether anonymous pages should be supported anymore. Sean, please correct me if I am wrong? Currently only shmem is extended to support it. By ensuring pages in the page allocator are all TDX memory, shmem can be extended easily to support TD guests. If device-dax and hugetlbfs want to support TD guests then they should implement those callbacks and ensure only TDX memory is allocated. For instance, when a future TDX generation supports NVDIMM (i.e. NVDIMM is included in CMRs), then device-dax pages can be included as TDX memory when initializing the TDX module and device-dax can implement its own callbacks to support allocating pages for TD guests. But the TDX architecture can be changed to support memory hotplug in a more graceful way in the future. For instance, it can choose to support dynamically adding any convertible memory as TDX memory *after* TDX module initialization. But this is just my brainstorming. Anyway, for now, since only shmem (or + anonymous pages) can be used to create TD guests, I think we can just reject any memory hot-add when the platform supports TDX, as described in the first simple approach. Eventually we may need something like the second approach, but the TDX architecture can evolve too.
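To make the backend contract above concrete, here is a sketch of what a get_lock_pfn()/put_unlock_pfn() pair could look like for a backend that guarantees TDX-convertible pages. The tdx_is_convertible_pfn() helper and the exact allocation/locking details are assumptions based on the description above, not code from Chao's series.

    #include <linux/errno.h>
    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/pagemap.h>

    /* Callback table quoted above, from the fd-based private memory series. */
    struct memfile_pfn_ops {
            long (*get_lock_pfn)(struct inode *inode, pgoff_t offset, int *order);
            void (*put_unlock_pfn)(unsigned long pfn);
    };

    /* Assumed helper: true if the pfn falls inside a CMR / TDMR. */
    bool tdx_is_convertible_pfn(unsigned long pfn);

    static long tdx_backend_get_lock_pfn(struct inode *inode, pgoff_t offset,
                                         int *order)
    {
            struct page *page;

            /* Allocate or look up the backing page; it is returned locked. */
            page = find_or_create_page(inode->i_mapping, offset,
                                       mapping_gfp_mask(inode->i_mapping));
            if (!page)
                    return -ENOMEM;

            /* The backend, not KVM, guarantees the page is TDX convertible. */
            if (!tdx_is_convertible_pfn(page_to_pfn(page))) {
                    unlock_page(page);
                    put_page(page);
                    return -EIO;
            }

            *order = 0;
            return page_to_pfn(page);
    }

    static void tdx_backend_put_unlock_pfn(unsigned long pfn)
    {
            struct page *page = pfn_to_page(pfn);

            unlock_page(page);
            put_page(page);
    }

    static const struct memfile_pfn_ops tdx_backend_pfn_ops = {
            .get_lock_pfn   = tdx_backend_get_lock_pfn,
            .put_unlock_pfn = tdx_backend_put_unlock_pfn,
    };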
On Thu, 2022-04-28 at 19:58 -0700, Dan Williams wrote: > On Wed, Apr 27, 2022 at 6:21 PM Kai Huang <kai.huang@intel.com> wrote: > > > > On Wed, 2022-04-27 at 18:01 -0700, Dan Williams wrote: > > > On Tue, Apr 26, 2022 at 1:10 PM Dave Hansen <dave.hansen@intel.com> wrote: > > > [..] > > > > > 3. Memory hotplug > > > > > > > > > > The first generation of TDX architecturally doesn't support memory > > > > > hotplug. And the first generation of TDX-capable platforms don't support > > > > > physical memory hotplug. Since it physically cannot happen, this series > > > > > doesn't add any check in ACPI memory hotplug code path to disable it. > > > > > > > > > > A special case of memory hotplug is adding NVDIMM as system RAM using > > > > > > Saw "NVDIMM" mentioned while browsing this, so stopped to make a comment... > > > > > > > > kmem driver. However the first generation of TDX-capable platforms > > > > > cannot enable TDX and NVDIMM simultaneously, so in practice this cannot > > > > > happen either. > > > > > > > > What prevents this code from today's code being run on tomorrow's > > > > platforms and breaking these assumptions? > > > > > > The assumption is already broken today with NVDIMM-N. The lack of > > > DDR-T support on TDX enabled platforms has zero effect on DDR-based > > > persistent memory solutions. In other words, please describe the > > > actual software and hardware conflicts at play here, and do not make > > > the mistake of assuming that "no DDR-T support on TDX platforms" == > > > "no NVDIMM support". > > > > Sorry I got this information from planning team or execution team I guess. I was > > told NVDIMM and TDX cannot "co-exist" on the first generation of TDX capable > > machine. "co-exist" means they cannot be turned on simultaneously on the same > > platform. I am also not aware NVDIMM-N, nor the difference between DDR based > > and DDR-T based persistent memory. Could you give some more background here so > > I can take a look? > > My rough understanding is that TDX makes use of metadata communicated > "on the wire" for DDR, but that infrastructure is not there for DDR-T. > However, there are plenty of DDR based NVDIMMs that use super-caps / > batteries and flash to save contents. I believe the concern for TDX is > that the kernel needs to know not use TDX accepted PMEM as PMEM > because the contents saved by the DIMM's onboard energy source are > unreadable outside of a TD. > > Here is one of the links that comes up in a search for NVDIMM-N. > > https://www.snia.org/educational-library/what-you-can-do-nvdimm-n-and-nvdimm-p-2019 Thanks for the info. I need some more time to digest those different types of DDRs and NVDIMMs. However I guess they are not quite relevant since TDX has a concept of "Convertible Memory Region". Please see below. > > > > > > > > > > > Another case is admin can use 'memmap' kernel command line to create > > > > > legacy PMEMs and use them as TD guest memory, or theoretically, can use > > > > > kmem driver to add them as system RAM. To avoid having to change memory > > > > > hotplug code to prevent this from happening, this series always include > > > > > legacy PMEMs when constructing TDMRs so they are also TDX memory. > > > > > > I am not sure what you are trying to say here? > > > > We want to always make sure the memory managed by page allocator is TDX memory. > > That only seems possible if the kernel is given a TDX capable physical > address map at the beginning of time. Yes TDX architecture has a concept "Convertible Memory Region" (CMR). 
The memory used by TDX must be convertible memory. BIOS generates an array of CMR entries during boot and they are verified by MCHECK. CMRs are static during machine's runtime. > > > So if the legacy PMEMs are unconditionally configured as TDX memory, then we > > don't need to prevent them from being added as system memory via kmem driver. > > I think that is too narrow of a focus. > > Does a memory map exist for the physical address ranges that are TDX > capable? Please don't say EFI_MEMORY_CPU_CRYPTO, as that single bit is > ambiguous beyond the point of utility across the industry's entire > range of confidential computing memory capabilities. > > One strawman would be an ACPI table with contents like: > > struct acpi_protected_memory { > struct range range; > uuid_t platform_mem_crypto_capability; > }; > > With some way to map those uuids to a set of platform vendor specific > constraints and specifications. Some would be shared across > confidential computing vendors, some might be unique. Otherwise, I do > not see how you enforce the expectation of "all memory in the page > allocator is TDX capable". > Please see above. TDX has CMR. > The other alternative is that *none* of the > memory in the page allocator is TDX capable and a special memory > allocation device is used to map memory for TDs. In either case a map > of all possible TDX memory is needed and the discussion above seems > like an incomplete / "hopeful" proposal about the memory dax_kmem, or > other sources, might online. Yes we are also developing a new memfd based approach to support TD guest memory. Please see my another reply to you. > See the CXL CEDT CFWMS (CXL Fixed Memory > Window Structure) as an example of an ACPI table that sets the > kernel's expectations about how a physical address range might be > used. > > https://www.computeexpresslink.org/spec-landing Thanks for the info. I'll take a look to get some background. > > > > > > > > > > > 4. CPU hotplug > > > > > > > > > > The first generation of TDX architecturally doesn't support ACPI CPU > > > > > hotplug. All logical cpus are enabled by BIOS in MADT table. Also, the > > > > > first generation of TDX-capable platforms don't support ACPI CPU hotplug > > > > > either. Since this physically cannot happen, this series doesn't add any > > > > > check in ACPI CPU hotplug code path to disable it. > > > > > > What are the actual challenges posed to TDX with respect to CPU hotplug? > > > > During the TDX module initialization, there is a step to call SEAMCALL on all > > logical cpus to initialize per-cpu TDX staff. TDX doesn't support initializing > > the new hot-added CPUs after the initialization. There are MCHECK/BIOS changes > > to enforce this check too I guess but I don't know details about this. > > Is there an ACPI table that indicates CPU-x passed the check? Or since > the BIOS is invoked in the CPU-online path, is it trusted to suppress > those events for CPUs outside of the mcheck domain? No the TDX module (and the P-SEAMLDR) internally maintains some data to record the total number of LPs and packages, and which logical cpu has been initialized, etc. I asked Intel guys whether BIOS would suppress an ACPI CPU hotplug event but I never got a concrete answer. I'll try again. > > > > > > Also, only TDX module initialization requires all BIOS-enabled cpus are > > > > > > Please define "BIOS-enabled" cpus. There is no "BIOS-enabled" line in > > > /proc/cpuinfo for example. > > > > It means the CPUs with "enable" bit set in the MADT table. 
> > That just indicates to the present CPUs and then a hot add event > changes the state of now present CPUs to enabled. Per above is the > BIOS responsible for rejecting those new CPUs, or is the kernel? I'll ask the BIOS guys again to see whether the BIOS will suppress ACPI CPU hotplug events. But I think we can have a simple patch to reject ACPI CPU hotplug if the platform is TDX-capable? Or do you think we don't need to explicitly reject it if we can confirm with the BIOS guys that those events are suppressed on TDX-capable machines?
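For illustration, an explicit kernel-side rejection, independent of whatever the BIOS does, could look like the sketch below (the hook name and the platform_tdx_enabled() helper are assumptions):

    #include <linux/errno.h>
    #include <linux/printk.h>
    #include <linux/types.h>

    bool platform_tdx_enabled(void);    /* assumed helper, as elsewhere */

    /*
     * Sketch: reject ACPI CPU hot-add on TDX-capable platforms.  The TDX
     * module only runs its per-cpu initialization SEAMCALLs on the logical
     * CPUs that were enabled in the MADT at boot, so a hot-added CPU could
     * never be initialized for SEAM.
     */
    int tdx_cpu_hotadd_precheck(int cpu)
    {
            if (!platform_tdx_enabled())
                    return 0;

            pr_err("TDX: rejecting hot-add of CPU %d\n", cpu);
            return -EOPNOTSUPP;
    }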
On 4/28/22 19:58, Dan Williams wrote: > That only seems possible if the kernel is given a TDX capable physical > address map at the beginning of time. TDX actually brings along its own memory map. The "EAS"[1] has a lot of info on it, if you know where to find it. Here's the relevant chunk: CMR - Convertible Memory Range - A range of physical memory configured by BIOS and verified by MCHECK. MCHECK verification is intended to help ensure that a CMR may be used to hold TDX memory pages encrypted with a private HKID. So, the BIOS has the platform knowledge to enumerate this range. It stashes the information off somewhere that the TDX module can find it. Then, during OS boot, the OS makes a SEAMCALL (TDH.SYS.CONFIG) to the TDX module and gets the list of CMRs. The OS then has to reconcile this CMR "memory map" against the regular old BIOS-provided memory map, tossing out any memory regions which are RAM, but not covered by a CMR, or disabling TDX entirely. Fun, eh? I'm still grappling with how this series handles the policy of what memory to throw away when. 1. https://www.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1eas.pdf
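A sketch of that reconciliation step, i.e. testing a RAM range from the kernel's own memory map against the CMR list; the struct layout and helper name are assumptions based on the description above, not the actual series:

    #include <linux/types.h>

    /* Assumed representation of one CMR entry as enumerated from the TDX module. */
    struct cmr_info {
            u64 base;       /* physical base address */
            u64 size;       /* size in bytes; 0 marks an unused slot */
    };

    /*
     * Return true if [start, end) is fully covered by a single CMR.  RAM
     * ranges that fail this test either get tossed out of the TDX memory
     * configuration or cause TDX to be disabled, depending on policy.
     */
    static bool range_covered_by_cmr(const struct cmr_info *cmrs, int nr_cmrs,
                                     u64 start, u64 end)
    {
            int i;

            for (i = 0; i < nr_cmrs; i++) {
                    if (!cmrs[i].size)
                            continue;
                    if (start >= cmrs[i].base &&
                        end <= cmrs[i].base + cmrs[i].size)
                            return true;
            }
            return false;
    }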
On Fri, Apr 29, 2022 at 7:39 AM Dave Hansen <dave.hansen@intel.com> wrote: > > On 4/28/22 19:58, Dan Williams wrote: > > That only seems possible if the kernel is given a TDX capable physical > > address map at the beginning of time. > > TDX actually brings along its own memory map. The "EAS"[1] has a lot > of info on it, if you know where to find it. Here's the relevant chunk: > > CMR - Convertible Memory Range - > A range of physical memory configured by BIOS and verified by > MCHECK. MCHECK verification is intended to help ensure that a > CMR may be used to hold TDX memory pages encrypted with a > private HKID. > > So, the BIOS has the platform knowledge to enumerate this range. It > stashes the information off somewhere that the TDX module can find it. > Then, during OS boot, the OS makes a SEAMCALL (TDH.SYS.CONFIG) to the > TDX module and gets the list of CMRs. > > The OS then has to reconcile this CMR "memory map" against the regular > old BIOS-provided memory map, tossing out any memory regions which are > RAM, but not covered by a CMR, or disabling TDX entirely. > > Fun, eh? Yes, I want to challenge the idea that all core-mm memory must be TDX capable. Instead, this feels more like something that wants a hugetlbfs / dax-device like capability to ask the kernel to gather / set-aside the enumerated TDX memory out of all the general purpose memory it knows about and then VMs use that ABI to get access to convertible memory. Trying to ensure that all page allocator memory is TDX capable feels too restrictive with all the different ways pfns can get into the allocator.
On 4/29/22 08:18, Dan Williams wrote: > Yes, I want to challenge the idea that all core-mm memory must be TDX > capable. Instead, this feels more like something that wants a > hugetlbfs / dax-device like capability to ask the kernel to gather / > set-aside the enumerated TDX memory out of all the general purpose > memory it knows about and then VMs use that ABI to get access to > convertible memory. Trying to ensure that all page allocator memory is > TDX capable feels too restrictive with all the different ways pfns can > get into the allocator. The KVM users are the problem here. They use a variety of ABIs to get memory and then hand it to KVM. KVM basically just consumes the physical addresses from the page tables. Also, there's no _practical_ problem here today. I can't actually think of a case where any memory that ends up in the allocator on today's TDX systems is not TDX capable. Tomorrow's systems are going to be the problem. They'll (presumably) have a mix of CXL devices that will have varying capabilities. Some will surely lack the metadata storage for checksums and TD-owner bits. TDX use will be *safe* on those systems: if you take this code and run it on one tomorrow's systems, it will notice the TDX-incompatible memory and will disable TDX. The only way around this that I can see is to introduce ABI today that anticipates the needs of the future systems. We could require that all the KVM memory be "validated" before handing it to TDX. Maybe a new syscall that says: "make sure this mapping works for TDX". It could be new sysfs ABI which specifies which NUMA nodes contain TDX-capable memory. But, neither of those really help with, say, a device-DAX mapping of TDX-*IN*capable memory handed to KVM. The "new syscall" would just throw up its hands and leave users with the same result: TDX can't be used. The new sysfs ABI for NUMA nodes wouldn't clearly apply to device-DAX because they don't respect the NUMA policy ABI. I'm open to ideas here. If there's a viable ABI we can introduce to train TDX users today that will work tomorrow too, I'm all for it.
On Fri, Apr 29, 2022 at 10:18 AM Dave Hansen <dave.hansen@intel.com> wrote: > > On 4/29/22 08:18, Dan Williams wrote: > > Yes, I want to challenge the idea that all core-mm memory must be TDX > > capable. Instead, this feels more like something that wants a > > hugetlbfs / dax-device like capability to ask the kernel to gather / > > set-aside the enumerated TDX memory out of all the general purpose > > memory it knows about and then VMs use that ABI to get access to > > convertible memory. Trying to ensure that all page allocator memory is > > TDX capable feels too restrictive with all the different ways pfns can > > get into the allocator. > > The KVM users are the problem here. They use a variety of ABIs to get > memory and then hand it to KVM. KVM basically just consumes the > physical addresses from the page tables. > > Also, there's no _practical_ problem here today. I can't actually think > of a case where any memory that ends up in the allocator on today's TDX > systems is not TDX capable. > > Tomorrow's systems are going to be the problem. They'll (presumably) > have a mix of CXL devices that will have varying capabilities. Some > will surely lack the metadata storage for checksums and TD-owner bits. > TDX use will be *safe* on those systems: if you take this code and run > it on one tomorrow's systems, it will notice the TDX-incompatible memory > and will disable TDX. > > The only way around this that I can see is to introduce ABI today that > anticipates the needs of the future systems. We could require that all > the KVM memory be "validated" before handing it to TDX. Maybe a new > syscall that says: "make sure this mapping works for TDX". It could be > new sysfs ABI which specifies which NUMA nodes contain TDX-capable memory. Yes, node-id seems the only reasonable handle that can be used, and it does not seem too onerous for a KVM user to have to set a node policy preferring all the TDX / confidential-computing capable nodes. > But, neither of those really help with, say, a device-DAX mapping of > TDX-*IN*capable memory handed to KVM. The "new syscall" would just > throw up its hands and leave users with the same result: TDX can't be > used. The new sysfs ABI for NUMA nodes wouldn't clearly apply to > device-DAX because they don't respect the NUMA policy ABI. They do have "target_node" attributes to associate node specific metadata, and could certainly express target_node capabilities in its own ABI. Then it's just a matter of making pfn_to_nid() do the right thing so KVM kernel side can validate the capabilities of all inbound pfns. > I'm open to ideas here. If there's a viable ABI we can introduce to > train TDX users today that will work tomorrow too, I'm all for it. In general, expressing NUMA node perf and node capabilities is something Linux needs to get better at. HMAT data for example still exists as sideband information ignored by numactl, but it feels inevitable that perf and capability details become more of a first class citizen for applications that have these mem-allocation-policy constraints in the presence of disparate memory types.
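A sketch of the node-based validation being discussed: a nodemask of TDX-capable nodes, populated at init time from CMR / target_node information, that KVM could consult for every inbound pfn (all names here are assumptions):

    #include <linux/mm.h>
    #include <linux/nodemask.h>

    /* Assumed: filled at TDX init time from CMR / target_node information. */
    static nodemask_t tdx_capable_nodes;

    /*
     * Sketch: before mapping a pfn into a TD's private memory, check that it
     * lives on a node whose memory is known to be TDX convertible.
     */
    static bool pfn_is_tdx_capable(unsigned long pfn)
    {
            int nid = page_to_nid(pfn_to_page(pfn));        /* i.e. pfn_to_nid() */

            return node_isset(nid, tdx_capable_nodes);
    }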
On 4/29/22 10:48, Dan Williams wrote: >> But, neither of those really help with, say, a device-DAX mapping of >> TDX-*IN*capable memory handed to KVM. The "new syscall" would just >> throw up its hands and leave users with the same result: TDX can't be >> used. The new sysfs ABI for NUMA nodes wouldn't clearly apply to >> device-DAX because they don't respect the NUMA policy ABI. > They do have "target_node" attributes to associate node specific > metadata, and could certainly express target_node capabilities in its > own ABI. Then it's just a matter of making pfn_to_nid() do the right > thing so KVM kernel side can validate the capabilities of all inbound > pfns. Let's walk through how this would work with today's kernel on tomorrow's hardware, without KVM validating PFNs: 1. daxaddr mmap("/dev/dax1234") 2. kvmfd = open("/dev/kvm") 3. ioctl(KVM_SET_USER_MEMORY_REGION, { daxaddr }; 4. guest starts running 5. guest touches 'daxaddr' 6. Page fault handler maps 'daxaddr' 7. KVM finds new 'daxaddr' PTE 8. TDX code tries to add physical address to Secure-EPT 9. TDX "SEAMCALL" fails because page is not convertible 10. Guest dies All we can do to improve on that is call something that pledges to only map convertible memory at 'daxaddr'. We can't *actually* validate the physical addresses at mmap() time or even KVM_SET_USER_MEMORY_REGION-time because the memory might not have been allocated. Those pledges are hard for anonymous memory though. To fulfill the pledge, we not only have to validate that the NUMA policy is compatible at KVM_SET_USER_MEMORY_REGION, we also need to decline changes to the policy that might undermine the pledge.
On Fri, Apr 29, 2022 at 11:34 AM Dave Hansen <dave.hansen@intel.com> wrote: > > On 4/29/22 10:48, Dan Williams wrote: > >> But, neither of those really help with, say, a device-DAX mapping of > >> TDX-*IN*capable memory handed to KVM. The "new syscall" would just > >> throw up its hands and leave users with the same result: TDX can't be > >> used. The new sysfs ABI for NUMA nodes wouldn't clearly apply to > >> device-DAX because they don't respect the NUMA policy ABI. > > They do have "target_node" attributes to associate node specific > > metadata, and could certainly express target_node capabilities in its > > own ABI. Then it's just a matter of making pfn_to_nid() do the right > > thing so KVM kernel side can validate the capabilities of all inbound > > pfns. > > Let's walk through how this would work with today's kernel on tomorrow's > hardware, without KVM validating PFNs: > > 1. daxaddr mmap("/dev/dax1234") > 2. kvmfd = open("/dev/kvm") > 3. ioctl(KVM_SET_USER_MEMORY_REGION, { daxaddr }; At least for a file backed mapping the capability lookup could be done here, no need to wait for the fault. > 4. guest starts running > 5. guest touches 'daxaddr' > 6. Page fault handler maps 'daxaddr' > 7. KVM finds new 'daxaddr' PTE > 8. TDX code tries to add physical address to Secure-EPT > 9. TDX "SEAMCALL" fails because page is not convertible > 10. Guest dies > > All we can do to improve on that is call something that pledges to only > map convertible memory at 'daxaddr'. We can't *actually* validate the > physical addresses at mmap() time or even > KVM_SET_USER_MEMORY_REGION-time because the memory might not have been > allocated. > > Those pledges are hard for anonymous memory though. To fulfill the > pledge, we not only have to validate that the NUMA policy is compatible > at KVM_SET_USER_MEMORY_REGION, we also need to decline changes to the > policy that might undermine the pledge. I think it's less that the kernel needs to enforce a pledge and more that an interface is needed to communicate the guest death reason. I.e. "here is the impossible thing you asked for, next time set this policy to avoid this problem".
On 4/29/22 11:47, Dan Williams wrote: > On Fri, Apr 29, 2022 at 11:34 AM Dave Hansen <dave.hansen@intel.com> wrote: >> >> On 4/29/22 10:48, Dan Williams wrote: >>>> But, neither of those really help with, say, a device-DAX mapping of >>>> TDX-*IN*capable memory handed to KVM. The "new syscall" would just >>>> throw up its hands and leave users with the same result: TDX can't be >>>> used. The new sysfs ABI for NUMA nodes wouldn't clearly apply to >>>> device-DAX because they don't respect the NUMA policy ABI. >>> They do have "target_node" attributes to associate node specific >>> metadata, and could certainly express target_node capabilities in its >>> own ABI. Then it's just a matter of making pfn_to_nid() do the right >>> thing so KVM kernel side can validate the capabilities of all inbound >>> pfns. >> >> Let's walk through how this would work with today's kernel on tomorrow's >> hardware, without KVM validating PFNs: >> >> 1. daxaddr mmap("/dev/dax1234") >> 2. kvmfd = open("/dev/kvm") >> 3. ioctl(KVM_SET_USER_MEMORY_REGION, { daxaddr }; > > At least for a file backed mapping the capability lookup could be done > here, no need to wait for the fault. For DAX mappings, sure. But, anything that's backed by page cache, you can't know until the RAM is allocated. ... >> Those pledges are hard for anonymous memory though. To fulfill the >> pledge, we not only have to validate that the NUMA policy is compatible >> at KVM_SET_USER_MEMORY_REGION, we also need to decline changes to the >> policy that might undermine the pledge. > > I think it's less that the kernel needs to enforce a pledge and more > that an interface is needed to communicate the guest death reason. > I.e. "here is the impossible thing you asked for, next time set this > policy to avoid this problem". IF this code is booted on a system where non-TDX-capable memory is discovered, do we: 1. Disable TDX, printk() some nasty message, then boot as normal or, 2a. Boot normally with TDX enabled 2b. Add enhanced error messages in case of TDH.MEM.PAGE.AUG/ADD failure (the "SEAMCALLs" which are the last line of defense and will reject the request to add non-TDX-capable memory to a guest). Or maybe an even earlier message. For #1, if TDX is on, we are quite sure it will work. But, it will probably throw up its hands on tomorrow's hardware. (This patch set). For #2, TDX might break (guests get killed) at runtime on tomorrow's hardware, but it also might be just fine. Users might be able to work around things by, for instance, figuring out a NUMA policy which excludes TDX-incapable memory. (I think what Dan is looking for) Is that a fair summary?
On Fri, Apr 29, 2022 at 12:20 PM Dave Hansen <dave.hansen@intel.com> wrote: > > On 4/29/22 11:47, Dan Williams wrote: > > On Fri, Apr 29, 2022 at 11:34 AM Dave Hansen <dave.hansen@intel.com> wrote: > >> > >> On 4/29/22 10:48, Dan Williams wrote: > >>>> But, neither of those really help with, say, a device-DAX mapping of > >>>> TDX-*IN*capable memory handed to KVM. The "new syscall" would just > >>>> throw up its hands and leave users with the same result: TDX can't be > >>>> used. The new sysfs ABI for NUMA nodes wouldn't clearly apply to > >>>> device-DAX because they don't respect the NUMA policy ABI. > >>> They do have "target_node" attributes to associate node specific > >>> metadata, and could certainly express target_node capabilities in its > >>> own ABI. Then it's just a matter of making pfn_to_nid() do the right > >>> thing so KVM kernel side can validate the capabilities of all inbound > >>> pfns. > >> > >> Let's walk through how this would work with today's kernel on tomorrow's > >> hardware, without KVM validating PFNs: > >> > >> 1. daxaddr mmap("/dev/dax1234") > >> 2. kvmfd = open("/dev/kvm") > >> 3. ioctl(KVM_SET_USER_MEMORY_REGION, { daxaddr }; > > > > At least for a file backed mapping the capability lookup could be done > > here, no need to wait for the fault. > > For DAX mappings, sure. But, anything that's backed by page cache, you > can't know until the RAM is allocated. > > ... > >> Those pledges are hard for anonymous memory though. To fulfill the > >> pledge, we not only have to validate that the NUMA policy is compatible > >> at KVM_SET_USER_MEMORY_REGION, we also need to decline changes to the > >> policy that might undermine the pledge. > > > > I think it's less that the kernel needs to enforce a pledge and more > > that an interface is needed to communicate the guest death reason. > > I.e. "here is the impossible thing you asked for, next time set this > > policy to avoid this problem". > > IF this code is booted on a system where non-TDX-capable memory is > discovered, do we: > 1. Disable TDX, printk() some nasty message, then boot as normal > or, > 2a. Boot normally with TDX enabled > 2b. Add enhanced error messages in case of TDH.MEM.PAGE.AUG/ADD failure > (the "SEAMCALLs" which are the last line of defense and will reject > the request to add non-TDX-capable memory to a guest). Or maybe > an even earlier message. > > For #1, if TDX is on, we are quite sure it will work. But, it will > probably throw up its hands on tomorrow's hardware. (This patch set). > > For #2, TDX might break (guests get killed) at runtime on tomorrow's > hardware, but it also might be just fine. Users might be able to work > around things by, for instance, figuring out a NUMA policy which > excludes TDX-incapable memory. (I think what Dan is looking for) > > Is that a fair summary? Yes, just the option for TDX and non-TDX to live alongside each other... although in the past I have argued to do option-1 and enforce it at the lowest level [1]. Like platform BIOS is responsible to disable CXL if CXL support for a given CPU security feature is missing. However, I think end users will want to have their confidential computing and capacity too. As long as that is not precluded to be added after the fact, option-1 can be a way forward until a concrete user for mixed mode shows up. Is there something already like this today for people that, for example, attempt to use PCI BAR mappings as memory? Or does KVM simply allow for garbage-in garbage-out? 
In the end the patches shouldn't talk about whether or not PMEM is supported on a platform or not, that's irrelevant. What matters is that misconfigurations can happen, should be rare to non-existent on current platforms, and if it becomes a problem the kernel can grow ABI to let userspace enumerate the conflicts. [1]: https://lore.kernel.org/linux-cxl/CAPcyv4jMQbHYQssaDDDQFEbOR1v14VUnejcSwOP9VGUnZSsCKw@mail.gmail.com/
On 4/29/22 14:20, Dan Williams wrote: > Is there something already like this today for people that, for > example, attempt to use PCI BAR mappings as memory? Or does KVM simply > allow for garbage-in garbage-out? I'm just guessing, but I _assume_ those garbage PCI BAR mappings are how KVM does device passthrough. I know that some KVM users even use mem= to chop down the kernel-owned 'struct page'-backed memory, then have a kind of /dev/mem driver to let the memory get mapped back into userspace. KVM is happy to pass through those mappings.
On Fri, 2022-04-29 at 11:34 -0700, Dave Hansen wrote: > On 4/29/22 10:48, Dan Williams wrote: > > > But, neither of those really help with, say, a device-DAX mapping of > > > TDX-*IN*capable memory handed to KVM. The "new syscall" would just > > > throw up its hands and leave users with the same result: TDX can't be > > > used. The new sysfs ABI for NUMA nodes wouldn't clearly apply to > > > device-DAX because they don't respect the NUMA policy ABI. > > They do have "target_node" attributes to associate node specific > > metadata, and could certainly express target_node capabilities in its > > own ABI. Then it's just a matter of making pfn_to_nid() do the right > > thing so KVM kernel side can validate the capabilities of all inbound > > pfns. > > Let's walk through how this would work with today's kernel on tomorrow's > hardware, without KVM validating PFNs: > > 1. daxaddr mmap("/dev/dax1234") > 2. kvmfd = open("/dev/kvm") > 3. ioctl(KVM_SET_USER_MEMORY_REGION, { daxaddr }; > 4. guest starts running > 5. guest touches 'daxaddr' > 6. Page fault handler maps 'daxaddr' > 7. KVM finds new 'daxaddr' PTE > 8. TDX code tries to add physical address to Secure-EPT > 9. TDX "SEAMCALL" fails because page is not convertible > 10. Guest dies > > All we can do to improve on that is call something that pledges to only > map convertible memory at 'daxaddr'. We can't *actually* validate the > physical addresses at mmap() time or even > KVM_SET_USER_MEMORY_REGION-time because the memory might not have been > allocated. > > Those pledges are hard for anonymous memory though. To fulfill the > pledge, we not only have to validate that the NUMA policy is compatible > at KVM_SET_USER_MEMORY_REGION, we also need to decline changes to the > policy that might undermine the pledge. Hi Dave, There's another series done by Chao "KVM: mm: fd-based approach for supporting KVM guest private memory" which essentially allows KVM to ask guest memory backend to allocate page w/o having to mmap() to userspace. Please see my reply below: https://lore.kernel.org/lkml/cover.1649219184.git.kai.huang@intel.com/T/#mf9bf10a63eaaf0968c46ab33bdaf06bd2cfdd948 My understanding is for TDX guest KVM will be enforced to use the new mechanism. So when TDX supports NVDIMM in the future, dax can be extended to support the new mechanism to support using it as TD guest backend. Sean, Paolo, Isaku, Chao, Please correct me if I am wrong?
On Fri, 2022-04-29 at 13:40 +1200, Kai Huang wrote: > On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote: > > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote: > > > On 4/27/22 17:37, Kai Huang wrote: > > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote: > > > > > In 5 years, if someone takes this code and runs it on Intel hardware > > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens? > > > > > > > > I thought we could document this in the documentation saying that this code can > > > > only work on TDX machines that don't have above capabilities (SPR for now). We > > > > can change the code and the documentation when we add the support of those > > > > features in the future, and update the documentation. > > > > > > > > If 5 years later someone takes this code, he/she should take a look at the > > > > documentation and figure out that he/she should choose a newer kernel if the > > > > machine support those features. > > > > > > > > I'll think about design solutions if above doesn't look good for you. > > > > > > No, it doesn't look good to me. > > > > > > You can't just say: > > > > > > /* > > > * This code will eat puppies if used on systems with hotplug. > > > */ > > > > > > and merrily await the puppy bloodbath. > > > > > > If it's not compatible, then you have to *MAKE* it not compatible in a > > > safe, controlled way. > > > > > > > > You can't just ignore the problems because they're not present on one > > > > > version of the hardware. > > > > > > Please, please read this again ^^ > > > > OK. I'll think about solutions and come back later. > > > > > Hi Dave, > > I think we have two approaches to handle memory hotplug interaction with the TDX > module initialization. > > The first approach is simple. We just block memory from being added as system > RAM managed by page allocator when the platform supports TDX [1]. It seems we > can add some arch-specific-check to __add_memory_resource() and reject the new > memory resource if platform supports TDX. __add_memory_resource() is called by > both __add_memory() and add_memory_driver_managed() so it prevents from adding > NVDIMM as system RAM and normal ACPI memory hotplug [2]. Hi Dave, Try to close how to handle memory hotplug. Any comments to below? For the first approach, I forgot to think about memory hot-remove case. If we just reject adding new memory resource when TDX is capable on the platform, then if the memory is hot-removed, we won't be able to add it back. My thinking is we still want to support memory online/offline because it is purely in software but has nothing to do with TDX. But if one memory resource can be put to offline, it seems we don't have any way to prevent it from being removed. So if we do above, on the future platforms when memory hotplug can co-exist with TDX, ACPI hot-add and kmem-hot-add memory will be prevented. However if some memory is hot-removed, it won't be able to be added back (even it is included in CMR, or TDMRs after TDX module is initialized). Is this behavior acceptable? Or perhaps I have misunderstanding? The second approach will behave more nicely, but I don't know whether it is worth to do it now. Btw, below logic when adding a new memory resource has a minor problem, please see below... > > The second approach is relatively more complicated. Instead of directly > rejecting the new memory resource in __add_memory_resource(), we check whether > the memory resource can be added based on CMR and the TDX module initialization > status. 
This is feasible as with the latest public P-SEAMLDR spec, we can get > CMR from P-SEAMLDR SEAMCALL[3]. So we can detect P-SEAMLDR and get CMR info > during kernel boots. And in __add_memory_resource() we do below check: > > tdx_init_disable(); /*similar to cpu_hotplug_disable() */ > if (tdx_module_initialized()) > // reject memory hotplug > else if (new_memory_resource NOT in CMRs) > // reject memory hotplug > else > allow memory hotplug > tdx_init_enable(); /*similar to cpu_hotplug_enable() */ ... Should be: /* prevent racing with TDX module initialization */ tdx_init_disable(); if (tdx_module_initialized()) { if (new_memory_resource in TDMRs) // allow memory hot-add else // reject memory hot-add } else if (new_memory_resource in CMR) { // add new memory to TDX memory so it can be // included into TDMRs // allow memory hot-add } else // reject memory hot-add tdx_init_enable(); And when the platform doesn't support TDX, always allow memory hot-add. > > tdx_init_disable() temporarily disables TDX module initialization by trying to > grab the mutex. If the TDX module initialization is already on going, then it > waits until it completes. > > This should work better for future platforms, but would requires non-trivial > more code as we need to add VMXON/VMXOFF support to the core-kernel to detect > CMR using SEAMCALL. A side advantage is with VMXON in core-kernel we can > shutdown the TDX module in kexec(). > > But for this series I think the second approach is overkill and we can choose to > use the first simple approach? > > Any suggestions? > > [1] Platform supports TDX means SEAMRR is enabled, and there are at least 2 TDX > keyIDs. Or we can just check SEAMRR is enabled, as in practice a SEAMRR is > enabled means the machine is TDX-capable, and for now a TDX-capable machine > doesn't support ACPI memory hotplug. > > [2] It prevents adding legacy PMEM as system RAM too but I think it's fine. If > user wants legacy PMEM then it is unlikely user will add it back and use as > system RAM. User is unlikely to use legacy PMEM as TD guest memory directly as > TD guests is likely to use a new memfd backend which allows private page not > accessible from usrspace, so in this way we can exclude legacy PMEM from TDMRs. > > [3] Please refer to SEAMLDR.SEAMINFO SEAMCALL in latest P-SEAMLDR spec: > https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf > > >
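Rendered as C, the corrected flow above might look like the sketch below; every helper is a placeholder for logic described in this thread, not a function from the posted series:

    #include <linux/errno.h>
    #include <linux/types.h>

    /* Placeholder helpers for state described earlier in this thread. */
    void tdx_init_disable(void);                    /* serialize against module init */
    void tdx_init_enable(void);
    bool tdx_module_initialized(void);
    bool range_in_tdmrs(u64 start, u64 end);
    bool range_in_cmrs(u64 start, u64 end);
    int tdx_add_memory_range(u64 start, u64 end);   /* record for future TDMRs */

    static int tdx_memory_hotadd_check(u64 start, u64 end)
    {
            int ret = 0;

            /* Prevent racing with TDX module initialization. */
            tdx_init_disable();

            if (tdx_module_initialized()) {
                    /* After init, only memory already covered by TDMRs works. */
                    if (!range_in_tdmrs(start, end))
                            ret = -EINVAL;
            } else if (range_in_cmrs(start, end)) {
                    /* Record it so it is included when TDMRs are constructed. */
                    ret = tdx_add_memory_range(start, end);
            } else {
                    ret = -EINVAL;
            }

            tdx_init_enable();
            return ret;
    }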
On 5/3/22 16:59, Kai Huang wrote: > Should be: > > /* prevent racing with TDX module initialization */ > tdx_init_disable(); > > if (tdx_module_initialized()) { > if (new_memory_resource in TDMRs) > // allow memory hot-add > else > // reject memory hot-add > } else if (new_memory_resource in CMR) { > // add new memory to TDX memory so it can be > // included into TDMRs > > // allow memory hot-add > } > else > // reject memory hot-add > > tdx_init_enable(); > > And when the platform doesn't support TDX, always allow memory hot-add. I don't think it even needs to be *that* complicated. It could just be winner take all: if TDX is initialized first, don't allow memory hotplug. If memory hotplug happens first, don't allow TDX to be initialized. That's fine at least for a minimal patch set. What you have up above is probably where you want to go eventually, but it means doing things like augmenting the e820 since it's the single source of truth for creating the TDMRs right now.
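The "winner take all" scheme is simpler still; a minimal sketch, with all names assumed:

    #include <linux/errno.h>
    #include <linux/mutex.h>
    #include <linux/types.h>

    static DEFINE_MUTEX(tdx_winner_mutex);
    static bool tdx_module_claimed;
    static bool memory_hotplug_done;

    /* Called from the memory hot-add path: loses if TDX initialized first. */
    int tdx_notify_memory_hotadd(void)
    {
            int ret = 0;

            mutex_lock(&tdx_winner_mutex);
            if (tdx_module_claimed)
                    ret = -EBUSY;
            else
                    memory_hotplug_done = true;
            mutex_unlock(&tdx_winner_mutex);

            return ret;
    }

    /* Called before TDX module initialization: loses if hotplug happened first. */
    int tdx_claim_init(void)
    {
            int ret = 0;

            mutex_lock(&tdx_winner_mutex);
            if (memory_hotplug_done)
                    ret = -EBUSY;
            else
                    tdx_module_claimed = true;
            mutex_unlock(&tdx_winner_mutex);

            return ret;
    }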
On Tue, 2022-05-03 at 17:25 -0700, Dave Hansen wrote: > On 5/3/22 16:59, Kai Huang wrote: > > Should be: > > > > /* prevent racing with TDX module initialization */ > > tdx_init_disable(); > > > > if (tdx_module_initialized()) { > > if (new_memory_resource in TDMRs) > > // allow memory hot-add > > else > > // reject memory hot-add > > } else if (new_memory_resource in CMR) { > > // add new memory to TDX memory so it can be > > // included into TDMRs > > > > // allow memory hot-add > > } > > else > > // reject memory hot-add > > > > tdx_init_enable(); > > > > And when the platform doesn't support TDX, always allow memory hot-add. > > I don't think it even needs to be *that* complicated. > > It could just be winner take all: if TDX is initialized first, don't > allow memory hotplug. If memory hotplug happens first, don't allow TDX > to be initialized. > > That's fine at least for a minimal patch set. OK. This should also work. We will need tdx_init_disable() which grabs the mutex to prevent TDX module initialization from running concurrently, and to disable TDX module initialization once and for all. > > What you have up above is probably where you want to go eventually, but > it means doing things like augmenting the e820 since it's the single > source of truth for creating the TDMRs right now. > Yes. But in this case, I am thinking we should probably switch from consulting the e820 to consulting memblock. The advantage of using e820 is it's easy to include legacy PMEM as TDX memory, but the disadvantage is (as you can see in the e820_for_each_mem() loop) I am actually merging contiguous RAM entries of different types in order to be consistent with the behavior of e820_memblock_setup(). This is not nice. If memory hot-add and TDX can have only one winner, legacy PMEM actually won't be used as TDX memory anyway now. The reason is TD guests will very likely need to use the new fd-based backend (see my reply in other emails), not just some random backend. To me it's totally fine to not support using legacy PMEM directly as a TD guest backend (and if we create a TD with a real NVDIMM as the backend using dax, the TD cannot be created anyway). Given we cannot kmem-hot-add legacy PMEM back as system RAM, to me it's pointless to include legacy PMEM into TDMRs. In this case, we can just create TDMRs based on memblock directly. One problem is that memblock will be gone after the kernel boots, but this can be solved either by keeping the memblock data around, or by building the TDX memory early while memblock is still alive. Btw, as it's likely we will need to support different sources of TDX memory (CXL memory, etc.), I think eventually we will need some data structures to represent a TDX memory block and APIs to add those blocks to the whole TDX memory, so that TDX memory ranges from different sources can be added before initializing the TDX module. struct tdx_memblock { struct list_head list; phys_addr_t start; phys_addr_t end; int nid; ... }; struct tdx_memory { struct list_head tmb_list; ... }; int tdx_memory_add_memblock(start, end, nid, ...); And the TDMRs can be created based on 'struct tdx_memory'. For now, we only need to add memblock to TDX memory. Any comments?
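Fleshing out the proposed structures slightly, tdx_memory_add_memblock() could keep the list sorted by physical address and reject overlaps. This is still brainstorming-level code matching the types sketched above, not anything merged:

    #include <linux/errno.h>
    #include <linux/list.h>
    #include <linux/slab.h>
    #include <linux/types.h>

    struct tdx_memblock {
            struct list_head list;
            phys_addr_t start;
            phys_addr_t end;
            int nid;
    };

    struct tdx_memory {
            struct list_head tmb_list;      /* sorted by physical address */
    };

    int tdx_memory_add_memblock(struct tdx_memory *tmem, phys_addr_t start,
                                phys_addr_t end, int nid)
    {
            struct list_head *insert_before = &tmem->tmb_list;
            struct tdx_memblock *tmb, *pos;

            tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
            if (!tmb)
                    return -ENOMEM;

            tmb->start = start;
            tmb->end = end;
            tmb->nid = nid;

            /* Keep the list sorted and reject overlapping ranges. */
            list_for_each_entry(pos, &tmem->tmb_list, list) {
                    if (end <= pos->start) {
                            insert_before = &pos->list;
                            break;
                    }
                    if (start < pos->end) {
                            kfree(tmb);
                            return -EINVAL;
                    }
            }
            list_add_tail(&tmb->list, insert_before);

            return 0;
    }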
On Tue, May 3, 2022 at 4:59 PM Kai Huang <kai.huang@intel.com> wrote: > > On Fri, 2022-04-29 at 13:40 +1200, Kai Huang wrote: > > On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote: > > > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote: > > > > On 4/27/22 17:37, Kai Huang wrote: > > > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote: > > > > > > In 5 years, if someone takes this code and runs it on Intel hardware > > > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens? > > > > > > > > > > I thought we could document this in the documentation saying that this code can > > > > > only work on TDX machines that don't have above capabilities (SPR for now). We > > > > > can change the code and the documentation when we add the support of those > > > > > features in the future, and update the documentation. > > > > > > > > > > If 5 years later someone takes this code, he/she should take a look at the > > > > > documentation and figure out that he/she should choose a newer kernel if the > > > > > machine support those features. > > > > > > > > > > I'll think about design solutions if above doesn't look good for you. > > > > > > > > No, it doesn't look good to me. > > > > > > > > You can't just say: > > > > > > > > /* > > > > * This code will eat puppies if used on systems with hotplug. > > > > */ > > > > > > > > and merrily await the puppy bloodbath. > > > > > > > > If it's not compatible, then you have to *MAKE* it not compatible in a > > > > safe, controlled way. > > > > > > > > > > You can't just ignore the problems because they're not present on one > > > > > > version of the hardware. > > > > > > > > Please, please read this again ^^ > > > > > > OK. I'll think about solutions and come back later. > > > > > > > > Hi Dave, > > > > I think we have two approaches to handle memory hotplug interaction with the TDX > > module initialization. > > > > The first approach is simple. We just block memory from being added as system > > RAM managed by page allocator when the platform supports TDX [1]. It seems we > > can add some arch-specific-check to __add_memory_resource() and reject the new > > memory resource if platform supports TDX. __add_memory_resource() is called by > > both __add_memory() and add_memory_driver_managed() so it prevents from adding > > NVDIMM as system RAM and normal ACPI memory hotplug [2]. > > Hi Dave, > > Try to close how to handle memory hotplug. Any comments to below? > > For the first approach, I forgot to think about memory hot-remove case. If we > just reject adding new memory resource when TDX is capable on the platform, then > if the memory is hot-removed, we won't be able to add it back. My thinking is > we still want to support memory online/offline because it is purely in software > but has nothing to do with TDX. But if one memory resource can be put to > offline, it seems we don't have any way to prevent it from being removed. > > So if we do above, on the future platforms when memory hotplug can co-exist with > TDX, ACPI hot-add and kmem-hot-add memory will be prevented. However if some > memory is hot-removed, it won't be able to be added back (even it is included in > CMR, or TDMRs after TDX module is initialized). > > Is this behavior acceptable? Or perhaps I have misunderstanding? Memory online at boot uses similar kernel paths as memory-online at run time, so it sounds like your question is confusing physical vs logical remove. 
Consider the case of logical offline then re-online where the proposed TDX sanity check blocks the memory online, but then a new kernel is kexec'd and that kernel trusts the memory as TD convertible again just because it onlines the memory in the boot path. For physical memory removal it seems up to the platform to block that if it conflicts with TDX, not for the kernel to add extra assumptions that logical offline / online is incompatible with TDX.
On Wed, 2022-05-04 at 07:31 -0700, Dan Williams wrote: > On Tue, May 3, 2022 at 4:59 PM Kai Huang <kai.huang@intel.com> wrote: > > > > On Fri, 2022-04-29 at 13:40 +1200, Kai Huang wrote: > > > On Thu, 2022-04-28 at 12:58 +1200, Kai Huang wrote: > > > > On Wed, 2022-04-27 at 17:50 -0700, Dave Hansen wrote: > > > > > On 4/27/22 17:37, Kai Huang wrote: > > > > > > On Wed, 2022-04-27 at 14:59 -0700, Dave Hansen wrote: > > > > > > > In 5 years, if someone takes this code and runs it on Intel hardware > > > > > > > with memory hotplug, CPU hotplug, NVDIMMs *AND* TDX support, what happens? > > > > > > > > > > > > I thought we could document this in the documentation saying that this code can > > > > > > only work on TDX machines that don't have above capabilities (SPR for now). We > > > > > > can change the code and the documentation when we add the support of those > > > > > > features in the future, and update the documentation. > > > > > > > > > > > > If 5 years later someone takes this code, he/she should take a look at the > > > > > > documentation and figure out that he/she should choose a newer kernel if the > > > > > > machine support those features. > > > > > > > > > > > > I'll think about design solutions if above doesn't look good for you. > > > > > > > > > > No, it doesn't look good to me. > > > > > > > > > > You can't just say: > > > > > > > > > > /* > > > > > * This code will eat puppies if used on systems with hotplug. > > > > > */ > > > > > > > > > > and merrily await the puppy bloodbath. > > > > > > > > > > If it's not compatible, then you have to *MAKE* it not compatible in a > > > > > safe, controlled way. > > > > > > > > > > > > You can't just ignore the problems because they're not present on one > > > > > > > version of the hardware. > > > > > > > > > > Please, please read this again ^^ > > > > > > > > OK. I'll think about solutions and come back later. > > > > > > > > > > > Hi Dave, > > > > > > I think we have two approaches to handle memory hotplug interaction with the TDX > > > module initialization. > > > > > > The first approach is simple. We just block memory from being added as system > > > RAM managed by page allocator when the platform supports TDX [1]. It seems we > > > can add some arch-specific-check to __add_memory_resource() and reject the new > > > memory resource if platform supports TDX. __add_memory_resource() is called by > > > both __add_memory() and add_memory_driver_managed() so it prevents from adding > > > NVDIMM as system RAM and normal ACPI memory hotplug [2]. > > > > Hi Dave, > > > > Try to close how to handle memory hotplug. Any comments to below? > > > > For the first approach, I forgot to think about memory hot-remove case. If we > > just reject adding new memory resource when TDX is capable on the platform, then > > if the memory is hot-removed, we won't be able to add it back. My thinking is > > we still want to support memory online/offline because it is purely in software > > but has nothing to do with TDX. But if one memory resource can be put to > > offline, it seems we don't have any way to prevent it from being removed. > > > > So if we do above, on the future platforms when memory hotplug can co-exist with > > TDX, ACPI hot-add and kmem-hot-add memory will be prevented. However if some > > memory is hot-removed, it won't be able to be added back (even it is included in > > CMR, or TDMRs after TDX module is initialized). > > > > Is this behavior acceptable? Or perhaps I have misunderstanding? 
> > Memory online at boot uses similar kernel paths as memory-online at > run time, so it sounds like your question is confusing physical vs > logical remove. Consider the case of logical offline then re-online > where the proposed TDX sanity check blocks the memory online, but then > a new kernel is kexec'd and that kernel again trusts the memory as TD > convertible again just because it onlines the memory in the boot path. > For physical memory remove it seems up to the platform to block that > if it conflicts with TDX, not for the kernel to add extra assumptions > that logical offline / online is incompatible with TDX. Hi Dan, No, we don't block memory online; we block memory add. The code I mentioned is add_memory_resource(), while the memory online code path is memory_block_online(). Or am I wrong?
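To make the distinction above concrete, a minimal sketch of the check being discussed, rejecting new memory at add_memory_resource() time while leaving memory_block_online()/offline alone, could look as follows. The platform_tdx_enabled() helper and the hook name are assumptions for illustration, not the actual patch code:

    /* Illustrative sketch only -- not the real patch. */
    #include <linux/errno.h>
    #include <linux/printk.h>
    #include <linux/types.h>

    /* Assumed helper: true if BIOS has enabled TDX on this platform. */
    bool platform_tdx_enabled(void);

    /*
     * Hypothetical hook called from add_memory_resource() before a new
     * memory resource becomes system RAM.  Logical online/offline of
     * already-present blocks goes through memory_block_online() and is
     * deliberately left untouched.
     */
    int tdx_validate_memory_add(u64 start, u64 size)
    {
    	if (!platform_tdx_enabled())
    		return 0;

    	/* Newly added RAM cannot be assumed to be TDX convertible. */
    	pr_warn("rejecting memory hot-add [%#llx-%#llx]: TDX is enabled\n",
    		start, start + size - 1);
    	return -EINVAL;
    }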
On Wed, 2022-05-04 at 13:15 +1200, Kai Huang wrote: > On Tue, 2022-05-03 at 17:25 -0700, Dave Hansen wrote: > > On 5/3/22 16:59, Kai Huang wrote: > > > Should be: > > > > > > // prevent racing with TDX module initialization */ > > > tdx_init_disable(); > > > > > > if (tdx_module_initialized()) { > > > if (new_memory_resource in TDMRs) > > > // allow memory hot-add > > > else > > > // reject memory hot-add > > > } else if (new_memory_resource in CMR) { > > > // add new memory to TDX memory so it can be > > > // included into TDMRs > > > > > > // allow memory hot-add > > > } > > > else > > > // reject memory hot-add > > > > > > tdx_module_enable(); > > > > > > And when platform doesn't TDX, always allow memory hot-add. > > > > I don't think it even needs to be *that* complicated. > > > > It could just be winner take all: if TDX is initialized first, don't > > allow memory hotplug. If memory hotplug happens first, don't allow TDX > > to be initialized. > > > > That's fine at least for a minimal patch set. > > OK. This should also work. > > We will need tdx_init_disable() which grabs the mutex to prevent TDX module > initialization from running concurrently, and to disable TDX module > initialization once for all. > > > > > > What you have up above is probably where you want to go eventually, but > > it means doing things like augmenting the e820 since it's the single > > source of truth for creating the TMDRs right now. > > > > Yes. But in this case, I am thinking about probably we should switch from > consulting e820 to consulting memblock. The advantage of using e820 is it's > easy to include legacy PMEM as TDX memory, but the disadvantage is (as you can > see in e820_for_each_mem() loop) I am actually merging contiguous different > types of RAM entries in order to be consistent with the behavior of > e820_memblock_setup(). This is not nice. > > If memory hot-add and TDX can only be one winner, legacy PMEM actually won't be > used as TDX memory anyway now. The reason is TDX guest will very likely needing > to use the new fd-based backend (see my reply in other emails), but not just > some random backend. To me it's totally fine to not support using legacy PMEM > directly as TD guest backend (and if we create a TD with real NVDIMM as backend > using dax, the TD cannot be created anyway). Given we cannot kmem-hot-add > legacy PMEM back as system RAM, to me it's pointless to include legacy PMEM into > TDMRs. > > In this case, we can just create TDMRs based on memblock directly. One problem > is memblock will be gone after kernel boots, but this can be solved either by > keeping the memblock, or build the TDX memory early when memblock is still > alive. > > Btw, eventually, as it's likely we need to support different source of TDX > memory (CLX memory, etc), I think eventually we will need some data structures > to represent TDX memory block and APIs to add those blocks to the whole TDX > memory so those TDX memory ranges from different source can be added before > initializing the TDX module. > > struct tdx_memblock { > struct list_head list; > phys_addr_t start; > phys_addr_t end; > int nid; > ... > }; > > struct tdx_memory { > struct list_head tmb_list; > ... > }; > > int tdx_memory_add_memblock(start, end, nid, ...); > > And the TDMRs can be created based on 'struct tdx_memory'. > > For now, we only need to add memblock to TDX memory. > > Any comments? > Hi Dave, Sorry to ping (trying to close this). 
Given we don't need to consider kmem hot-add of legacy PMEM after TDX module initialization, I think for now it's totally fine to exclude legacy PMEMs from TDMRs. The worst case is that if someone tries to use them directly as TD guest backend, the TD will fail to create. IMO that's acceptable, as no one is supposed to use some random backend to run a TD anyway. I think, without needing to include legacy PMEM, it's better to get all TDX memory blocks based on memblock rather than e820. The pages managed by the page allocator come from memblock anyway (excluding those from memory hotplug). And I also think it makes more sense to introduce 'tdx_memblock' and 'tdx_memory' data structures to gather all TDX memory blocks during boot while memblock is still alive. When the TDX module is initialized at runtime, TDMRs can be created based on the 'struct tdx_memory' which contains all the TDX memory blocks we gathered from memblock during boot. This is also more flexible for supporting TDX memory from other sources, such as CXL memory, in the future. Please let me know if you have any objection. Thanks!
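As a rough sketch of the memblock-based gathering step described above, reusing the 'struct tdx_memblock'/'struct tdx_memory' names from the quoted mail and eliding the nid lookup and error unwinding, it might look like this; it is illustrative only, not code from the series:

    /* Sketch: gather TDX memory blocks from memblock while it is alive. */
    #include <linux/list.h>
    #include <linux/memblock.h>
    #include <linux/numa.h>
    #include <linux/slab.h>

    struct tdx_memblock {
    	struct list_head list;
    	phys_addr_t start;
    	phys_addr_t end;
    	int nid;
    };

    struct tdx_memory {
    	struct list_head tmb_list;
    };

    static struct tdx_memory tmem = {
    	.tmb_list = LIST_HEAD_INIT(tmem.tmb_list),
    };

    /* Called once during boot, before memblock is discarded. */
    static int __init tdx_gather_memory_blocks(void)
    {
    	phys_addr_t start, end;
    	u64 i;

    	for_each_mem_range(i, &start, &end) {
    		struct tdx_memblock *tmb = kzalloc(sizeof(*tmb), GFP_KERNEL);

    		if (!tmb)
    			return -ENOMEM;
    		tmb->start = start;
    		tmb->end = end;
    		tmb->nid = NUMA_NO_NODE;	/* nid lookup elided */
    		list_add_tail(&tmb->list, &tmem.tmb_list);
    	}
    	return 0;
    }

A construct_tdmrs()-style routine would then walk tmem.tmb_list at TDX-module-initialization time instead of consulting e820.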
[ add Mike ] On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote: [..] > > Hi Dave, > > Sorry to ping (trying to close this). > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module > initialization, I think for now it's totally fine to exclude legacy PMEMs from > TDMRs. The worst case is when someone tries to use them as TD guest backend > directly, the TD will fail to create. IMO it's acceptable, as it is supposedly > that no one should just use some random backend to run TD. The platform will already do this, right? I don't understand why this is trying to take proactive action versus documenting the error conditions and steps someone needs to take to avoid unconvertible memory. There is already the CONFIG_HMEM_REPORTING that describes relative performance properties between initiators and targets; it seems fitting to also add security properties between initiators and targets so someone can enumerate the numa-mempolicy that avoids unconvertible memory. No special casing in hotplug code paths needed. > > I think w/o needing to include legacy PMEM, it's better to get all TDX memory > blocks based on memblock, but not e820. The pages managed by page allocator are > from memblock anyway (w/o those from memory hotplug). > > And I also think it makes more sense to introduce 'tdx_memblock' and > 'tdx_memory' data structures to gather all TDX memory blocks during boot when > memblock is still alive. When TDX module is initialized during runtime, TDMRs > can be created based on the 'struct tdx_memory' which contains all TDX memory > blocks we gathered based on memblock during boot. This is also more flexible to > support other TDX memory from other sources such as CLX memory in the future. > > Please let me know if you have any objection? Thanks! It's already the case that x86 maintains sideband structures to preserve memory after exiting the early memblock code. Mike, correct me if I am wrong, but adding more is less desirable than just keeping the memblock around?
Thanks for feedback! On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote: > [ add Mike ] > > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote: > [..] > > > > Hi Dave, > > > > Sorry to ping (trying to close this). > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module > > initialization, I think for now it's totally fine to exclude legacy PMEMs from > > TDMRs. The worst case is when someone tries to use them as TD guest backend > > directly, the TD will fail to create. IMO it's acceptable, as it is supposedly > > that no one should just use some random backend to run TD. > > The platform will already do this, right? > In the current v3 implementation, we don't have any code to handle memory hotplug, therefore nothing prevents people from adding legacy PMEMs as system RAM using kmem driver. In order to guarantee all pages managed by page allocator are all TDX memory, the v3 implementation needs to always include legacy PMEMs as TDX memory so that even people truly add legacy PMEMs as system RAM, we can still guarantee all pages in page allocator are TDX memory. Of course, a side benefit of always including legacy PMEMs is people theoretically can use them directly as TD guest backend, but this is just a bonus but not something that we need to guarantee. > I don't understand why this > is trying to take proactive action versus documenting the error > conditions and steps someone needs to take to avoid unconvertible > memory. There is already the CONFIG_HMEM_REPORTING that describes > relative performance properties between initiators and targets, it > seems fitting to also add security properties between initiators and > targets so someone can enumerate the numa-mempolicy that avoids > unconvertible memory. I don't think there's anything related to performance properties here. The only goal here is to make sure all pages in page allocator are TDX memory pages. > > No, special casing in hotplug code paths needed. > > > > > I think w/o needing to include legacy PMEM, it's better to get all TDX memory > > blocks based on memblock, but not e820. The pages managed by page allocator are > > from memblock anyway (w/o those from memory hotplug). > > > > And I also think it makes more sense to introduce 'tdx_memblock' and > > 'tdx_memory' data structures to gather all TDX memory blocks during boot when > > memblock is still alive. When TDX module is initialized during runtime, TDMRs > > can be created based on the 'struct tdx_memory' which contains all TDX memory > > blocks we gathered based on memblock during boot. This is also more flexible to > > support other TDX memory from other sources such as CLX memory in the future. > > > > Please let me know if you have any objection? Thanks! > > It's already the case that x86 maintains sideband structures to > preserve memory after exiting the early memblock code. > May I ask what data structures are you referring to? Btw, the purpose of 'tdx_memblock' and 'tdx_memory' is not only just to preserve memblock info during boot. It is also used to provide a common data structure that the "constructing TDMRs" code can work on. If you look at patch 11-14, the logic (create TDMRs, allocate PAMTs, sets up reserved areas) around how to construct TDMRs doesn't have hard dependency on e820. If we construct TDMRs based on a common 'tdx_memory' like below: int construct_tdmrs(struct tdx_memory *tmem, ...); It would be much easier to support other TDX memory resources in the future. 
The thing I am not sure about is that Dave wants to keep the code minimal (as this series is already very big in terms of LoC) to get TDX running, and for now in practice only the system RAM present at boot is TDX capable, so I am not sure we should introduce those structures now. > Mike, correct > me if I am wrong, but adding more is less desirable than just keeping > the memblock around?
On Thu, May 5, 2022 at 3:14 PM Kai Huang <kai.huang@intel.com> wrote: > > Thanks for feedback! > > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote: > > [ add Mike ] > > > > > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote: > > [..] > > > > > > Hi Dave, > > > > > > Sorry to ping (trying to close this). > > > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from > > > TDMRs. The worst case is when someone tries to use them as TD guest backend > > > directly, the TD will fail to create. IMO it's acceptable, as it is supposedly > > > that no one should just use some random backend to run TD. > > > > The platform will already do this, right? > > > > In the current v3 implementation, we don't have any code to handle memory > hotplug, therefore nothing prevents people from adding legacy PMEMs as system > RAM using kmem driver. In order to guarantee all pages managed by page That's the fundamental question I am asking why is "guarantee all pages managed by page allocator are TDX memory". That seems overkill compared to indicating the incompatibility after the fact. > allocator are all TDX memory, the v3 implementation needs to always include > legacy PMEMs as TDX memory so that even people truly add legacy PMEMs as system > RAM, we can still guarantee all pages in page allocator are TDX memory. Why? > Of course, a side benefit of always including legacy PMEMs is people > theoretically can use them directly as TD guest backend, but this is just a > bonus but not something that we need to guarantee. > > > > I don't understand why this > > is trying to take proactive action versus documenting the error > > conditions and steps someone needs to take to avoid unconvertible > > memory. There is already the CONFIG_HMEM_REPORTING that describes > > relative performance properties between initiators and targets, it > > seems fitting to also add security properties between initiators and > > targets so someone can enumerate the numa-mempolicy that avoids > > unconvertible memory. > > I don't think there's anything related to performance properties here. The only > goal here is to make sure all pages in page allocator are TDX memory pages. Please reconsider or re-clarify that goal. > > > > > No, special casing in hotplug code paths needed. > > > > > > > > I think w/o needing to include legacy PMEM, it's better to get all TDX memory > > > blocks based on memblock, but not e820. The pages managed by page allocator are > > > from memblock anyway (w/o those from memory hotplug). > > > > > > And I also think it makes more sense to introduce 'tdx_memblock' and > > > 'tdx_memory' data structures to gather all TDX memory blocks during boot when > > > memblock is still alive. When TDX module is initialized during runtime, TDMRs > > > can be created based on the 'struct tdx_memory' which contains all TDX memory > > > blocks we gathered based on memblock during boot. This is also more flexible to > > > support other TDX memory from other sources such as CLX memory in the future. > > > > > > Please let me know if you have any objection? Thanks! > > > > It's already the case that x86 maintains sideband structures to > > preserve memory after exiting the early memblock code. > > > > May I ask what data structures are you referring to? struct numa_meminfo. > Btw, the purpose of 'tdx_memblock' and 'tdx_memory' is not only just to preserve > memblock info during boot. 
It is also used to provide a common data structure > that the "constructing TDMRs" code can work on. If you look at patch 11-14, the > logic (create TDMRs, allocate PAMTs, sets up reserved areas) around how to > construct TDMRs doesn't have hard dependency on e820. If we construct TDMRs > based on a common 'tdx_memory' like below: > > int construct_tdmrs(struct tdx_memory *tmem, ...); > > It would be much easier to support other TDX memory resources in the future. "in the future" is a prompt to ask "Why not wait until that future / need arrives before adding new infrastructure?" > The thing I am not sure is Dave wants to keep the code minimal (as this series > is already very big in terms of LoC) to make TDX running, and for now in > practice there's only system RAM during boot is TDX capable, so I am not sure we > should introduce those structures now. > > > Mike, correct > > me if I am wrong, but adding more is less desirable than just keeping > > the memblock around?
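For readers following along, the 'struct numa_meminfo' sideband structure Dan points to above is roughly of the following shape (paraphrased from memory of arch/x86/mm/numa_internal.h; field details may differ between kernel versions):

    /* Approximate shape of x86's post-memblock sideband data (paraphrased). */
    struct numa_memblk {
    	u64			start;
    	u64			end;
    	int			nid;
    };

    struct numa_meminfo {
    	int			nr_blks;
    	struct numa_memblk	blk[NR_NODE_MEMBLKS];
    };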
On Thu, 2022-05-05 at 17:22 -0700, Dan Williams wrote: > On Thu, May 5, 2022 at 3:14 PM Kai Huang <kai.huang@intel.com> wrote: > > > > Thanks for feedback! > > > > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote: > > > [ add Mike ] > > > > > > > > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote: > > > [..] > > > > > > > > Hi Dave, > > > > > > > > Sorry to ping (trying to close this). > > > > > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module > > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from > > > > TDMRs. The worst case is when someone tries to use them as TD guest backend > > > > directly, the TD will fail to create. IMO it's acceptable, as it is supposedly > > > > that no one should just use some random backend to run TD. > > > > > > The platform will already do this, right? > > > > > > > In the current v3 implementation, we don't have any code to handle memory > > hotplug, therefore nothing prevents people from adding legacy PMEMs as system > > RAM using kmem driver. In order to guarantee all pages managed by page > > That's the fundamental question I am asking why is "guarantee all > pages managed by page allocator are TDX memory". That seems overkill > compared to indicating the incompatibility after the fact. As I explained, the reason is I don't want to modify page allocator to distinguish TDX and non-TDX allocation, for instance, having to have a ZONE_TDX and GFP_TDX. KVM depends on host's page fault handler to allocate the page. In fact KVM only consumes PFN from host's page tables. For now only RAM is TDX memory. By guaranteeing all pages in page allocator is TDX memory, we can easily use anonymous pages as TD guest memory. This also allows us to easily extend the shmem to support a new fd-based backend which doesn't require having to mmap() TD guest memory to host userspace: https://lore.kernel.org/kvm/20220310140911.50924-1-chao.p.peng@linux.intel.com/ Also, besides TD guest memory, there are some per-TD control data structures (which must be TDX memory too) need to be allocated for each TD. Normal memory allocation APIs can be used for such allocation if we guarantee all pages in page allocator is TDX memory. > > > allocator are all TDX memory, the v3 implementation needs to always include > > legacy PMEMs as TDX memory so that even people truly add legacy PMEMs as system > > RAM, we can still guarantee all pages in page allocator are TDX memory. > > Why? If we don't include legacy PMEMs as TDX memory, then after they are hot-added as system RAM using kmem driver, the assumption of "all pages in page allocator are TDX memory" is broken. A TD can be killed during runtime. > > > Of course, a side benefit of always including legacy PMEMs is people > > theoretically can use them directly as TD guest backend, but this is just a > > bonus but not something that we need to guarantee. > > > > > > > I don't understand why this > > > is trying to take proactive action versus documenting the error > > > conditions and steps someone needs to take to avoid unconvertible > > > memory. There is already the CONFIG_HMEM_REPORTING that describes > > > relative performance properties between initiators and targets, it > > > seems fitting to also add security properties between initiators and > > > targets so someone can enumerate the numa-mempolicy that avoids > > > unconvertible memory. > > > > I don't think there's anything related to performance properties here. 
The only > > goal here is to make sure all pages in page allocator are TDX memory pages. > > Please reconsider or re-clarify that goal. > > > > > > > > > No, special casing in hotplug code paths needed. > > > > > > > > > > > I think w/o needing to include legacy PMEM, it's better to get all TDX memory > > > > blocks based on memblock, but not e820. The pages managed by page allocator are > > > > from memblock anyway (w/o those from memory hotplug). > > > > > > > > And I also think it makes more sense to introduce 'tdx_memblock' and > > > > 'tdx_memory' data structures to gather all TDX memory blocks during boot when > > > > memblock is still alive. When TDX module is initialized during runtime, TDMRs > > > > can be created based on the 'struct tdx_memory' which contains all TDX memory > > > > blocks we gathered based on memblock during boot. This is also more flexible to > > > > support other TDX memory from other sources such as CLX memory in the future. > > > > > > > > Please let me know if you have any objection? Thanks! > > > > > > It's already the case that x86 maintains sideband structures to > > > preserve memory after exiting the early memblock code. > > > > > > > May I ask what data structures are you referring to? > > struct numa_meminfo. > > > Btw, the purpose of 'tdx_memblock' and 'tdx_memory' is not only just to preserve > > memblock info during boot. It is also used to provide a common data structure > > that the "constructing TDMRs" code can work on. If you look at patch 11-14, the > > logic (create TDMRs, allocate PAMTs, sets up reserved areas) around how to > > construct TDMRs doesn't have hard dependency on e820. If we construct TDMRs > > based on a common 'tdx_memory' like below: > > > > int construct_tdmrs(struct tdx_memory *tmem, ...); > > > > It would be much easier to support other TDX memory resources in the future. > > "in the future" is a prompt to ask "Why not wait until that future / > need arrives before adding new infrastructure?" Fine to me.
On Thu, May 5, 2022 at 5:46 PM Kai Huang <kai.huang@intel.com> wrote: > > On Thu, 2022-05-05 at 17:22 -0700, Dan Williams wrote: > > On Thu, May 5, 2022 at 3:14 PM Kai Huang <kai.huang@intel.com> wrote: > > > > > > Thanks for feedback! > > > > > > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote: > > > > [ add Mike ] > > > > > > > > > > > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote: > > > > [..] > > > > > > > > > > Hi Dave, > > > > > > > > > > Sorry to ping (trying to close this). > > > > > > > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module > > > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from > > > > > TDMRs. The worst case is when someone tries to use them as TD guest backend > > > > > directly, the TD will fail to create. IMO it's acceptable, as it is supposedly > > > > > that no one should just use some random backend to run TD. > > > > > > > > The platform will already do this, right? > > > > > > > > > > In the current v3 implementation, we don't have any code to handle memory > > > hotplug, therefore nothing prevents people from adding legacy PMEMs as system > > > RAM using kmem driver. In order to guarantee all pages managed by page > > > > That's the fundamental question I am asking why is "guarantee all > > pages managed by page allocator are TDX memory". That seems overkill > > compared to indicating the incompatibility after the fact. > > As I explained, the reason is I don't want to modify page allocator to > distinguish TDX and non-TDX allocation, for instance, having to have a ZONE_TDX > and GFP_TDX. Right, TDX details do not belong at that level, but it will work almost all the time if you do nothing to "guarantee" all TDX capable pages all the time. > KVM depends on host's page fault handler to allocate the page. In fact KVM only > consumes PFN from host's page tables. For now only RAM is TDX memory. By > guaranteeing all pages in page allocator is TDX memory, we can easily use > anonymous pages as TD guest memory. Again, TDX capable pages will be the overwhelming default, why are you worried about cluttering the memory hotplug path for nice corner cases. Consider the fact that end users can break the kernel by specifying invalid memmap= command line options. The memory hotplug code does not take any steps to add safety in those cases because there are already too many ways it can go wrong. TDX is just one more corner case where the memmap= user needs to be careful. Otherwise, it is up to the platform firmware to make sure everything in the base memory map is TDX capable, and then all you need is documentation about the failure mode when extending "System RAM" beyond that baseline. > shmem to support a new fd-based backend which doesn't require having to mmap() > TD guest memory to host userspace: > > https://lore.kernel.org/kvm/20220310140911.50924-1-chao.p.peng@linux.intel.com/ > > Also, besides TD guest memory, there are some per-TD control data structures > (which must be TDX memory too) need to be allocated for each TD. Normal memory > allocation APIs can be used for such allocation if we guarantee all pages in > page allocator is TDX memory. You don't need that guarantee, just check it after the fact and fail if that assertion fails. It should almost always be the case that it succeeds and if it doesn't then something special is happening with that system and the end user has effectively opt-ed out of TDX operation. 
> > > allocator are all TDX memory, the v3 implementation needs to always include > > > legacy PMEMs as TDX memory so that even people truly add legacy PMEMs as system > > > RAM, we can still guarantee all pages in page allocator are TDX memory. > > > > Why? > > If we don't include legacy PMEMs as TDX memory, then after they are hot-added as > system RAM using kmem driver, the assumption of "all pages in page allocator are > TDX memory" is broken. A TD can be killed during runtime. Yes, that is what the end user asked for. If they don't want that to happen then the policy decision about using kmem needs to be updated in userspace, not hard code that policy decision towards TDX inside the kernel.
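If Dan's "check it after the fact and fail" suggestion is read literally, it would look roughly like the sketch below for a per-TD control page. tdx_page_is_convertible() is a made-up stand-in for a real CMR/TDMR lookup, not an existing API, and the whole block is illustrative only:

    /* Sketch of "allocate normally, then verify", with an assumed helper. */
    #include <linux/gfp.h>
    #include <linux/mm.h>

    /* Assumed helper: would consult CMRs/TDMRs for this page. */
    bool tdx_page_is_convertible(struct page *page);

    static struct page *tdx_alloc_td_control_page(void)
    {
    	struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);

    	if (!page)
    		return NULL;

    	if (!tdx_page_is_convertible(page)) {
    		/*
    		 * Not TDX-usable memory: fail this TD operation rather
    		 * than guaranteeing the whole page allocator up front.
    		 */
    		__free_page(page);
    		return NULL;
    	}
    	return page;
    }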
On Thu, 2022-05-05 at 18:15 -0700, Dan Williams wrote: > On Thu, May 5, 2022 at 5:46 PM Kai Huang <kai.huang@intel.com> wrote: > > > > On Thu, 2022-05-05 at 17:22 -0700, Dan Williams wrote: > > > On Thu, May 5, 2022 at 3:14 PM Kai Huang <kai.huang@intel.com> wrote: > > > > > > > > Thanks for feedback! > > > > > > > > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote: > > > > > [ add Mike ] > > > > > > > > > > > > > > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote: > > > > > [..] > > > > > > > > > > > > Hi Dave, > > > > > > > > > > > > Sorry to ping (trying to close this). > > > > > > > > > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module > > > > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from > > > > > > TDMRs. The worst case is when someone tries to use them as TD guest backend > > > > > > directly, the TD will fail to create. IMO it's acceptable, as it is supposedly > > > > > > that no one should just use some random backend to run TD. > > > > > > > > > > The platform will already do this, right? > > > > > > > > > > > > > In the current v3 implementation, we don't have any code to handle memory > > > > hotplug, therefore nothing prevents people from adding legacy PMEMs as system > > > > RAM using kmem driver. In order to guarantee all pages managed by page > > > > > > That's the fundamental question I am asking why is "guarantee all > > > pages managed by page allocator are TDX memory". That seems overkill > > > compared to indicating the incompatibility after the fact. > > > > As I explained, the reason is I don't want to modify page allocator to > > distinguish TDX and non-TDX allocation, for instance, having to have a ZONE_TDX > > and GFP_TDX. > > Right, TDX details do not belong at that level, but it will work > almost all the time if you do nothing to "guarantee" all TDX capable > pages all the time. "almost all the time" do you mean? > > > KVM depends on host's page fault handler to allocate the page. In fact KVM only > > consumes PFN from host's page tables. For now only RAM is TDX memory. By > > guaranteeing all pages in page allocator is TDX memory, we can easily use > > anonymous pages as TD guest memory. > > Again, TDX capable pages will be the overwhelming default, why are you > worried about cluttering the memory hotplug path for nice corner > cases. Firstly perhaps I forgot to mention there are two concepts about TDX memory, so let me clarify first: 1) Convertible Memory Regions (CMRs). This is reported by BIOS (thus static) to indicate which memory regions *can* be used as TDX memory. This basically means all RAM during boot for now. 2) TD Memory Regions (TDMRs). Memory pages in CMRs are not automatically TDX usable memory. The TDX module needs to be configured which (convertible) memory regions can be used as TDX memory. Kernel is responsible for choosing the ranges, and configure to the TDX module. If a convertible memory page is not included into TDMRs, the TDX module will report error when it is assigned to a TD. > > Consider the fact that end users can break the kernel by specifying > invalid memmap= command line options. The memory hotplug code does not > take any steps to add safety in those cases because there are already > too many ways it can go wrong. TDX is just one more corner case where > the memmap= user needs to be careful. 
Otherwise, it is up to the > platform firmware to make sure everything in the base memory map is > TDX capable, and then all you need is documentation about the failure > mode when extending "System RAM" beyond that baseline. So the fact is, if we don't include legacy PMEMs into TDMRs, and don't do anything in memory hotplug, then if user does kmem-hot-add legacy PMEMs as system RAM, a live TD may eventually be killed. If such case is a corner case that we don't need to guarantee, then even better. And we have an additional reason that those legacy PMEMs don't need to be in TDMRs. As you suggested, we can add some documentation to point out. But the point we want to do some code check and prevent memory hotplug is, as Dave said, we want this piece of code to work on *ANY* TDX capable machines, including future machines which may, i.e. supports NVDIMM/CLX memory as TDX memory. If we don't do any code check in memory hotplug in this series, then when this code runs in future platforms, user can plug NVDIMM or CLX memory as system RAM thus break the assumption "all pages in page allocator are TDX memory", which eventually leads to live TDs being killed potentially. Dave said we need to guarantee this code can work on *ANY* TDX machines. Some documentation saying it only works one some platforms and you shouldn't do things on other platforms are not good enough: https://lore.kernel.org/lkml/cover.1649219184.git.kai.huang@intel.com/T/#m6df45b6e1702bb03dcb027044a0dabf30a86e471 > > > > shmem to support a new fd-based backend which doesn't require having to mmap() > > TD guest memory to host userspace: > > > > https://lore.kernel.org/kvm/20220310140911.50924-1-chao.p.peng@linux.intel.com/ > > > > Also, besides TD guest memory, there are some per-TD control data structures > > (which must be TDX memory too) need to be allocated for each TD. Normal memory > > allocation APIs can be used for such allocation if we guarantee all pages in > > page allocator is TDX memory. > > You don't need that guarantee, just check it after the fact and fail > if that assertion fails. It should almost always be the case that it > succeeds and if it doesn't then something special is happening with > that system and the end user has effectively opt-ed out of TDX > operation. This doesn't guarantee consistent behaviour. For instance, for one TD it can be created, while the second may fail. We should provide a consistent service. The thing is anyway we need to configure some memory regions to the TDX module. To me there's no reason we don't want to guarantee all pages in page allocator are TDX memory. > > > > > allocator are all TDX memory, the v3 implementation needs to always include > > > > legacy PMEMs as TDX memory so that even people truly add legacy PMEMs as system > > > > RAM, we can still guarantee all pages in page allocator are TDX memory. > > > > > > Why? > > > > If we don't include legacy PMEMs as TDX memory, then after they are hot-added as > > system RAM using kmem driver, the assumption of "all pages in page allocator are > > TDX memory" is broken. A TD can be killed during runtime. > > Yes, that is what the end user asked for. If they don't want that to > happen then the policy decision about using kmem needs to be updated > in userspace, not hard code that policy decision towards TDX inside > the kernel. This is also fine to me. But please also see above Dave's comment. Thanks for those valuable feedback!
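To illustrate the CMR/TDMR distinction described above: a convertible page still fails when assigned to a TD if it falls outside the configured TDMRs, which boils down to a coverage test like the sketch below. 'struct tdmr_range' and the helper name are illustrative only, not the TDX module's real structures:

    /* Illustrative only: is a physical range fully inside one TDMR? */
    #include <linux/types.h>

    struct tdmr_range {
    	u64 base;
    	u64 size;
    };

    static bool range_covered_by_tdmrs(const struct tdmr_range *tdmrs,
    				   int nr_tdmrs, u64 start, u64 end)
    {
    	int i;

    	for (i = 0; i < nr_tdmrs; i++) {
    		u64 t_start = tdmrs[i].base;
    		u64 t_end = tdmrs[i].base + tdmrs[i].size;

    		if (start >= t_start && end <= t_end)
    			return true;
    	}
    	/* Pages outside every TDMR are rejected by the TDX module. */
    	return false;
    }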
On Thu, May 5, 2022 at 6:47 PM Kai Huang <kai.huang@intel.com> wrote: > > On Thu, 2022-05-05 at 18:15 -0700, Dan Williams wrote: > > On Thu, May 5, 2022 at 5:46 PM Kai Huang <kai.huang@intel.com> wrote: > > > > > > On Thu, 2022-05-05 at 17:22 -0700, Dan Williams wrote: > > > > On Thu, May 5, 2022 at 3:14 PM Kai Huang <kai.huang@intel.com> wrote: > > > > > > > > > > Thanks for feedback! > > > > > > > > > > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote: > > > > > > [ add Mike ] > > > > > > > > > > > > > > > > > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote: > > > > > > [..] > > > > > > > > > > > > > > Hi Dave, > > > > > > > > > > > > > > Sorry to ping (trying to close this). > > > > > > > > > > > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module > > > > > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from > > > > > > > TDMRs. The worst case is when someone tries to use them as TD guest backend > > > > > > > directly, the TD will fail to create. IMO it's acceptable, as it is supposedly > > > > > > > that no one should just use some random backend to run TD. > > > > > > > > > > > > The platform will already do this, right? > > > > > > > > > > > > > > > > In the current v3 implementation, we don't have any code to handle memory > > > > > hotplug, therefore nothing prevents people from adding legacy PMEMs as system > > > > > RAM using kmem driver. In order to guarantee all pages managed by page > > > > > > > > That's the fundamental question I am asking why is "guarantee all > > > > pages managed by page allocator are TDX memory". That seems overkill > > > > compared to indicating the incompatibility after the fact. > > > > > > As I explained, the reason is I don't want to modify page allocator to > > > distinguish TDX and non-TDX allocation, for instance, having to have a ZONE_TDX > > > and GFP_TDX. > > > > Right, TDX details do not belong at that level, but it will work > > almost all the time if you do nothing to "guarantee" all TDX capable > > pages all the time. > > "almost all the time" do you mean? > > > > > > KVM depends on host's page fault handler to allocate the page. In fact KVM only > > > consumes PFN from host's page tables. For now only RAM is TDX memory. By > > > guaranteeing all pages in page allocator is TDX memory, we can easily use > > > anonymous pages as TD guest memory. > > > > Again, TDX capable pages will be the overwhelming default, why are you > > worried about cluttering the memory hotplug path for nice corner > > cases. > > Firstly perhaps I forgot to mention there are two concepts about TDX memory, so > let me clarify first: > > 1) Convertible Memory Regions (CMRs). This is reported by BIOS (thus static) to > indicate which memory regions *can* be used as TDX memory. This basically means > all RAM during boot for now. > > 2) TD Memory Regions (TDMRs). Memory pages in CMRs are not automatically TDX > usable memory. The TDX module needs to be configured which (convertible) memory > regions can be used as TDX memory. Kernel is responsible for choosing the > ranges, and configure to the TDX module. If a convertible memory page is not > included into TDMRs, the TDX module will report error when it is assigned to a > TD. > > > > > Consider the fact that end users can break the kernel by specifying > > invalid memmap= command line options. The memory hotplug code does not > > take any steps to add safety in those cases because there are already > > too many ways it can go wrong. 
TDX is just one more corner case where > > the memmap= user needs to be careful. Otherwise, it is up to the > > platform firmware to make sure everything in the base memory map is > > TDX capable, and then all you need is documentation about the failure > > mode when extending "System RAM" beyond that baseline. > > So the fact is, if we don't include legacy PMEMs into TDMRs, and don't do > anything in memory hotplug, then if user does kmem-hot-add legacy PMEMs as > system RAM, a live TD may eventually be killed. > > If such case is a corner case that we don't need to guarantee, then even better. > And we have an additional reason that those legacy PMEMs don't need to be in > TDMRs. As you suggested, we can add some documentation to point out. > > But the point we want to do some code check and prevent memory hotplug is, as > Dave said, we want this piece of code to work on *ANY* TDX capable machines, > including future machines which may, i.e. supports NVDIMM/CLX memory as TDX > memory. If we don't do any code check in memory hotplug in this series, then > when this code runs in future platforms, user can plug NVDIMM or CLX memory as > system RAM thus break the assumption "all pages in page allocator are TDX > memory", which eventually leads to live TDs being killed potentially. > > Dave said we need to guarantee this code can work on *ANY* TDX machines. Some > documentation saying it only works one some platforms and you shouldn't do > things on other platforms are not good enough: > > https://lore.kernel.org/lkml/cover.1649219184.git.kai.huang@intel.com/T/#m6df45b6e1702bb03dcb027044a0dabf30a86e471 Yes, the incompatible cases cannot be ignored, but I disagree that they actively need to be prevented. One way to achieve that is to explicitly enumerate TDX capable memory and document how mempolicy can be used to avoid killing TDs. > > > shmem to support a new fd-based backend which doesn't require having to mmap() > > > TD guest memory to host userspace: > > > > > > https://lore.kernel.org/kvm/20220310140911.50924-1-chao.p.peng@linux.intel.com/ > > > > > > Also, besides TD guest memory, there are some per-TD control data structures > > > (which must be TDX memory too) need to be allocated for each TD. Normal memory > > > allocation APIs can be used for such allocation if we guarantee all pages in > > > page allocator is TDX memory. > > > > You don't need that guarantee, just check it after the fact and fail > > if that assertion fails. It should almost always be the case that it > > succeeds and if it doesn't then something special is happening with > > that system and the end user has effectively opt-ed out of TDX > > operation. > > This doesn't guarantee consistent behaviour. For instance, for one TD it can be > created, while the second may fail. We should provide a consistent service. Yes, there needs to be enumeration and policy knobs to avoid failures, hard coded "no memory hotplug" hacks do not seem the right enumeration and policy knobs to me. > The thing is anyway we need to configure some memory regions to the TDX module. > To me there's no reason we don't want to guarantee all pages in page allocator > are TDX memory. > > > > > > > > allocator are all TDX memory, the v3 implementation needs to always include > > > > > legacy PMEMs as TDX memory so that even people truly add legacy PMEMs as system > > > > > RAM, we can still guarantee all pages in page allocator are TDX memory. > > > > > > > > Why? 
> > > > > > If we don't include legacy PMEMs as TDX memory, then after they are hot-added as > > > system RAM using kmem driver, the assumption of "all pages in page allocator are > > > TDX memory" is broken. A TD can be killed during runtime. > > > > Yes, that is what the end user asked for. If they don't want that to > > happen then the policy decision about using kmem needs to be updated > > in userspace, not hard code that policy decision towards TDX inside > > the kernel. > > This is also fine to me. But please also see above Dave's comment. Dave is right, the implementation can not just ignore the conflict. To me, enumeration plus error reporting allows for flexibility without hard coding policy in the kernel.
On Thu, May 05, 2022 at 06:51:20AM -0700, Dan Williams wrote: > [ add Mike ] > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote: > [..] > > > > Hi Dave, > > > > Sorry to ping (trying to close this). > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module > > initialization, I think for now it's totally fine to exclude legacy PMEMs from > > TDMRs. The worst case is when someone tries to use them as TD guest backend > > directly, the TD will fail to create. IMO it's acceptable, as it is supposedly > > that no one should just use some random backend to run TD. > > The platform will already do this, right? I don't understand why this > is trying to take proactive action versus documenting the error > conditions and steps someone needs to take to avoid unconvertible > memory. There is already the CONFIG_HMEM_REPORTING that describes > relative performance properties between initiators and targets, it > seems fitting to also add security properties between initiators and > targets so someone can enumerate the numa-mempolicy that avoids > unconvertible memory. > > No, special casing in hotplug code paths needed. > > > > > I think w/o needing to include legacy PMEM, it's better to get all TDX memory > > blocks based on memblock, but not e820. The pages managed by page allocator are > > from memblock anyway (w/o those from memory hotplug). > > > > And I also think it makes more sense to introduce 'tdx_memblock' and > > 'tdx_memory' data structures to gather all TDX memory blocks during boot when > > memblock is still alive. When TDX module is initialized during runtime, TDMRs > > can be created based on the 'struct tdx_memory' which contains all TDX memory > > blocks we gathered based on memblock during boot. This is also more flexible to > > support other TDX memory from other sources such as CLX memory in the future. > > > > Please let me know if you have any objection? Thanks! > > It's already the case that x86 maintains sideband structures to > preserve memory after exiting the early memblock code. Mike, correct > me if I am wrong, but adding more is less desirable than just keeping > the memblock around? TBH, I didn't read the entire thread yet, but at first glance, keeping memblock around is much more preferable than adding yet another { .start, .end, .flags } data structure. To keep memblock after boot, all that is needed is something like select ARCH_KEEP_MEMBLOCK if INTEL_TDX_HOST I'll take a closer look at the entire series next week; maybe I'm missing some details.
On Fri, 2022-05-06 at 20:09 -0400, Mike Rapoport wrote: > On Thu, May 05, 2022 at 06:51:20AM -0700, Dan Williams wrote: > > [ add Mike ] > > > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote: > > [..] > > > > > > Hi Dave, > > > > > > Sorry to ping (trying to close this). > > > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from > > > TDMRs. The worst case is when someone tries to use them as TD guest backend > > > directly, the TD will fail to create. IMO it's acceptable, as it is supposedly > > > that no one should just use some random backend to run TD. > > > > The platform will already do this, right? I don't understand why this > > is trying to take proactive action versus documenting the error > > conditions and steps someone needs to take to avoid unconvertible > > memory. There is already the CONFIG_HMEM_REPORTING that describes > > relative performance properties between initiators and targets, it > > seems fitting to also add security properties between initiators and > > targets so someone can enumerate the numa-mempolicy that avoids > > unconvertible memory. > > > > No, special casing in hotplug code paths needed. > > > > > > > > I think w/o needing to include legacy PMEM, it's better to get all TDX memory > > > blocks based on memblock, but not e820. The pages managed by page allocator are > > > from memblock anyway (w/o those from memory hotplug). > > > > > > And I also think it makes more sense to introduce 'tdx_memblock' and > > > 'tdx_memory' data structures to gather all TDX memory blocks during boot when > > > memblock is still alive. When TDX module is initialized during runtime, TDMRs > > > can be created based on the 'struct tdx_memory' which contains all TDX memory > > > blocks we gathered based on memblock during boot. This is also more flexible to > > > support other TDX memory from other sources such as CLX memory in the future. > > > > > > Please let me know if you have any objection? Thanks! > > > > It's already the case that x86 maintains sideband structures to > > preserve memory after exiting the early memblock code. Mike, correct > > me if I am wrong, but adding more is less desirable than just keeping > > the memblock around? > > TBH, I didn't read the entire thread yet, but at the first glance, keeping > memblock around is much more preferable that adding yet another { .start, > .end, .flags } data structure. To keep memblock after boot all is needed is > something like > > select ARCH_KEEP_MEMBLOCK if INTEL_TDX_HOST > > I'll take a closer look next week on the entire series, maybe I'm missing > some details. > Hi Mike, Thanks for feedback. Perhaps I haven't put a lot details of the new TDX data structures, so let me point out that the new two data structures 'struct tdx_memblock' and 'struct tdx_memory' that I am proposing are mostly supposed to be used by TDX code only, which is pretty standalone. They are not supposed to be some basic infrastructure that can be widely used by other random kernel components. In fact, currently the only operation we need is to allow memblock to register all memory regions as TDX memory blocks when the memblock is still alive. Therefore, in fact, the new data structures can even be completely invisible to other kernel components. 
For instance, TDX code can provide the below API without exposing any data structures to other kernel components: int tdx_add_memory_block(phys_addr_t start, phys_addr_t end, int nid); We would call the above API for each memory region in memblock while it is still alive. TDX code internally manages those memory regions via the new data structures I mentioned above, so we don't need to keep memblock after boot. The advantage of this approach is that it is more flexible for supporting other potential TDX memory resources (such as CXL memory) in the future. Otherwise, we can do as you suggested: select ARCH_KEEP_MEMBLOCK when INTEL_TDX_HOST is on and have the TDX code internally use the memblock API directly.
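A sketch of how the proposed tdx_add_memory_block() call would be driven, one call per memblock region while memblock is still alive; the surrounding wiring is assumed here for illustration and is not part of the series:

    /* Sketch: feed every boot-time memory region into the proposed API. */
    #include <linux/memblock.h>
    #include <linux/numa.h>
    #include <linux/pfn.h>

    /* Proposed TDX-internal API from the mail above (definition elided). */
    int tdx_add_memory_block(phys_addr_t start, phys_addr_t end, int nid);

    /* Run early in boot, before memblock is discarded. */
    static int __init tdx_register_boot_memory(void)
    {
    	unsigned long start_pfn, end_pfn;
    	int i, nid, ret;

    	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
    		ret = tdx_add_memory_block(PFN_PHYS(start_pfn),
    					   PFN_PHYS(end_pfn), nid);
    		if (ret)
    			return ret;
    	}
    	return 0;
    }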
On Fri, 2022-05-06 at 08:57 -0700, Dan Williams wrote: > On Thu, May 5, 2022 at 6:47 PM Kai Huang <kai.huang@intel.com> wrote: > > > > On Thu, 2022-05-05 at 18:15 -0700, Dan Williams wrote: > > > On Thu, May 5, 2022 at 5:46 PM Kai Huang <kai.huang@intel.com> wrote: > > > > > > > > On Thu, 2022-05-05 at 17:22 -0700, Dan Williams wrote: > > > > > On Thu, May 5, 2022 at 3:14 PM Kai Huang <kai.huang@intel.com> wrote: > > > > > > > > > > > > Thanks for feedback! > > > > > > > > > > > > On Thu, 2022-05-05 at 06:51 -0700, Dan Williams wrote: > > > > > > > [ add Mike ] > > > > > > > > > > > > > > > > > > > > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote: > > > > > > > [..] > > > > > > > > > > > > > > > > Hi Dave, > > > > > > > > > > > > > > > > Sorry to ping (trying to close this). > > > > > > > > > > > > > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module > > > > > > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from > > > > > > > > TDMRs. The worst case is when someone tries to use them as TD guest backend > > > > > > > > directly, the TD will fail to create. IMO it's acceptable, as it is supposedly > > > > > > > > that no one should just use some random backend to run TD. > > > > > > > > > > > > > > The platform will already do this, right? > > > > > > > > > > > > > > > > > > > In the current v3 implementation, we don't have any code to handle memory > > > > > > hotplug, therefore nothing prevents people from adding legacy PMEMs as system > > > > > > RAM using kmem driver. In order to guarantee all pages managed by page > > > > > > > > > > That's the fundamental question I am asking why is "guarantee all > > > > > pages managed by page allocator are TDX memory". That seems overkill > > > > > compared to indicating the incompatibility after the fact. > > > > > > > > As I explained, the reason is I don't want to modify page allocator to > > > > distinguish TDX and non-TDX allocation, for instance, having to have a ZONE_TDX > > > > and GFP_TDX. > > > > > > Right, TDX details do not belong at that level, but it will work > > > almost all the time if you do nothing to "guarantee" all TDX capable > > > pages all the time. > > > > "almost all the time" do you mean? > > > > > > > > > KVM depends on host's page fault handler to allocate the page. In fact KVM only > > > > consumes PFN from host's page tables. For now only RAM is TDX memory. By > > > > guaranteeing all pages in page allocator is TDX memory, we can easily use > > > > anonymous pages as TD guest memory. > > > > > > Again, TDX capable pages will be the overwhelming default, why are you > > > worried about cluttering the memory hotplug path for nice corner > > > cases. > > > > Firstly perhaps I forgot to mention there are two concepts about TDX memory, so > > let me clarify first: > > > > 1) Convertible Memory Regions (CMRs). This is reported by BIOS (thus static) to > > indicate which memory regions *can* be used as TDX memory. This basically means > > all RAM during boot for now. > > > > 2) TD Memory Regions (TDMRs). Memory pages in CMRs are not automatically TDX > > usable memory. The TDX module needs to be configured which (convertible) memory > > regions can be used as TDX memory. Kernel is responsible for choosing the > > ranges, and configure to the TDX module. If a convertible memory page is not > > included into TDMRs, the TDX module will report error when it is assigned to a > > TD. 
> > > > > > > > Consider the fact that end users can break the kernel by specifying > > > invalid memmap= command line options. The memory hotplug code does not > > > take any steps to add safety in those cases because there are already > > > too many ways it can go wrong. TDX is just one more corner case where > > > the memmap= user needs to be careful. Otherwise, it is up to the > > > platform firmware to make sure everything in the base memory map is > > > TDX capable, and then all you need is documentation about the failure > > > mode when extending "System RAM" beyond that baseline. > > > > So the fact is, if we don't include legacy PMEMs into TDMRs, and don't do > > anything in memory hotplug, then if user does kmem-hot-add legacy PMEMs as > > system RAM, a live TD may eventually be killed. > > > > If such case is a corner case that we don't need to guarantee, then even better. > > And we have an additional reason that those legacy PMEMs don't need to be in > > TDMRs. As you suggested, we can add some documentation to point out. > > > > But the point we want to do some code check and prevent memory hotplug is, as > > Dave said, we want this piece of code to work on *ANY* TDX capable machines, > > including future machines which may, i.e. supports NVDIMM/CLX memory as TDX > > memory. If we don't do any code check in memory hotplug in this series, then > > when this code runs in future platforms, user can plug NVDIMM or CLX memory as > > system RAM thus break the assumption "all pages in page allocator are TDX > > memory", which eventually leads to live TDs being killed potentially. > > > > Dave said we need to guarantee this code can work on *ANY* TDX machines. Some > > documentation saying it only works one some platforms and you shouldn't do > > things on other platforms are not good enough: > > > > https://lore.kernel.org/lkml/cover.1649219184.git.kai.huang@intel.com/T/#m6df45b6e1702bb03dcb027044a0dabf30a86e471 > > Yes, the incompatible cases cannot be ignored, but I disagree that > they actively need to be prevented. One way to achieve that is to > explicitly enumerate TDX capable memory and document how mempolicy can > be used to avoid killing TDs. Hi Dan, Thanks for feedback. Could you elaborate what does "explicitly enumerate TDX capable memory" mean? How to enumerate exactly? And for "document how mempolicy can be used to avoid killing TDs", what mempolicy (and error reporting you mentioned below) are you referring to? I skipped to reply your below your two replies as I think they are referring to the same "enumerate" and "mempolicy" that I am asking above. > > > > > shmem to support a new fd-based backend which doesn't require having to mmap() > > > > TD guest memory to host userspace: > > > > > > > > https://lore.kernel.org/kvm/20220310140911.50924-1-chao.p.peng@linux.intel.com/ > > > > > > > > Also, besides TD guest memory, there are some per-TD control data structures > > > > (which must be TDX memory too) need to be allocated for each TD. Normal memory > > > > allocation APIs can be used for such allocation if we guarantee all pages in > > > > page allocator is TDX memory. > > > > > > You don't need that guarantee, just check it after the fact and fail > > > if that assertion fails. It should almost always be the case that it > > > succeeds and if it doesn't then something special is happening with > > > that system and the end user has effectively opt-ed out of TDX > > > operation. > > > > This doesn't guarantee consistent behaviour. 
For instance, for one TD it can be > > created, while the second may fail. We should provide a consistent service. > > Yes, there needs to be enumeration and policy knobs to avoid failures, > hard coded "no memory hotplug" hacks do not seem the right enumeration > and policy knobs to me. > > > The thing is anyway we need to configure some memory regions to the TDX module. > > To me there's no reason we don't want to guarantee all pages in page allocator > > are TDX memory. > > > > > > > > > > > allocator are all TDX memory, the v3 implementation needs to always include > > > > > > legacy PMEMs as TDX memory so that even people truly add legacy PMEMs as system > > > > > > RAM, we can still guarantee all pages in page allocator are TDX memory. > > > > > > > > > > Why? > > > > > > > > If we don't include legacy PMEMs as TDX memory, then after they are hot-added as > > > > system RAM using kmem driver, the assumption of "all pages in page allocator are > > > > TDX memory" is broken. A TD can be killed during runtime. > > > > > > Yes, that is what the end user asked for. If they don't want that to > > > happen then the policy decision about using kmem needs to be updated > > > in userspace, not hard code that policy decision towards TDX inside > > > the kernel. > > > > This is also fine to me. But please also see above Dave's comment. > > Dave is right, the implementation can not just ignore the conflict. To > me, enumeration plus error reporting allows for flexibility without > hard coding policy in the kernel.
On Sun, May 08, 2022 at 10:00:39PM +1200, Kai Huang wrote: > On Fri, 2022-05-06 at 20:09 -0400, Mike Rapoport wrote: > > On Thu, May 05, 2022 at 06:51:20AM -0700, Dan Williams wrote: > > > [ add Mike ] > > > > > > On Thu, May 5, 2022 at 2:54 AM Kai Huang <kai.huang@intel.com> wrote: > > > [..] > > > > > > > > Hi Dave, > > > > > > > > Sorry to ping (trying to close this). > > > > > > > > Given we don't need to consider kmem-hot-add legacy PMEM after TDX module > > > > initialization, I think for now it's totally fine to exclude legacy PMEMs from > > > > TDMRs. The worst case is when someone tries to use them as TD guest backend > > > > directly, the TD will fail to create. IMO it's acceptable, as it is supposedly > > > > that no one should just use some random backend to run TD. > > > > > > The platform will already do this, right? I don't understand why this > > > is trying to take proactive action versus documenting the error > > > conditions and steps someone needs to take to avoid unconvertible > > > memory. There is already the CONFIG_HMEM_REPORTING that describes > > > relative performance properties between initiators and targets, it > > > seems fitting to also add security properties between initiators and > > > targets so someone can enumerate the numa-mempolicy that avoids > > > unconvertible memory. > > > > > > No, special casing in hotplug code paths needed. > > > > > > > > > > > I think w/o needing to include legacy PMEM, it's better to get all TDX memory > > > > blocks based on memblock, but not e820. The pages managed by page allocator are > > > > from memblock anyway (w/o those from memory hotplug). > > > > > > > > And I also think it makes more sense to introduce 'tdx_memblock' and > > > > 'tdx_memory' data structures to gather all TDX memory blocks during boot when > > > > memblock is still alive. When TDX module is initialized during runtime, TDMRs > > > > can be created based on the 'struct tdx_memory' which contains all TDX memory > > > > blocks we gathered based on memblock during boot. This is also more flexible to > > > > support other TDX memory from other sources such as CLX memory in the future. > > > > > > > > Please let me know if you have any objection? Thanks! > > > > > > It's already the case that x86 maintains sideband structures to > > > preserve memory after exiting the early memblock code. Mike, correct > > > me if I am wrong, but adding more is less desirable than just keeping > > > the memblock around? > > > > TBH, I didn't read the entire thread yet, but at the first glance, keeping > > memblock around is much more preferable that adding yet another { .start, > > .end, .flags } data structure. To keep memblock after boot all is needed is > > something like > > > > select ARCH_KEEP_MEMBLOCK if INTEL_TDX_HOST > > > > I'll take a closer look next week on the entire series, maybe I'm missing > > some details. > > > > Hi Mike, > > Thanks for feedback. > > Perhaps I haven't put a lot details of the new TDX data structures, so let me > point out that the new two data structures 'struct tdx_memblock' and 'struct > tdx_memory' that I am proposing are mostly supposed to be used by TDX code only, > which is pretty standalone. They are not supposed to be some basic > infrastructure that can be widely used by other random kernel components. We already have "pretty standalone" numa_meminfo that originally was used to setup NUMA memory topology, but now it's used by other code as well. 
And e820 tables also contain similar data, and they supposedly should be
used only at boot time, but in reality there are too many callbacks into
e820 long after the system is booted.

So any additional memory representation will only add to the overall
complexity and we'll have even more "eventually consistent" collections of
{ .start, .end, .flags } structures.

> In fact, currently the only operation we need is to allow memblock to register
> all memory regions as TDX memory blocks when the memblock is still alive.
> Therefore, in fact, the new data structures can even be completely invisible to
> other kernel components.  For instance, TDX code can provide the below API w/o
> exposing any data structures to other kernel components:
>
> 	int tdx_add_memory_block(phys_addr_t start, phys_addr_t end, int nid);
>
> And we call the above API for each memory region in memblock when it is alive.
>
> TDX code internally manages those memory regions via the new data structures
> that I mentioned above, so we don't need to keep memblock after boot.  The
> advantage of this approach is that it is more flexible to support other
> potential TDX memory resources (such as CXL memory) in the future.

Please let's keep things simple.  If other TDX memory resources need different
handling, it can be implemented then.  For now, just enable ARCH_KEEP_MEMBLOCK
and use memblock to track TDX memory.

> Otherwise, we can do as you suggested to select ARCH_KEEP_MEMBLOCK when
> INTEL_TDX_HOST is on and TDX code internally uses the memblock API directly.
>
> --
> Thanks,
> -Kai
>
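
To make the proposal above concrete, here is a minimal, illustrative sketch of
how boot code could feed every memblock region into TDX-private tracking while
memblock is still alive.  It assumes the hypothetical tdx_add_memory_block()
helper proposed in the mail above; none of this is code from the posted patches.

	#include <linux/init.h>
	#include <linux/memblock.h>
	#include <linux/numa.h>
	#include <linux/pfn.h>

	/* Hypothetical TDX-internal registration helper proposed above. */
	int tdx_add_memory_block(phys_addr_t start, phys_addr_t end, int nid);

	/* Walk every memory region memblock knows about at boot and register it. */
	static int __init tdx_register_boot_memory(void)
	{
		unsigned long start_pfn, end_pfn;
		int i, nid, ret;

		for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
			ret = tdx_add_memory_block(PFN_PHYS(start_pfn),
						   PFN_PHYS(end_pfn), nid);
			if (ret)
				return ret;
		}

		return 0;
	}

With this shape, memblock itself could be discarded after boot because TDX keeps
its own copy of the ranges, which is exactly the trade-off being debated above.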
> >
> > Hi Mike,
> >
> > Thanks for the feedback.
> >
> > Perhaps I haven't put a lot of details of the new TDX data structures, so let
> > me point out that the two new data structures 'struct tdx_memblock' and
> > 'struct tdx_memory' that I am proposing are mostly supposed to be used by TDX
> > code only, which is pretty standalone.  They are not supposed to be some basic
> > infrastructure that can be widely used by other random kernel components.
>
> We already have "pretty standalone" numa_meminfo that originally was used
> to set up NUMA memory topology, but now it's used by other code as well.
> And e820 tables also contain similar data, and they supposedly should be
> used only at boot time, but in reality there are too many callbacks into
> e820 long after the system is booted.
>
> So any additional memory representation will only add to the overall
> complexity and we'll have even more "eventually consistent" collections of
> { .start, .end, .flags } structures.
>
> > In fact, currently the only operation we need is to allow memblock to register
> > all memory regions as TDX memory blocks when the memblock is still alive.
> > Therefore, in fact, the new data structures can even be completely invisible to
> > other kernel components.  For instance, TDX code can provide the below API w/o
> > exposing any data structures to other kernel components:
> >
> > 	int tdx_add_memory_block(phys_addr_t start, phys_addr_t end, int nid);
> >
> > And we call the above API for each memory region in memblock when it is alive.
> >
> > TDX code internally manages those memory regions via the new data structures
> > that I mentioned above, so we don't need to keep memblock after boot.  The
> > advantage of this approach is that it is more flexible to support other
> > potential TDX memory resources (such as CXL memory) in the future.
>
> Please let's keep things simple.  If other TDX memory resources need different
> handling, it can be implemented then.  For now, just enable ARCH_KEEP_MEMBLOCK
> and use memblock to track TDX memory.
>

Looks good to me.  Thanks for the feedback.
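
For comparison, a minimal sketch of the approach agreed on above, assuming
ARCH_KEEP_MEMBLOCK is selected so memblock survives past boot.
tdx_tdmr_add_range() is a made-up placeholder for whatever the TDMR
construction code does with each range; again, this is not code from the
posted patches.

	#include <linux/memblock.h>

	/* Made-up placeholder for the code that builds TDMRs from a range. */
	int tdx_tdmr_add_range(phys_addr_t start, phys_addr_t end);

	/*
	 * With ARCH_KEEP_MEMBLOCK, memblock is still around when the TDX module
	 * is initialized at runtime, so its ranges can be walked directly --
	 * no private 'tdx_memblock'/'tdx_memory' copies are required.
	 */
	static int build_tdx_memory_from_memblock(void)
	{
		phys_addr_t start, end;
		u64 i;
		int ret;

		for_each_mem_range(i, &start, &end) {
			ret = tdx_tdmr_add_range(start, end);
			if (ret)
				return ret;
		}

		return 0;
	}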
> > > >
> > > > Consider the fact that end users can break the kernel by specifying
> > > > invalid memmap= command line options.  The memory hotplug code does not
> > > > take any steps to add safety in those cases because there are already
> > > > too many ways it can go wrong.  TDX is just one more corner case where
> > > > the memmap= user needs to be careful.  Otherwise, it is up to the
> > > > platform firmware to make sure everything in the base memory map is
> > > > TDX capable, and then all you need is documentation about the failure
> > > > mode when extending "System RAM" beyond that baseline.
> > >
> > > So the fact is, if we don't include legacy PMEMs in TDMRs, and don't do
> > > anything in memory hotplug, then if the user does kmem-hot-add legacy PMEMs
> > > as system RAM, a live TD may eventually be killed.
> > >
> > > If such a case is a corner case that we don't need to guarantee, then even
> > > better.  And we have an additional reason that those legacy PMEMs don't need
> > > to be in TDMRs.  As you suggested, we can add some documentation to point
> > > this out.
> > >
> > > But the point of wanting some code check to prevent memory hotplug is, as
> > > Dave said, we want this piece of code to work on *ANY* TDX capable machine,
> > > including future machines which may, e.g., support NVDIMM/CXL memory as TDX
> > > memory.  If we don't do any code check in memory hotplug in this series, then
> > > when this code runs on future platforms, the user can plug NVDIMM or CXL
> > > memory as system RAM and thus break the assumption "all pages in the page
> > > allocator are TDX memory", which eventually leads to live TDs potentially
> > > being killed.
> > >
> > > Dave said we need to guarantee this code can work on *ANY* TDX machine.  Some
> > > documentation saying it only works on some platforms and you shouldn't do
> > > such things on other platforms is not good enough:
> > >
> > > https://lore.kernel.org/lkml/cover.1649219184.git.kai.huang@intel.com/T/#m6df45b6e1702bb03dcb027044a0dabf30a86e471
> >
> > Yes, the incompatible cases cannot be ignored, but I disagree that
> > they actively need to be prevented.  One way to achieve that is to
> > explicitly enumerate TDX capable memory and document how mempolicy can
> > be used to avoid killing TDs.
>
> Hi Dan,
>
> Thanks for the feedback.
>
> Could you elaborate on what "explicitly enumerate TDX capable memory" means?
> How to enumerate exactly?
>
> And for "document how mempolicy can be used to avoid killing TDs", what
> mempolicy (and the error reporting you mentioned below) are you referring to?
>
> I skipped replying to your two replies below as I think they are referring to
> the same "enumerate" and "mempolicy" that I am asking about above.
>

Hi Dan,

I guess "explicitly enumerate TDX capable memory" means getting the Convertible
Memory Regions (CMRs).  And "document how mempolicy can be used to avoid killing
TDs" means we say something like below in the documentation?

  Any non-TDX-capable memory hot-add will result in non-TDX-capable pages being
  potentially allocated to a TD, in which case a TD may fail to be created or a
  live TD may be killed at runtime.

And by "error reporting", do you mean that in the memory hot-add code path we
check whether the new memory resource is TDX capable, and if not we print some
error similar to the above message in the documentation, but still allow the
memory hot-add to happen?  Something like below in add_memory_resource()?

	if (platform_has_tdx() && new memory resource NOT in CMRs)
		pr_err("Hot-add non-TDX memory on TDX capable system.  TD may fail to be created, or a live TD may be killed during runtime.\n");

	// allow memory hot-add anyway

I have the below concerns with this approach:

1) I think we should provide a consistent service to the user.  That is, we
either guarantee that a TD won't randomly fail to be created and a running TD
won't be killed during runtime, or we don't provide any TDX functionality at
all.  So I am not sure only "document how mempolicy can be used to avoid killing
TDs" is good enough.

2) The above code to check whether a new memory resource is in the CMRs or not
requires the kernel to get the CMRs during kernel boot.  However, getting the
CMRs requires calling SEAMCALL, which requires the kernel to support
VMXON/VMXOFF.  VMXON/VMXOFF is currently only handled by KVM.  We'd like to
avoid adding VMXON/VMXOFF to the core kernel now if not mandatory, as eventually
we will very likely need to have a reference-based approach to call
VMXON/VMXOFF.  This part is explained in the cover letter of this series.

Dave suggested that for now, to keep things simple, we can use a "winner take
all" approach:  If TDX is initialized first, don't allow memory hotplug.  If
memory hotplug happens first, don't allow TDX to be initialized.

https://lore.kernel.org/lkml/cover.1649219184.git.kai.huang@intel.com/T/#mfa6b5dcc536d8a7b78522f46ccd1230f84d52ae0

I think this is perhaps more reasonable as we are at least providing a consistent
service to the user.  And in this approach we don't need to handle VMXON/VMXOFF
in the core kernel.

Comments?
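
To make the "winner take all" idea concrete, here is a minimal, illustrative
sketch with invented names (not code from this series): whichever of TDX module
initialization and memory hot-add happens first locks the other out.

	#include <linux/errno.h>
	#include <linux/mutex.h>
	#include <linux/types.h>

	static DEFINE_MUTEX(tdx_winner_lock);
	static bool tdx_module_initialized;
	static bool memory_hotplug_happened;

	/* Hypothetical entry point that initializes the TDX module on demand. */
	int tdx_module_init(void)
	{
		int ret = 0;

		mutex_lock(&tdx_winner_lock);
		if (memory_hotplug_happened)
			ret = -EBUSY;			/* hotplug won: refuse TDX init */
		else
			tdx_module_initialized = true;	/* TDX wins from now on */
		mutex_unlock(&tdx_winner_lock);

		return ret;
	}

	/* Hypothetical check called from the memory hot-add path. */
	int tdx_memory_hotplug_allowed(void)
	{
		int ret = 0;

		mutex_lock(&tdx_winner_lock);
		if (tdx_module_initialized)
			ret = -EPERM;			/* TDX won: refuse hot-add */
		else
			memory_hotplug_happened = true;	/* hotplug wins from now on */
		mutex_unlock(&tdx_winner_lock);

		return ret;
	}

The appeal of this shape, per the reasoning above, is that neither path needs
CMR enumeration or VMXON/VMXOFF support in the core kernel; it only records
which side acted first.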