Message ID: 095e6bbc57b4470e1e9a9104059a5238c9775f00.1655894131.git.kai.huang@intel.com (mailing list archive)
State: New, archived
Series: TDX host kernel support
On 6/22/22 04:16, Kai Huang wrote:
> SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when
> CPU is not in VMX operation. The TDX_MODULE_CALL macro doesn't handle
> SEAMCALL exceptions. Leave to the caller to guarantee those conditions
> before calling __seamcall().

I was trying to make the argument earlier that you don't need *ANY*
detection for TDX, other than the ability to make a SEAMCALL.
Basically, patch 01/22 could go away.

You are right that:

	The TDX_MODULE_CALL macro doesn't handle SEAMCALL exceptions.

But, it's also not hard to make it *able* to handle exceptions.

So what does patch 01/22 buy us?  One EXTABLE entry?
On Fri, 2022-06-24 at 11:38 -0700, Dave Hansen wrote:
> On 6/22/22 04:16, Kai Huang wrote:
> > SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when
> > CPU is not in VMX operation. The TDX_MODULE_CALL macro doesn't handle
> > SEAMCALL exceptions. Leave to the caller to guarantee those conditions
> > before calling __seamcall().
>
> I was trying to make the argument earlier that you don't need *ANY*
> detection for TDX, other than the ability to make a SEAMCALL.
> Basically, patch 01/22 could go away.
>
> You are right that:
>
> 	The TDX_MODULE_CALL macro doesn't handle SEAMCALL exceptions.
>
> But, it's also not hard to make it *able* to handle exceptions.
>
> So what does patch 01/22 buy us?  One EXTABLE entry?

There are below pros if we can detect whether TDX is enabled by BIOS
during boot before initializing the TDX Module:

1) There are requirements from customers to report whether platform
supports TDX and the TDX keyID numbers before initializing the TDX module
so the userspace cloud software can use this information to do something.
Sorry I cannot find the lore link now.  Isaku, if you see, could you
provide more info?

2) As you can see, it can be used to handle ACPI CPU/memory hotplug and
driver managed memory hotplug.  Kexec() support patch also can use it.

Particularly, in concept, ACPI CPU/memory hotplug is only related to
whether TDX is enabled by BIOS, but not whether TDX module is loaded, or
the result of initializing the TDX module.  So I think we should have some
code to detect TDX during boot.

Also, it seems adding EXTABLE to TDX_MODULE_CALL doesn't have
significantly less code comparing to detecting TDX during boot:

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 4b75c930fa1b..4a97ca8eb14c 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,6 +8,7 @@
 #include <asm/ptrace.h>
 #include <asm/shared/tdx.h>
 
+#ifdef CONFIG_INTEL_TDX_HOST
 /*
  * SW-defined error codes.
  *
@@ -18,6 +19,21 @@
 #define TDX_SW_ERROR			(TDX_ERROR | GENMASK_ULL(47, 40))
 #define TDX_SEAMCALL_VMFAILINVALID	(TDX_SW_ERROR | _UL(0xFFFF0000))
 
+/*
+ * Special error codes to indicate SEAMCALL #GP and #UD.
+ *
+ * SEAMCALL causes #GP when SEAMRR is not properly enabled by BIOS, and
+ * causes #UD when CPU is not in VMX operation.  Define two separate
+ * error codes to distinguish the two cases so caller can be aware of
+ * what caused the SEAMCALL to fail.
+ *
+ * Bits 61:48 are reserved bits which will never be set by the TDX
+ * module.  Borrow 2 reserved bits to represent #GP and #UD.
+ */
+#define TDX_SEAMCALL_GP		(TDX_ERROR | GENMASK_ULL(48, 48))
+#define TDX_SEAMCALL_UD		(TDX_ERROR | GENMASK_ULL(49, 49))
+#endif
+
 #ifndef __ASSEMBLY__
 
 /*
diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
index 49a54356ae99..7431c47258d9 100644
--- a/arch/x86/virt/vmx/tdx/tdxcall.S
+++ b/arch/x86/virt/vmx/tdx/tdxcall.S
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #include <asm/asm-offsets.h>
 #include <asm/tdx.h>
+#include <asm/asm.h>
 
 /*
  * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
@@ -45,6 +46,7 @@
 	/* Leave input param 2 in RDX */
 
 	.if \host
+1:
 	seamcall
 	/*
 	 * SEAMCALL instruction is essentially a VMExit from VMX root
@@ -57,9 +59,25 @@
 	 * This value will never be used as actual SEAMCALL error code as
 	 * it is from the Reserved status code class.
 	 */
-	jnc .Lno_vmfailinvalid
+	jnc .Lseamcall_out
 	mov $TDX_SEAMCALL_VMFAILINVALID, %rax
-.Lno_vmfailinvalid:
+	jmp .Lseamcall_out
+2:
+	/*
+	 * SEAMCALL caused #GP or #UD.  By reaching here %eax contains
+	 * the trap number.  Check the trap number and set up the return
+	 * value to %rax.
+	 */
+	cmp $X86_TRAP_GP, %eax
+	je .Lseamcall_gp
+	mov $TDX_SEAMCALL_UD, %rax
+	jmp .Lseamcall_out
+.Lseamcall_gp:
+	mov $TDX_SEAMCALL_GP, %rax
+	jmp .Lseamcall_out
+
+	_ASM_EXTABLE_FAULT(1b, 2b)
+.Lseamcall_out:
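For reference, a caller-side consumer of the two special error codes above
might look like the sketch below.  This is illustrative only:
seamcall_report() and its errno mapping are hypothetical, not part of the
posted diff.

	#include <linux/errno.h>

	/*
	 * Hypothetical wrapper: map the environment-failure codes defined
	 * above to errnos, so callers need no up-front TDX/VMX detection.
	 */
	static int seamcall_report(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
				   struct tdx_module_output *out)
	{
		u64 ret = __seamcall(fn, rcx, rdx, r8, r9, out);

		if (ret == TDX_SEAMCALL_GP)	/* SEAMRR not enabled by BIOS */
			return -ENODEV;
		if (ret == TDX_SEAMCALL_UD)	/* CPU not in VMX operation */
			return -EINVAL;
		if (ret == TDX_SEAMCALL_VMFAILINVALID)	/* module not loaded */
			return -ENODEV;

		return ret ? -EIO : 0;
	}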
On 6/26/22 22:23, Kai Huang wrote:
> On Fri, 2022-06-24 at 11:38 -0700, Dave Hansen wrote:
>> On 6/22/22 04:16, Kai Huang wrote:
>>> SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when
>>> CPU is not in VMX operation. The TDX_MODULE_CALL macro doesn't handle
>>> SEAMCALL exceptions. Leave to the caller to guarantee those conditions
>>> before calling __seamcall().
>>
>> I was trying to make the argument earlier that you don't need *ANY*
>> detection for TDX, other than the ability to make a SEAMCALL.
>> Basically, patch 01/22 could go away.
...
>> So what does patch 01/22 buy us?  One EXTABLE entry?
>
> There are below pros if we can detect whether TDX is enabled by BIOS during boot
> before initializing the TDX Module:
>
> 1) There are requirements from customers to report whether platform supports TDX
> and the TDX keyID numbers before initializing the TDX module so the userspace
> cloud software can use this information to do something.  Sorry I cannot find
> the lore link now.

<sigh>

Never listen to customers literally.  It'll just lead you down the wrong
path.  They told you, "we need $FOO in dmesg" and you ran with it
without understanding why.  The fact that you even *need* to find the
lore link is because you didn't bother to realize what they really needed.

dmesg is not ABI.  It's for humans.  If you need data out of the kernel,
do it with a *REAL* ABI.  Not dmesg.

> 2) As you can see, it can be used to handle ACPI CPU/memory hotplug and driver
> managed memory hotplug.  Kexec() support patch also can use it.
>
> Particularly, in concept, ACPI CPU/memory hotplug is only related to whether TDX
> is enabled by BIOS, but not whether TDX module is loaded, or the result of
> initializing the TDX module.  So I think we should have some code to detect TDX
> during boot.

This is *EXACTLY* why our colleagues at Intel need to tell us about
what the OS and firmware should do when TDX is in varying states of decay.

Does the mere presence of the TDX module prevent hotplug?  Or, if a
system has the TDX module loaded but no intent to ever use TDX, why
can't it just use hotplug like a normal system which is not addled with
the TDX albatross around its neck?

> Also, it seems adding EXTABLE to TDX_MODULE_CALL doesn't have significantly less
> code comparing to detecting TDX during boot:

It depends on a bunch of things.  It might only be a line or two of
assembly.

If you actually went and tried it, you might be able to convince me it's
a bad idea.
On Mon, 2022-06-27 at 13:58 -0700, Dave Hansen wrote: > On 6/26/22 22:23, Kai Huang wrote: > > On Fri, 2022-06-24 at 11:38 -0700, Dave Hansen wrote: > > > On 6/22/22 04:16, Kai Huang wrote: > > > > SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when > > > > CPU is not in VMX operation. The TDX_MODULE_CALL macro doesn't handle > > > > SEAMCALL exceptions. Leave to the caller to guarantee those conditions > > > > before calling __seamcall(). > > > > > > I was trying to make the argument earlier that you don't need *ANY* > > > detection for TDX, other than the ability to make a SEAMCALL. > > > Basically, patch 01/22 could go away. > ... > > > So what does patch 01/22 buy us? One EXTABLE entry? > > > > There are below pros if we can detect whether TDX is enabled by BIOS during boot > > before initializing the TDX Module: > > > > 1) There are requirements from customers to report whether platform supports TDX > > and the TDX keyID numbers before initializing the TDX module so the userspace > > cloud software can use this information to do something. Sorry I cannot find > > the lore link now. > > <sigh> > > Never listen to customers literally. It'll just lead you down the wrong > path. They told you, "we need $FOO in dmesg" and you ran with it > without understanding why. The fact that you even *need* to find the > lore link is because you didn't bother to realize what they really needed. > > dmesg is not ABI. It's for humans. If you need data out of the kernel, > do it with a *REAL* ABI. Not dmesg. Showing in the dmesg is the first step, but later we have plan to expose keyID info via /sysfs. Of course, it's always arguable customer's such requirement is absolutely needed, but to me it's still a good thing to have code to detect TDX during boot. The code isn't complicated as you can see. > > > 2) As you can see, it can be used to handle ACPI CPU/memory hotplug and driver > > managed memory hotplug. Kexec() support patch also can use it. > > > > Particularly, in concept, ACPI CPU/memory hotplug is only related to whether TDX > > is enabled by BIOS, but not whether TDX module is loaded, or the result of > > initializing the TDX module. So I think we should have some code to detect TDX > > during boot. > > This is *EXACTLY* why our colleagues at Intel needs to tell us about > what the OS and firmware should do when TDX is in varying states of decay. Yes I am working on it to make it public. > > Does the mere presence of the TDX module prevent hotplug? > For ACPI CPU hotplug, yes. The TDX module even doesn't need to be loaded. Whether SEAMRR is enabled determines. For ACPI memory hotplug, in practice yes. For architectural behaviour, I'll work with others internally to get some public statement. > Or, if a > system has the TDX module loaded but no intent to ever use TDX, why > can't it just use hotplug like a normal system which is not addled with > the TDX albatross around its neck? I think if a machine has enabled TDX in the BIOS, the user of the machine very likely has intention to actually use TDX. Yes for driver-managed memory hotplug, it makes sense if user doesn't want to use TDX, it's better to not disable it. But to me it's also not a disaster if we just disable driver-managed memory hotplug if TDX is enabled by BIOS. For ACPI memory hotplug, I think in practice we can treat it as BIOS bug, but I'll get some public statement around this. 
>
> > Also, it seems adding EXTABLE to TDX_MODULE_CALL doesn't have significantly less
> > code comparing to detecting TDX during boot:
>
> It depends on a bunch of things.  It might only be a line or two of
> assembly.
>
> If you actually went and tried it, you might be able to convince me it's
> a bad idea.

The code I showed is basically the patch we need in order to call SEAMCALL
at runtime w/o detecting TDX first.  #GP must be handled, as it is what
SEAMCALL triggers if TDX is not enabled.  #UD happens when the CPU isn't in
VMX operation, and we should distinguish it from #GP if we already want to
handle #GP.
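The /sysfs plan mentioned above could eventually take the shape of a plain
read-only attribute.  A minimal sketch, assuming a hypothetical
tdx_nr_keyids variable filled in by the boot-time detection code (the
attribute name and location are also assumptions, not from this series):

	#include <linux/device.h>
	#include <linux/sysfs.h>

	static u32 tdx_nr_keyids;	/* assumed: set during boot-time detection */

	static ssize_t nr_keyids_show(struct device *dev,
				      struct device_attribute *attr, char *buf)
	{
		return sysfs_emit(buf, "%u\n", tdx_nr_keyids);
	}
	static DEVICE_ATTR_RO(nr_keyids);
	/* The attribute would then be added to some suitable device's group. */

Exposing the value this way gives userspace a stable file to read instead
of scraping dmesg, which is the point Dave is making.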
Kai Huang wrote: > On Mon, 2022-06-27 at 13:58 -0700, Dave Hansen wrote: > > On 6/26/22 22:23, Kai Huang wrote: > > > On Fri, 2022-06-24 at 11:38 -0700, Dave Hansen wrote: > > > > On 6/22/22 04:16, Kai Huang wrote: > > > > > SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when > > > > > CPU is not in VMX operation. The TDX_MODULE_CALL macro doesn't handle > > > > > SEAMCALL exceptions. Leave to the caller to guarantee those conditions > > > > > before calling __seamcall(). > > > > > > > > I was trying to make the argument earlier that you don't need *ANY* > > > > detection for TDX, other than the ability to make a SEAMCALL. > > > > Basically, patch 01/22 could go away. > > ... > > > > So what does patch 01/22 buy us? One EXTABLE entry? > > > > > > There are below pros if we can detect whether TDX is enabled by BIOS during boot > > > before initializing the TDX Module: > > > > > > 1) There are requirements from customers to report whether platform supports TDX > > > and the TDX keyID numbers before initializing the TDX module so the userspace > > > cloud software can use this information to do something. Sorry I cannot find > > > the lore link now. > > > > <sigh> > > > > Never listen to customers literally. It'll just lead you down the wrong > > path. They told you, "we need $FOO in dmesg" and you ran with it > > without understanding why. The fact that you even *need* to find the > > lore link is because you didn't bother to realize what they really needed. > > > > dmesg is not ABI. It's for humans. If you need data out of the kernel, > > do it with a *REAL* ABI. Not dmesg. > > Showing in the dmesg is the first step, but later we have plan to expose keyID > info via /sysfs. Of course, it's always arguable customer's such requirement is > absolutely needed, but to me it's still a good thing to have code to detect TDX > during boot. The code isn't complicated as you can see. > > > > > > 2) As you can see, it can be used to handle ACPI CPU/memory hotplug and driver > > > managed memory hotplug. Kexec() support patch also can use it. > > > > > > Particularly, in concept, ACPI CPU/memory hotplug is only related to whether TDX > > > is enabled by BIOS, but not whether TDX module is loaded, or the result of > > > initializing the TDX module. So I think we should have some code to detect TDX > > > during boot. > > > > This is *EXACTLY* why our colleagues at Intel needs to tell us about > > what the OS and firmware should do when TDX is in varying states of decay. > > Yes I am working on it to make it public. > > > > > Does the mere presence of the TDX module prevent hotplug? > > > > For ACPI CPU hotplug, yes. The TDX module even doesn't need to be loaded. > Whether SEAMRR is enabled determines. > > For ACPI memory hotplug, in practice yes. For architectural behaviour, I'll > work with others internally to get some public statement. > > > Or, if a > > system has the TDX module loaded but no intent to ever use TDX, why > > can't it just use hotplug like a normal system which is not addled with > > the TDX albatross around its neck? > > I think if a machine has enabled TDX in the BIOS, the user of the machine very > likely has intention to actually use TDX. > > Yes for driver-managed memory hotplug, it makes sense if user doesn't want to > use TDX, it's better to not disable it. But to me it's also not a disaster if > we just disable driver-managed memory hotplug if TDX is enabled by BIOS. No, driver-managed memory hotplug is how Linux handles "dedicated memory" management. 
The architecture needs to comprehend that end users may want to move address ranges into and out of Linux core-mm management independently of whether those address ranges are also covered by a SEAM range.
On Tue, 2022-07-19 at 12:39 -0700, Dan Williams wrote: > Kai Huang wrote: > > On Mon, 2022-06-27 at 13:58 -0700, Dave Hansen wrote: > > > On 6/26/22 22:23, Kai Huang wrote: > > > > On Fri, 2022-06-24 at 11:38 -0700, Dave Hansen wrote: > > > > > On 6/22/22 04:16, Kai Huang wrote: > > > > > > SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when > > > > > > CPU is not in VMX operation. The TDX_MODULE_CALL macro doesn't handle > > > > > > SEAMCALL exceptions. Leave to the caller to guarantee those conditions > > > > > > before calling __seamcall(). > > > > > > > > > > I was trying to make the argument earlier that you don't need *ANY* > > > > > detection for TDX, other than the ability to make a SEAMCALL. > > > > > Basically, patch 01/22 could go away. > > > ... > > > > > So what does patch 01/22 buy us? One EXTABLE entry? > > > > > > > > There are below pros if we can detect whether TDX is enabled by BIOS during boot > > > > before initializing the TDX Module: > > > > > > > > 1) There are requirements from customers to report whether platform supports TDX > > > > and the TDX keyID numbers before initializing the TDX module so the userspace > > > > cloud software can use this information to do something. Sorry I cannot find > > > > the lore link now. > > > > > > <sigh> > > > > > > Never listen to customers literally. It'll just lead you down the wrong > > > path. They told you, "we need $FOO in dmesg" and you ran with it > > > without understanding why. The fact that you even *need* to find the > > > lore link is because you didn't bother to realize what they really needed. > > > > > > dmesg is not ABI. It's for humans. If you need data out of the kernel, > > > do it with a *REAL* ABI. Not dmesg. > > > > Showing in the dmesg is the first step, but later we have plan to expose keyID > > info via /sysfs. Of course, it's always arguable customer's such requirement is > > absolutely needed, but to me it's still a good thing to have code to detect TDX > > during boot. The code isn't complicated as you can see. > > > > > > > > > 2) As you can see, it can be used to handle ACPI CPU/memory hotplug and driver > > > > managed memory hotplug. Kexec() support patch also can use it. > > > > > > > > Particularly, in concept, ACPI CPU/memory hotplug is only related to whether TDX > > > > is enabled by BIOS, but not whether TDX module is loaded, or the result of > > > > initializing the TDX module. So I think we should have some code to detect TDX > > > > during boot. > > > > > > This is *EXACTLY* why our colleagues at Intel needs to tell us about > > > what the OS and firmware should do when TDX is in varying states of decay. > > > > Yes I am working on it to make it public. > > > > > > > > Does the mere presence of the TDX module prevent hotplug? > > > > > > > For ACPI CPU hotplug, yes. The TDX module even doesn't need to be loaded. > > Whether SEAMRR is enabled determines. > > > > For ACPI memory hotplug, in practice yes. For architectural behaviour, I'll > > work with others internally to get some public statement. > > > > > Or, if a > > > system has the TDX module loaded but no intent to ever use TDX, why > > > can't it just use hotplug like a normal system which is not addled with > > > the TDX albatross around its neck? > > > > I think if a machine has enabled TDX in the BIOS, the user of the machine very > > likely has intention to actually use TDX. > > > > Yes for driver-managed memory hotplug, it makes sense if user doesn't want to > > use TDX, it's better to not disable it. 
> > But to me it's also not a disaster if
> > we just disable driver-managed memory hotplug if TDX is enabled by BIOS.
>
> No, driver-managed memory hotplug is how Linux handles "dedicated
> memory" management.  The architecture needs to comprehend that end users
> may want to move address ranges into and out of Linux core-mm management
> independently of whether those address ranges are also covered by a SEAM
> range.

But to avoid GFP_TDX (and ZONE_TDX) stuff, we need to guarantee that all
memory pages in the page allocator are TDX pages.  To me it's at least
quite fair that the user needs to *choose* between using driver-managed
memory hotplug and using TDX.

If automatically disabling driver-managed memory hotplug on a platform
with TDX enabled in the BIOS isn't desired, how about we introduce a
kernel command line (i.e. use_tdx={on|off}) to let the user choose?  If
the user specifies use_tdx=on, then driver-managed memory hotplug cannot
be used; if use_tdx=off, then TDX cannot be used even if it is enabled by
BIOS.
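The command-line switch itself would only be a few lines.  A minimal
sketch, assuming a use_tdx_enabled flag that the hotplug paths and the TDX
module initialization code would both consult (all names hypothetical):

	#include <linux/init.h>
	#include <linux/string.h>

	/* Hypothetical use_tdx={on|off} switch; the default is a policy choice. */
	static bool use_tdx_enabled = true;

	static int __init use_tdx_setup(char *s)
	{
		if (!strcmp(s, "off"))
			use_tdx_enabled = false;
		else if (!strcmp(s, "on"))
			use_tdx_enabled = true;
		return 1;	/* option consumed */
	}
	__setup("use_tdx=", use_tdx_setup);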
On Tue, 2022-06-28 at 10:10 +1200, Kai Huang wrote: > On Mon, 2022-06-27 at 13:58 -0700, Dave Hansen wrote: > > On 6/26/22 22:23, Kai Huang wrote: > > > On Fri, 2022-06-24 at 11:38 -0700, Dave Hansen wrote: > > > > On 6/22/22 04:16, Kai Huang wrote: > > > > > SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when > > > > > CPU is not in VMX operation. The TDX_MODULE_CALL macro doesn't handle > > > > > SEAMCALL exceptions. Leave to the caller to guarantee those conditions > > > > > before calling __seamcall(). > > > > > > > > I was trying to make the argument earlier that you don't need *ANY* > > > > detection for TDX, other than the ability to make a SEAMCALL. > > > > Basically, patch 01/22 could go away. > > ... > > > > So what does patch 01/22 buy us? One EXTABLE entry? > > > > > > There are below pros if we can detect whether TDX is enabled by BIOS during boot > > > before initializing the TDX Module: > > > > > > 1) There are requirements from customers to report whether platform supports TDX > > > and the TDX keyID numbers before initializing the TDX module so the userspace > > > cloud software can use this information to do something. Sorry I cannot find > > > the lore link now. > > > > <sigh> > > > > Never listen to customers literally. It'll just lead you down the wrong > > path. They told you, "we need $FOO in dmesg" and you ran with it > > without understanding why. The fact that you even *need* to find the > > lore link is because you didn't bother to realize what they really needed. > > > > dmesg is not ABI. It's for humans. If you need data out of the kernel, > > do it with a *REAL* ABI. Not dmesg. > > Showing in the dmesg is the first step, but later we have plan to expose keyID > info via /sysfs. Of course, it's always arguable customer's such requirement is > absolutely needed, but to me it's still a good thing to have code to detect TDX > during boot. The code isn't complicated as you can see. > > > > > > 2) As you can see, it can be used to handle ACPI CPU/memory hotplug and driver > > > managed memory hotplug. Kexec() support patch also can use it. > > > > > > Particularly, in concept, ACPI CPU/memory hotplug is only related to whether TDX > > > is enabled by BIOS, but not whether TDX module is loaded, or the result of > > > initializing the TDX module. So I think we should have some code to detect TDX > > > during boot. > > > > This is *EXACTLY* why our colleagues at Intel needs to tell us about > > what the OS and firmware should do when TDX is in varying states of decay. > > Yes I am working on it to make it public. > > > > > Does the mere presence of the TDX module prevent hotplug? > > > > For ACPI CPU hotplug, yes. The TDX module even doesn't need to be loaded. > Whether SEAMRR is enabled determines. > > For ACPI memory hotplug, in practice yes. For architectural behaviour, I'll > work with others internally to get some public statement. > > > Or, if a > > system has the TDX module loaded but no intent to ever use TDX, why > > can't it just use hotplug like a normal system which is not addled with > > the TDX albatross around its neck? > > I think if a machine has enabled TDX in the BIOS, the user of the machine very > likely has intention to actually use TDX. > > Yes for driver-managed memory hotplug, it makes sense if user doesn't want to > use TDX, it's better to not disable it. But to me it's also not a disaster if > we just disable driver-managed memory hotplug if TDX is enabled by BIOS. 
>
> For ACPI memory hotplug, I think in practice we can treat it as BIOS bug, but
> I'll get some public statement around this.
>

Hi Dave,

Try to close on how to handle memory hotplug.  After discussion, below
will be architectural behaviour of TDX in terms of ACPI memory hotplug:

1) During platform boot, CMRs must be physically present.  MCHECK verifies
all CMRs are physically present and are actually TDX convertible memory.

2) CMRs are static after platform boots and don't change at runtime.  TDX
architecture doesn't support hot-add or hot-removal of CMR memory.

3) TDX architecture doesn't forbid non-CMR memory hotplug.

Also, although TDX doesn't trust BIOS in terms of security, a non-buggy
BIOS should prevent CMR memory from being hot-removed.  If kernel ever
receives such event, it's a BIOS bug, or even worse, the BIOS is
compromised and under attack.

As a result, the kernel should also never receive event of hot-add CMR
memory.  It is very much likely TDX is under attack (physical attack) in
such case, i.e. someone is trying to physically replace any CMR memory.

In terms of how to handle ACPI memory hotplug, my thinking is -- ideally,
if the kernel can get the CMRs during kernel boot when detecting whether
TDX is enabled by BIOS, we can do below:

- For memory hot-removal, if the removed memory falls into any CMR, then
kernel can speak loudly it is a BIOS bug.  But when this happens, the
hot-removal has been handled by BIOS thus kernel cannot actually prevent,
so kernel can either BUG(), or just print error message.  If the removed
memory doesn't fall into CMR, we do nothing.

- For memory hot-add, if the new memory falls into any CMR, then kernel
should speak loudly it is a BIOS bug, or even say "TDX is under attack" as
this is only possible when CMR memory has been previously hot-removed.
And kernel should reject the new memory for security reason.  If the new
memory doesn't fall into any CMR, then we (also) just reject the new
memory, as we want to guarantee all memory in page allocator are TDX
pages.  But this is basically due to kernel policy but not due to TDX
architecture.

BUT, since as the first step, we cannot get the CMR during kernel boot (as
it requires additional code to put CPU into VMX operation), I think for
now we can handle ACPI memory hotplug in below way:

- For memory hot-removal, we do nothing.

- For memory hot-add, we simply reject the new memory when TDX is enabled
by BIOS (see the sketch after this mail).  This not only prevents the
potential "physical attack of replacing any CMR memory", but also makes
sure no non-CMR memory will be added to page allocator during runtime via
ACPI memory hot-add.

We can improve this in next stage when we can get CMRs during kernel boot.

For the concern that on a TDX BIOS enabled system, people may not want to
use TDX at all but just use it as normal system, as I replied to Dan
regarding to the driver-managed memory hotplug, we can provide a kernel
commandline, i.e. use_tdx={on|off}, to allow user to *choose* between TDX
and memory hotplug.  When use_tdx=off, we continue to allow memory hotplug
and driver-managed hotplug as normal but refuse to initialize TDX module.

Any comments?
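A minimal sketch of the hot-add rejection described above, using the
generic memory notifier chain.  platform_tdx_enabled() is an assumed
boot-time detection helper, and vetoing MEM_GOING_ONLINE is one plausible
enforcement point, not necessarily the mechanism the series would use:

	#include <linux/memory.h>
	#include <linux/notifier.h>

	static bool platform_tdx_enabled(void);	/* assumed boot-time helper */

	static int tdx_memory_notifier(struct notifier_block *nb,
				       unsigned long action, void *v)
	{
		/*
		 * Veto onlining of memory while TDX is enabled by BIOS.
		 * Real code would exempt memory that was present at boot
		 * and only reject ranges hot-added afterwards.
		 */
		if (action == MEM_GOING_ONLINE && platform_tdx_enabled())
			return NOTIFY_BAD;

		return NOTIFY_OK;
	}

	static struct notifier_block tdx_memory_nb = {
		.notifier_call = tdx_memory_notifier,
	};

	static int __init tdx_hotplug_init(void)
	{
		return register_memory_notifier(&tdx_memory_nb);
	}
	early_initcall(tdx_hotplug_init);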
On 7/20/22 03:18, Kai Huang wrote:
> Try to close on how to handle memory hotplug.  After discussion, below will be
> architectural behaviour of TDX in terms of ACPI memory hotplug:
>
> 1) During platform boot, CMRs must be physically present.  MCHECK verifies all
> CMRs are physically present and are actually TDX convertible memory.

I doubt this is strictly true.  This makes it sound like MCHECK is doing
*ACTUAL* verification that the memory is, in practice, convertible.
That would mean actually writing to it, which would take a long time for
a large system.

Does it *ACTUALLY* verify this?

Also, it's very odd to say that "CMRs must be physically present".  A
CMR itself is a logical construct.  The physical memory *backing* a CMR
is something else entirely.

> 2) CMRs are static after platform boots and don't change at runtime.  TDX
> architecture doesn't support hot-add or hot-removal of CMR memory.
> 3) TDX architecture doesn't forbid non-CMR memory hotplug.
>
> Also, although TDX doesn't trust BIOS in terms of security, a non-buggy BIOS
> should prevent CMR memory from being hot-removed.  If kernel ever receives such
> event, it's a BIOS bug, or even worse, the BIOS is compromised and under attack.
>
> As a result, the kernel should also never receive event of hot-add CMR memory.
> It is very much likely TDX is under attack (physical attack) in such case, i.e.
> someone is trying to physically replace any CMR memory.
>
> In terms of how to handle ACPI memory hotplug, my thinking is -- ideally, if the
> kernel can get the CMRs during kernel boot when detecting whether TDX is enabled
> by BIOS, we can do below:
>
> - For memory hot-removal, if the removed memory falls into any CMR, then kernel
> can speak loudly it is a BIOS bug.  But when this happens, the hot-removal has
> been handled by BIOS thus kernel cannot actually prevent, so kernel can either
> BUG(), or just print error message.  If the removed memory doesn't fall into
> CMR, we do nothing.

Hold on a sec.  Hot-removal is a two-step process.  The kernel *MUST*
know in advance that the removal is going to occur.  It follows that up
with evacuating the memory, giving the "all clear", then the actual
physical removal can occur.

I'm not sure what you're getting at with the "kernel cannot actually
prevent" bit.  No sane system actively destroys perfectly good memory
content and tells the kernel about it after the fact.

> - For memory hot-add, if the new memory falls into any CMR, then kernel should
> speak loudly it is a BIOS bug, or even say "TDX is under attack" as this is only
> possible when CMR memory has been previously hot-removed.

I don't think this is strictly true.  It's totally possible to get a
hot-add *event* for memory which is in a CMR.  It would be another BIOS
bug, of course, but hot-remove is not a prerequisite purely for an event.

> And kernel should
> reject the new memory for security reason.  If the new memory doesn't fall into
> any CMR, then we (also) just reject the new memory, as we want to guarantee all
> memory in page allocator are TDX pages.  But this is basically due to kernel
> policy but not due to TDX architecture.

Agreed.

> BUT, since as the first step, we cannot get the CMR during kernel boot (as it
> requires additional code to put CPU into VMX operation), I think for now we can
> handle ACPI memory hotplug in below way:
>
> - For memory hot-removal, we do nothing.

This doesn't seem right to me.  *If* we get a known-bogus hot-remove
event, we need to reject it.  Remember, removal is a two-step process.

> - For memory hot-add, we simply reject the new memory when TDX is enabled by
> BIOS.  This not only prevents the potential "physical attack of replacing any
> CMR memory",

I don't think there's *any* meaningful attack mitigation here.  Even if
someone managed to replace the physical address space that backed some
private memory, the integrity checksums won't match.  Memory integrity
mitigates physical replacement, not software.

> but also makes sure no non-CMR memory will be added to page
> allocator during runtime via ACPI memory hot-add.

Agreed.  This one _is_ important and since it supports an existing
policy, it makes sense to enforce this in the kernel.

> We can improve this in next stage when we can get CMRs during kernel boot.
>
> For the concern that on a TDX BIOS enabled system, people may not want to use
> TDX at all but just use it as normal system, as I replied to Dan regarding to
> the driver-managed memory hotplug, we can provide a kernel commandline, i.e.
> use_tdx={on|off}, to allow user to *choose* between TDX and memory hotplug.
> When use_tdx=off, we continue to allow memory hotplug and driver-managed hotplug
> as normal but refuse to initialize TDX module.

That doesn't sound like a good resolution to me.

It conflates pure "software" hotplug operations like transitioning
memory ownership from the core mm to a driver (like device DAX).

TDX should not have *ANY* impact on purely software operations.  Period.
On Wed, 2022-07-20 at 09:48 -0700, Dave Hansen wrote: > On 7/20/22 03:18, Kai Huang wrote: > > Try to close on how to handle memory hotplug. After discussion, below will be > > architectural behaviour of TDX in terms of ACPI memory hotplug: > > > > 1) During platform boot, CMRs must be physically present. MCHECK verifies all > > CMRs are physically present and are actually TDX convertible memory. > > I doubt this is strictly true. This makes it sound like MCHECK is doing > *ACTUAL* verification that the memory is, in practice, convertible. > That would mean actually writing to it, which would take a long time for > a large system. The "verify" is used in many places in the specs. In the public TDX module spec, it is also used: Table 1.1: Intel TDX Glossary CMR: A range of physical memory configured by BIOS and verified by MCHECK I guess to verify, MCHECK doesn't need to actually write something to memory. For example, when memory is present, it can know what type it is so it can determine. > > Does it *ACTUALLY* verify this? Yes. This is what spec says. And this is what Intel colleagues said. > > Also, it's very odd to say that "CMRs must be physically present". A > CMR itself is a logical construct. The physical memory *backing* a CMR > is, something else entirely. OK. But I think it is easy to interpret this actually means the physical memory *backing* a CMR. > > > 2) CMRs are static after platform boots and don't change at runtime. TDX > > architecture doesn't support hot-add or hot-removal of CMR memory. > > 3) TDX architecture doesn't forbid non-CMR memory hotplug. > > > > Also, although TDX doesn't trust BIOS in terms of security, a non-buggy BIOS > > should prevent CMR memory from being hot-removed. If kernel ever receives such > > event, it's a BIOS bug, or even worse, the BIOS is compromised and under attack. > > > > As a result, the kernel should also never receive event of hot-add CMR memory. > > It is very much likely TDX is under attack (physical attack) in such case, i.e. > > someone is trying to physically replace any CMR memory. > > > > In terms of how to handle ACPI memory hotplug, my thinking is -- ideally, if the > > kernel can get the CMRs during kernel boot when detecting whether TDX is enabled > > by BIOS, we can do below: > > > > - For memory hot-removal, if the removed memory falls into any CMR, then kernel > > can speak loudly it is a BIOS bug. But when this happens, the hot-removal has > > been handled by BIOS thus kernel cannot actually prevent, so kernel can either > > BUG(), or just print error message. If the removed memory doesn't fall into > > CMR, we do nothing. > > Hold on a sec. Hot-removal is a two-step process. The kernel *MUST* > know in advance that the removal is going to occur. It follows that up > with evacuating the memory, giving the "all clear", then the actual > physical removal can occur. After looking more, looks "the hot-removal has been handled by BIOS" is wrong. And you are right there's a previous step must be done (it is device offline). But the "kernel cannot actually prevent" means in the device removal callback, the kernel cannot prevent it from being removed. This is my understanding by reading the ACPI spec and the code: Firstly, the BIOS will send a "Eject Request" notification to the kernel. Upon receiving this event, the kernel will firstly try to offline the device (which can fail due to -EBUSY, etc). If offline is successful, the kernel will call device's remove callback to remove the device. 
But this remove callback doesn't return error code (which means it doesn't fail). Instead, after the remove callback is done, the kernel calls _EJ0 ACPI method to actually do the ejection. > > I'm not sure what you're getting at with the "kernel cannot actually > prevent" bit. No sane system actively destroys perfect good memory > content and tells the kernel about it after the fact. The kernel will offline the device first. This guarantees all good memory content has been migrated. > > > - For memory hot-add, if the new memory falls into any CMR, then kernel should > > speak loudly it is a BIOS bug, or even say "TDX is under attack" as this is only > > possible when CMR memory has been previously hot-removed. > > I don't think this is strictly true. It's totally possible to get a > hot-add *event* for memory which is in a CMR. It would be another BIOS > bug, of course, but hot-remove is not a prerequisite purely for an event. OK. > > > And kernel should > > reject the new memory for security reason. If the new memory doesn't fall into > > any CMR, then we (also) just reject the new memory, as we want to guarantee all > > memory in page allocator are TDX pages. But this is basically due to kernel > > policy but not due to TDX architecture. > > Agreed. > > > BUT, since as the first step, we cannot get the CMR during kernel boot (as it > > requires additional code to put CPU into VMX operation), I think for now we can > > handle ACPI memory hotplug in below way: > > > > - For memory hot-removal, we do nothing. > > This doesn't seem right to me. *If* we get a known-bogus hot-remove > event, we need to reject it. Remember, removal is a two-step process. If so, we need to reject the (CMR) memory offline. Or we just BUG() in the ACPI memory removal callback? But either way this will requires us to get the CMRs during kernel boot. Do you think we need to add this support in the first series? > > > - For memory hot-add, we simply reject the new memory when TDX is enabled by > > BIOS. This not only prevents the potential "physical attack of replacing any > > CMR memory", > > I don't think there's *any* meaningful attack mitigation here. Even if > someone managed to replace the physical address space that backed some > private memory, the integrity checksums won't match. Memory integrity > mitigates physical replacement, not software. My thinking is rejecting the new memory is a more aggressive defence than waiting until integrity checksum failure. Btw, the integrity checksum support isn't a mandatory requirement for TDX architecture. In fact, TDX also supports a mode which doesn't require integrity check (for instance, TDX on client machines). > > > but also makes sure no non-CMR memory will be added to page > > allocator during runtime via ACPI memory hot-add. > > Agreed. This one _is_ important and since it supports an existing > policy, it makes sense to enforce this in the kernel. > > > We can improve this in next stage when we can get CMRs during kernel boot. > > > > For the concern that on a TDX BIOS enabled system, people may not want to use > > TDX at all but just use it as normal system, as I replied to Dan regarding to > > the driver-managed memory hotplug, we can provide a kernel commandline, i.e. > > use_tdx={on|off}, to allow user to *choose* between TDX and memory hotplug. > > When use_tdx=off, we continue to allow memory hotplug and driver-managed hotplug > > as normal but refuse to initialize TDX module. > > That doesn't sound like a good resolution to me. 
> > It conflates pure "software" hotplug operations like transitioning > memory ownership from the core mm to a driver (like device DAX). > > TDX should not have *ANY* impact on purely software operations. Period. The hard requirement is: Once TDX module gets initialized, we cannot add any *new* memory to core-mm. But if some memory block is included to TDX memory when the module gets initialized, then we should be able to move it from core-mm to driver or vice versa. In this case, we can select all memory regions that the kernel wants to use as TDX memory at some point (during kernel boot I guess). Adding any non-selected-TDX memory regions to core-mm should always be rejected (therefore there's no removal of them from core-mm either), although it is "software" hotplug. If user wants this, he/she cannot use TDX. This is what I mean we can provide command line to allow user to *choose*. Also, if I understand correctly above, your suggestion is we want to prevent any CMR memory going offline so it won't be hot-removed (assuming we can get CMRs during boot). This looks contradicts to the requirement of being able to allow moving memory from core-mm to driver. When we offline the memory, we cannot know whether the memory will be used by driver, or later hot-removed.
On Thu, 2022-07-21 at 13:52 +1200, Kai Huang wrote:
> > > BUT, since as the first step, we cannot get the CMR during kernel boot (as it
> > > requires additional code to put CPU into VMX operation), I think for now we can
> > > handle ACPI memory hotplug in below way:
> > >
> > > - For memory hot-removal, we do nothing.
> >
> > This doesn't seem right to me.  *If* we get a known-bogus hot-remove
> > event, we need to reject it.  Remember, removal is a two-step process.
>
> If so, we need to reject the (CMR) memory offline.  Or we just BUG() in the
> ACPI memory removal callback?
>
> But either way this will requires us to get the CMRs during kernel boot.
>
> Do you think we need to add this support in the first series?

Hi Dave,

In terms of whether we should get CMRs during kernel boot (which requires
we do VMXON/VMXOFF during kernel boot around SEAMCALL), I forgot one thing:

Technically, ACPI memory hotplug is related to whether TDX is enabled in
BIOS, but not related to whether the TDX module is loaded or not.

By doing VMXON/VMXOFF, we can get CMRs during kernel boot by calling the
P-SEAMLDR's SEAMCALL.  But theoretically, from the TDX architecture's
point of view, the P-SEAMLDR may not be loaded even if TDX is enabled by
BIOS (in practice, the P-SEAMLDR is always loaded by BIOS when TDX is
enabled), in which case there's no way we can get CMRs.  But in this case,
I think we can just treat TDX as not enabled by BIOS, as the kernel should
never try to load the P-SEAMLDR itself.

Other advantages of being able to do VMXON/VMXOFF and getting CMRs during
kernel boot:

1) We can just shut down the TDX module in kexec();

2) We can choose to trim any non-CMR memory out of memblock.memory instead
of having to manually verify all memory regions in memblock are CMR memory
(see the sketch below).

Comments?
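For point 2), a rough sketch of the memblock trimming, assuming the CMRs
were already retrieved into a boot-time array (cmr_array, nr_cmrs and
range_within_cmr() are all hypothetical names):

	#include <linux/memblock.h>

	struct cmr {
		phys_addr_t base;
		phys_addr_t size;
	};

	/* Assumed: filled from the P-SEAMLDR's info during early boot. */
	static struct cmr cmr_array[32] __initdata;
	static int nr_cmrs __initdata;

	static bool __init range_within_cmr(phys_addr_t start, phys_addr_t end)
	{
		int i;

		for (i = 0; i < nr_cmrs; i++) {
			if (start >= cmr_array[i].base &&
			    end <= cmr_array[i].base + cmr_array[i].size)
				return true;
		}
		return false;
	}

	/*
	 * Trim any non-CMR memory out of memblock.memory.  Simplified:
	 * real code would have to be careful about mutating memblock
	 * while iterating over it.
	 */
	static void __init trim_non_cmr_memory(void)
	{
		phys_addr_t start, end;
		u64 i;

		for_each_mem_range(i, &start, &end)
			if (!range_within_cmr(start, end))
				memblock_remove(start, end - start);
	}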
On 7/26/22 17:34, Kai Huang wrote:
>> This doesn't seem right to me.  *If* we get a known-bogus
>> hot-remove event, we need to reject it.  Remember, removal is a
>> two-step process.
> If so, we need to reject the (CMR) memory offline.  Or we just BUG()
> in the ACPI memory removal callback?
>
> But either way this will requires us to get the CMRs during kernel boot.

I don't get the link there between CMRs at boot and handling hotplug.

We don't need to go to extreme measures just to get a message out of the
kernel that the BIOS is bad.  If we don't have the data to do it
already, then I don't really see the need to warn about it.

Think of a system that has TDX enabled in the BIOS, but is running an
old kernel.  It will have *ZERO* idea that hotplug doesn't work.  It'll
run blissfully along.  I don't see any reason that a kernel with TDX
support, but where TDX is disabled, should actively go out and try to be
better than those old pre-TDX kernels.

Further, there's nothing to stop non-CMR memory from being added to a
system with TDX enabled in the BIOS but where the kernel is not using
it.  If we actively go out and keep good old DRAM from being added, then
we unnecessarily addle those systems.
On Tue, 2022-07-26 at 17:50 -0700, Dave Hansen wrote:
> On 7/26/22 17:34, Kai Huang wrote:
> > > This doesn't seem right to me.  *If* we get a known-bogus
> > > hot-remove event, we need to reject it.  Remember, removal is a
> > > two-step process.
> > If so, we need to reject the (CMR) memory offline.  Or we just BUG()
> > in the ACPI memory removal callback?
> >
> > But either way this will requires us to get the CMRs during kernel boot.
>
> I don't get the link there between CMRs at boot and handling hotplug.
>
> We don't need to go to extreme measures just to get a message out of the
> kernel that the BIOS is bad.  If we don't have the data to do it
> already, then I don't really see the need to warn about it.
>
> Think of a system that has TDX enabled in the BIOS, but is running an
> old kernel.  It will have *ZERO* idea that hotplug doesn't work.  It'll
> run blissfully along.  I don't see any reason that a kernel with TDX
> support, but where TDX is disabled, should actively go out and try to be
> better than those old pre-TDX kernels.

Agreed, assuming by "where TDX is disabled" you mean TDX isn't usable
(i.e. when the TDX module isn't loaded, or won't be initialized at all).

> Further, there's nothing to stop non-CMR memory from being added to a
> system with TDX enabled in the BIOS but where the kernel is not using
> it.  If we actively go out and keep good old DRAM from being added, then
> we unnecessarily addle those systems.

OK.  Then for memory hot-add, perhaps we can just go with the
"winner-take-all" approach you mentioned before?

For memory hot-removal, as I replied previously, it looks like the kernel
cannot reject the removal if it allows the memory to go offline.  Any
suggestions on this?
On Thu, 2022-07-21 at 13:52 +1200, Kai Huang wrote:
> Also, if I understand correctly above, your suggestion is we want to prevent any
> CMR memory going offline so it won't be hot-removed (assuming we can get CMRs
> during boot).  This looks contradicts to the requirement of being able to allow
> moving memory from core-mm to driver.  When we offline the memory, we cannot
> know whether the memory will be used by driver, or later hot-removed.

Hi Dave,

The high level flow of device hot-removal is:

  acpi_scan_hot_remove()
    -> acpi_scan_try_to_offline()
      -> acpi_bus_offline()
        -> device_offline()
          -> memory_subsys_offline()
    -> acpi_bus_trim()
      -> acpi_memory_device_remove()

And memory_subsys_offline() can also be triggered via /sysfs:

	echo 0 > /sys/devices/system/memory/memory30/online

After the memory block is offline, my understanding is kernel can
theoretically move it to, i.e. ZONE_DEVICE via memremap_pages().

As you can see memory_subsys_offline() is the entry point of memory device
offline (before it the code is generic for all ACPI device), and it cannot
distinguish whether the removal is from ACPI event, or from /sysfs, so it
seems we are unable to refuse to offline memory in memory_subsys_offline()
when it is called from ACPI event.

Any comments?
On 8/2/22 19:37, Kai Huang wrote:
> On Thu, 2022-07-21 at 13:52 +1200, Kai Huang wrote:
>> Also, if I understand correctly above, your suggestion is we want to prevent any
>> CMR memory going offline so it won't be hot-removed (assuming we can get CMRs
>> during boot).  This looks contradicts to the requirement of being able to allow
>> moving memory from core-mm to driver.  When we offline the memory, we cannot
>> know whether the memory will be used by driver, or later hot-removed.
> Hi Dave,
>
> The high level flow of device hot-removal is:
>
>   acpi_scan_hot_remove()
>     -> acpi_scan_try_to_offline()
>       -> acpi_bus_offline()
>         -> device_offline()
>           -> memory_subsys_offline()
>     -> acpi_bus_trim()
>       -> acpi_memory_device_remove()
>
> And memory_subsys_offline() can also be triggered via /sysfs:
>
> 	echo 0 > /sys/devices/system/memory/memory30/online
>
> After the memory block is offline, my understanding is kernel can theoretically
> move it to, i.e. ZONE_DEVICE via memremap_pages().
>
> As you can see memory_subsys_offline() is the entry point of memory device
> offline (before it the code is generic for all ACPI device), and it cannot
> distinguish whether the removal is from ACPI event, or from /sysfs, so it seems
> we are unable to refuse to offline memory in memory_subsys_offline() when it is
> called from ACPI event.
>
> Any comments?

I suggest refactoring the code in a way that makes it possible to
distinguish the two cases.

It's not like you have some binary kernel.  You have the source code for
the whole thing and can propose changes *ANYWHERE* you need.  Even better:

$ grep -A2 ^ACPI\$ MAINTAINERS
ACPI
M:	"Rafael J. Wysocki" <rafael@kernel.org>
R:	Len Brown <lenb@kernel.org>

The maintainer of ACPI works for our employer.  Plus, he's a nice
helpful guy that you can go ask how you might refactor this or
approaches you might take.  Have you talked to Rafael about this issue?

Also, from a two-minute grepping session, I noticed this:

> static acpi_status acpi_bus_offline(acpi_handle handle, u32 lvl, void *data,
>                                     void **ret_p)
> {
...
>         if (device->handler && !device->handler->hotplug.enabled) {
>                 *ret_p = &device->dev;
>                 return AE_SUPPORT;
>         }

It looks to me like if you simply set:

	memory_device_handler->hotplug.enabled = false;

you'll get most of the behavior you want.  ACPI memory hotplug would not
work and the changes would be confined to the ACPI world.  The
"lower-level" bus-based hotplug would be unaffected.

Now, I don't know what kind of locking would be needed to muck with a
global structure like that.  But, it's a start.
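As a concrete (but hypothetical) rendering of that suggestion: assuming a
boot-time platform_tdx_enabled() helper, the flip could live next to
memory_device_handler in drivers/acpi/acpi_memhotplug.c.  This is a sketch
only; as Dave notes, the locking question is unresolved:

	/*
	 * Sketch: make acpi_bus_offline() return AE_SUPPORT for memory
	 * devices by clearing the handler's hotplug profile.  The
	 * device_hotplug lock here may or may not be sufficient.
	 */
	static void __init acpi_memhotplug_check_tdx(void)
	{
		if (!platform_tdx_enabled())	/* assumed helper */
			return;

		lock_device_hotplug();
		memory_device_handler.hotplug.enabled = false;
		unlock_device_hotplug();

		pr_info("ACPI memory hotplug disabled: TDX enabled by BIOS\n");
	}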
On Wed, 2022-08-03 at 07:20 -0700, Dave Hansen wrote: > On 8/2/22 19:37, Kai Huang wrote: > > On Thu, 2022-07-21 at 13:52 +1200, Kai Huang wrote: > > > Also, if I understand correctly above, your suggestion is we want to prevent any > > > CMR memory going offline so it won't be hot-removed (assuming we can get CMRs > > > during boot). This looks contradicts to the requirement of being able to allow > > > moving memory from core-mm to driver. When we offline the memory, we cannot > > > know whether the memory will be used by driver, or later hot-removed. > > Hi Dave, > > > > The high level flow of device hot-removal is: > > > > acpi_scan_hot_remove() > > -> acpi_scan_try_to_offline() > > -> acpi_bus_offline() > > -> device_offline() > > -> memory_subsys_offline() > > -> acpi_bus_trim() > > -> acpi_memory_device_remove() > > > > > > And memory_subsys_offline() can also be triggered via /sysfs: > > > > echo 0 > /sys/devices/system/memory/memory30/online > > > > After the memory block is offline, my understanding is kernel can theoretically > > move it to, i.e. ZONE_DEVICE via memremap_pages(). > > > > As you can see memory_subsys_offline() is the entry point of memory device > > offline (before it the code is generic for all ACPI device), and it cannot > > distinguish whether the removal is from ACPI event, or from /sysfs, so it seems > > we are unable to refuse to offline memory in memory_subsys_offline() when it is > > called from ACPI event. > > > > Any comments? > > I suggest refactoring the code in a way that makes it possible to > distinguish the two cases. > > It's not like you have some binary kernel. You have the source code for > the whole thing and can propose changes *ANYWHERE* you need. Even better: > > $ grep -A2 ^ACPI\$ MAINTAINERS > ACPI > M: "Rafael J. Wysocki" <rafael@kernel.org> > R: Len Brown <lenb@kernel.org> > > The maintainer of ACPI works for our employer. Plus, he's a nice > helpful guy that you can go ask how you might refactor this or > approaches you might take. Have you talked to Rafael about this issue? Rafael once also suggested to set hotplug.enabled to 0 as your code shows below, but we just got the TDX architecture behaviour of memory hotplug clarified from Intel TDX guys recently. > Also, from a two-minute grepping session, I noticed this: > > > static acpi_status acpi_bus_offline(acpi_handle handle, u32 lvl, void *data, > > void **ret_p) > > { > ... > > if (device->handler && !device->handler->hotplug.enabled) { > > *ret_p = &device->dev; > > return AE_SUPPORT; > > } > > It looks to me like if you simply set: > > memory_device_handler->hotplug.enabled = false; > > you'll get most of the behavior you want. ACPI memory hotplug would not > work and the changes would be confined to the ACPI world. The > "lower-level" bus-based hotplug would be unaffected. > > Now, I don't know what kind of locking would be needed to muck with a > global structure like that. But, it's a start. This has two problems: 1) This approach cannot distinguish non-CMR memory hotplug and CMR memory hotplug, as it disables ACPI memory hotplug for all. But this is fine as we want to reject non-CMR memory hotplug anyway. We just need to explain clearly in changelog. 2) This won't allow the kernel to speak out "BIOS bug" when CMR memory hotplug actually happens. Instead, we can only print out "hotplug is disabled due to TDX is enabled by BIOS." when we set hotplug.enable to false. Assuming above is OK, I'll explore this option. 
I'll also do some research to see if it's still possible to speak out
"BIOS bug" in this approach, but it's not a mandatory requirement to me
now.

Also, if printing out "BIOS bug" for CMR memory hotplug isn't mandatory,
then we can just detect TDX during kernel boot, and disable hotplug when
TDX is enabled by BIOS, but don't need to use the "winner-take-all"
approach.  The former is clearer and easier to implement.  I'll go with
the former approach if I don't hear an objection from you.

And ACPI CPU hotplug can also be handled the same way.

Please let me know any comments.  Thanks!
On Thu, 2022-08-04 at 10:35 +1200, Kai Huang wrote: > On Wed, 2022-08-03 at 07:20 -0700, Dave Hansen wrote: > > On 8/2/22 19:37, Kai Huang wrote: > > > On Thu, 2022-07-21 at 13:52 +1200, Kai Huang wrote: > > > > Also, if I understand correctly above, your suggestion is we want to prevent any > > > > CMR memory going offline so it won't be hot-removed (assuming we can get CMRs > > > > during boot). This looks contradicts to the requirement of being able to allow > > > > moving memory from core-mm to driver. When we offline the memory, we cannot > > > > know whether the memory will be used by driver, or later hot-removed. > > > Hi Dave, > > > > > > The high level flow of device hot-removal is: > > > > > > acpi_scan_hot_remove() > > > -> acpi_scan_try_to_offline() > > > -> acpi_bus_offline() > > > -> device_offline() > > > -> memory_subsys_offline() > > > -> acpi_bus_trim() > > > -> acpi_memory_device_remove() > > > > > > > > > And memory_subsys_offline() can also be triggered via /sysfs: > > > > > > echo 0 > /sys/devices/system/memory/memory30/online > > > > > > After the memory block is offline, my understanding is kernel can theoretically > > > move it to, i.e. ZONE_DEVICE via memremap_pages(). > > > > > > As you can see memory_subsys_offline() is the entry point of memory device > > > offline (before it the code is generic for all ACPI device), and it cannot > > > distinguish whether the removal is from ACPI event, or from /sysfs, so it seems > > > we are unable to refuse to offline memory in memory_subsys_offline() when it is > > > called from ACPI event. > > > > > > Any comments? > > > > I suggest refactoring the code in a way that makes it possible to > > distinguish the two cases. > > > > It's not like you have some binary kernel. You have the source code for > > the whole thing and can propose changes *ANYWHERE* you need. Even better: > > > > $ grep -A2 ^ACPI\$ MAINTAINERS > > ACPI > > M: "Rafael J. Wysocki" <rafael@kernel.org> > > R: Len Brown <lenb@kernel.org> > > > > The maintainer of ACPI works for our employer. Plus, he's a nice > > helpful guy that you can go ask how you might refactor this or > > approaches you might take. Have you talked to Rafael about this issue? > > Rafael once also suggested to set hotplug.enabled to 0 as your code shows below, > but we just got the TDX architecture behaviour of memory hotplug clarified from > Intel TDX guys recently. > > > Also, from a two-minute grepping session, I noticed this: > > > > > static acpi_status acpi_bus_offline(acpi_handle handle, u32 lvl, void *data, > > > void **ret_p) > > > { > > ... > > > if (device->handler && !device->handler->hotplug.enabled) { > > > *ret_p = &device->dev; > > > return AE_SUPPORT; > > > } > > > > It looks to me like if you simply set: > > > > memory_device_handler->hotplug.enabled = false; > > > > you'll get most of the behavior you want. ACPI memory hotplug would not > > work and the changes would be confined to the ACPI world. The > > "lower-level" bus-based hotplug would be unaffected. > > > > Now, I don't know what kind of locking would be needed to muck with a > > global structure like that. But, it's a start. > > This has two problems: > > 1) This approach cannot distinguish non-CMR memory hotplug and CMR memory > hotplug, as it disables ACPI memory hotplug for all. But this is fine as we > want to reject non-CMR memory hotplug anyway. We just need to explain clearly > in changelog. > > 2) This won't allow the kernel to speak out "BIOS bug" when CMR memory hotplug > actually happens. 
> Instead, we can only print out "hotplug is disabled due to
> TDX is enabled by BIOS." when we set hotplug.enable to false.
>
> Assuming above is OK, I'll explore this option.  I'll also do some research to
> see if it's still possible to speak out "BIOS bug" in this approach but it's not
> a mandatory requirement to me now.
>
> Also, if print out "BIOS bug" for CMR memory hotplug isn't mandatory, then we
> can just detect TDX during kernel boot, and disable hotplug when TDX is enabled
> by BIOS, but don't need to use "winner-take-all" approach.  The former is
> clearer and easier to implement.  I'll go with the former approach if I don't
> hear objection from you.
>
> And ACPI CPU hotplug can also use the same way.
>
> Please let me know any comments.  Thanks!

One more reason why the "winner-take-all" approach doesn't work:

If we allow ACPI memory hotplug to happen but choose to disable it in the
handler using "winner-take-all", then at the beginning the ACPI code will
actually create a /sysfs entry for hotplug.enabled to allow userspace to
change it:

	/sys/firmware/acpi/hotplug/memory/enabled

Which means even if we set hotplug.enabled to false at some point,
userspace can turn it on again.  The only way is to not create this /sysfs
entry at the beginning.  With the "winner-take-all" approach, I don't
think we should avoid creating the /sysfs entry.  Nor should we introduce
an arch-specific hook to, i.e., prevent the /sysfs entry from being
changed by userspace.

So instead of the "winner-take-all" approach, I'll introduce a new kernel
command line option to allow the user to choose between ACPI CPU/memory
hotplug and TDX.  This command line option should not impact "software"
CPU/memory hotplug even when the user chooses to use TDX.  In this case,
this is similar to "winner-take-all" anyway.
diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
index 1bd688684716..fd577619620e 100644
--- a/arch/x86/virt/vmx/tdx/Makefile
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -1,2 +1,2 @@
 # SPDX-License-Identifier: GPL-2.0-only
-obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx.o
+obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx.o seamcall.o
diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
new file mode 100644
index 000000000000..f322427e48c3
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/seamcall.S
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/linkage.h>
+#include <asm/frame.h>
+
+#include "tdxcall.S"
+
+/*
+ * __seamcall() - Host-side interface functions to SEAM software module
+ *		  (the P-SEAMLDR or the TDX module).
+ *
+ * Transform function call register arguments into the SEAMCALL register
+ * ABI.  Return TDX_SEAMCALL_VMFAILINVALID if the SEAMCALL itself fails,
+ * or the completion status of the SEAMCALL leaf function.  Additional
+ * output operands are saved in @out (if it is provided by caller).
+ *
+ *-------------------------------------------------------------------------
+ * SEAMCALL ABI:
+ *-------------------------------------------------------------------------
+ * Input Registers:
+ *
+ * RAX                 - SEAMCALL Leaf number.
+ * RCX,RDX,R8-R9       - SEAMCALL Leaf specific input registers.
+ *
+ * Output Registers:
+ *
+ * RAX                 - SEAMCALL completion status code.
+ * RCX,RDX,R8-R11      - SEAMCALL Leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __seamcall() function ABI:
+ *
+ * @fn  (RDI)          - SEAMCALL Leaf number, moved to RAX
+ * @rcx (RSI)          - Input parameter 1, moved to RCX
+ * @rdx (RDX)          - Input parameter 2, moved to RDX
+ * @r8  (RCX)          - Input parameter 3, moved to R8
+ * @r9  (R8)           - Input parameter 4, moved to R9
+ *
+ * @out (R9)           - struct tdx_module_output pointer
+ *                       stored temporarily in R12 (not
+ *                       used by the P-SEAMLDR or the TDX
+ *                       module).  It can be NULL.
+ *
+ * Return (via RAX) the completion status of the SEAMCALL, or
+ * TDX_SEAMCALL_VMFAILINVALID.
+ */
+SYM_FUNC_START(__seamcall)
+	FRAME_BEGIN
+	TDX_MODULE_CALL host=1
+	FRAME_END
+	RET
+SYM_FUNC_END(__seamcall)
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index f16055cc25f4..f1a2dfb978b1 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -2,6 +2,7 @@
 #ifndef _X86_VIRT_TDX_H
 #define _X86_VIRT_TDX_H
 
+#include <linux/types.h>
 #include <linux/bits.h>
 
 /*
@@ -44,4 +45,14 @@
 	((u32)(((_keyid_part) & 0xffffffffull) + 1))
 #define TDX_KEYID_NUM(_keyid_part)	((u32)((_keyid_part) >> 32))
 
+
+/*
+ * Do not put any hardware-defined TDX structure representations below this
+ * comment!
+ */
+
+struct tdx_module_output;
+u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+	       struct tdx_module_output *out);
+
 #endif
TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM).  This
mode runs only the TDX module itself or other code to load the TDX
module.

The host kernel communicates with SEAM software via a new SEAMCALL
instruction.  This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead.

The TDX module defines SEAMCALL leaf functions to allow the host to
initialize it, and to create and run protected VMs.  SEAMCALL leaf
functions use an ABI different from the x86-64 system-v ABI.  Instead,
they share the same ABI with the TDCALL leaf functions.

Implement a function __seamcall() to allow the host to make SEAMCALL
to SEAM software using the TDX_MODULE_CALL macro which is the common
assembly for both SEAMCALL and TDCALL.

SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when
CPU is not in VMX operation.  The TDX_MODULE_CALL macro doesn't handle
SEAMCALL exceptions.  Leave to the caller to guarantee those conditions
before calling __seamcall().

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

- v3 -> v5 (no feedback on v4):
 - Explicitly tell TDX_SEAMCALL_VMFAILINVALID is returned if the SEAMCALL
   itself fails.
 - Improve the changelog.

---
 arch/x86/virt/vmx/tdx/Makefile   |  2 +-
 arch/x86/virt/vmx/tdx/seamcall.S | 52 ++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h      | 11 +++++++
 3 files changed, 64 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
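To make the documented ABI concrete, a hedged usage example follows; the
leaf number is a placeholder, since leaf definitions come from the TDX
module/P-SEAMLDR specifications and later patches in the series:

	#include "tdx.h"

	#define EXAMPLE_SEAMCALL_LEAF	0	/* hypothetical leaf number */

	static u64 seamcall_example(void)
	{
		struct tdx_module_output out;

		/*
		 * @fn lands in RAX, the four inputs in RCX/RDX/R8/R9; any
		 * leaf-specific outputs come back through @out.  Per the
		 * changelog, the caller must ensure SEAMRR is enabled and
		 * the CPU is in VMX operation before calling this.
		 */
		return __seamcall(EXAMPLE_SEAMCALL_LEAF, 0, 0, 0, 0, &out);
	}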