Message ID: 095e6bbc57b4470e1e9a9104059a5238c9775f00.1655894131.git.kai.huang@intel.com (mailing list archive)
State: New, archived
Series: TDX host kernel support
On 6/22/22 04:16, Kai Huang wrote:
> SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when
> CPU is not in VMX operation. The TDX_MODULE_CALL macro doesn't handle
> SEAMCALL exceptions. Leave to the caller to guarantee those conditions
> before calling __seamcall().

I was trying to make the argument earlier that you don't need *ANY*
detection for TDX, other than the ability to make a SEAMCALL.
Basically, patch 01/22 could go away.

You are right that:

	The TDX_MODULE_CALL macro doesn't handle SEAMCALL exceptions.

But, it's also not hard to make it *able* to handle exceptions.

So what does patch 01/22 buy us?  One EXTABLE entry?
On Fri, 2022-06-24 at 11:38 -0700, Dave Hansen wrote:
> On 6/22/22 04:16, Kai Huang wrote:
> > SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when
> > CPU is not in VMX operation. The TDX_MODULE_CALL macro doesn't handle
> > SEAMCALL exceptions. Leave to the caller to guarantee those conditions
> > before calling __seamcall().
>
> I was trying to make the argument earlier that you don't need *ANY*
> detection for TDX, other than the ability to make a SEAMCALL.
> Basically, patch 01/22 could go away.
>
> You are right that:
>
> 	The TDX_MODULE_CALL macro doesn't handle SEAMCALL exceptions.
>
> But, it's also not hard to make it *able* to handle exceptions.
>
> So what does patch 01/22 buy us?  One EXTABLE entry?

There are below pros if we can detect whether TDX is enabled by BIOS
during boot before initializing the TDX Module:

1) There are requirements from customers to report whether platform
supports TDX and the TDX keyID numbers before initializing the TDX module
so the userspace cloud software can use this information to do something.
Sorry I cannot find the lore link now.  Isaku, if you see, could you
provide more info?

2) As you can see, it can be used to handle ACPI CPU/memory hotplug and
driver managed memory hotplug.  Kexec() support patch also can use it.

Particularly, in concept, ACPI CPU/memory hotplug is only related to
whether TDX is enabled by BIOS, but not whether TDX module is loaded, or
the result of initializing the TDX module.  So I think we should have some
code to detect TDX during boot.

Also, it seems adding EXTABLE to TDX_MODULE_CALL doesn't have
significantly less code comparing to detecting TDX during boot:

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 4b75c930fa1b..4a97ca8eb14c 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -8,6 +8,7 @@
 #include <asm/ptrace.h>
 #include <asm/shared/tdx.h>
 
+#ifdef CONFIG_INTEL_TDX_HOST
 /*
  * SW-defined error codes.
  *
@@ -18,6 +19,21 @@
 #define TDX_SW_ERROR			(TDX_ERROR | GENMASK_ULL(47, 40))
 #define TDX_SEAMCALL_VMFAILINVALID	(TDX_SW_ERROR | _UL(0xFFFF0000))
 
+/*
+ * Special error codes to indicate SEAMCALL #GP and #UD.
+ *
+ * SEAMCALL causes #GP when SEAMRR is not properly enabled by BIOS, and
+ * causes #UD when CPU is not in VMX operation.  Define two separate
+ * error codes to distinguish the two cases so caller can be aware of
+ * what caused the SEAMCALL to fail.
+ *
+ * Bits 61:48 are reserved bits which will never be set by the TDX
+ * module.  Borrow 2 reserved bits to represent #GP and #UD.
+ */
+#define TDX_SEAMCALL_GP		(TDX_ERROR | GENMASK_ULL(48, 48))
+#define TDX_SEAMCALL_UD		(TDX_ERROR | GENMASK_ULL(49, 49))
+#endif
+
 #ifndef __ASSEMBLY__
 
 /*
diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S
index 49a54356ae99..7431c47258d9 100644
--- a/arch/x86/virt/vmx/tdx/tdxcall.S
+++ b/arch/x86/virt/vmx/tdx/tdxcall.S
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #include <asm/asm-offsets.h>
 #include <asm/tdx.h>
+#include <asm/asm.h>
 
 /*
  * TDCALL and SEAMCALL are supported in Binutils >= 2.36.
@@ -45,6 +46,7 @@
 	/* Leave input param 2 in RDX */
 
 	.if \host
+1:
 	seamcall
 	/*
 	 * SEAMCALL instruction is essentially a VMExit from VMX root
@@ -57,9 +59,25 @@
 	 * This value will never be used as actual SEAMCALL error code as
 	 * it is from the Reserved status code class.
 	 */
-	jnc .Lno_vmfailinvalid
+	jnc .Lseamcall_out
 	mov $TDX_SEAMCALL_VMFAILINVALID, %rax
-.Lno_vmfailinvalid:
+	jmp .Lseamcall_out
+2:
+	/*
+	 * SEAMCALL caused #GP or #UD.  By reaching here %eax contains
+	 * the trap number.  Check the trap number and set up the return
+	 * value to %rax.
+	 */
+	cmp $X86_TRAP_GP, %eax
+	je .Lseamcall_gp
+	mov $TDX_SEAMCALL_UD, %rax
+	jmp .Lseamcall_out
+.Lseamcall_gp:
+	mov $TDX_SEAMCALL_GP, %rax
+	jmp .Lseamcall_out
+
+	_ASM_EXTABLE_FAULT(1b, 2b)
+.Lseamcall_out:
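For reference, a caller-side consumer of the two special error codes above
might look like the sketch below.  This is illustrative only:
seamcall_report() and its errno mapping are hypothetical, not part of the
posted diff.

	#include <linux/errno.h>

	/*
	 * Hypothetical wrapper: map the environment-failure codes defined
	 * above to errnos, so callers need no up-front TDX/VMX detection.
	 */
	static int seamcall_report(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
				   struct tdx_module_output *out)
	{
		u64 ret = __seamcall(fn, rcx, rdx, r8, r9, out);

		if (ret == TDX_SEAMCALL_GP)	/* SEAMRR not enabled by BIOS */
			return -ENODEV;
		if (ret == TDX_SEAMCALL_UD)	/* CPU not in VMX operation */
			return -EINVAL;
		if (ret == TDX_SEAMCALL_VMFAILINVALID)	/* module not loaded */
			return -ENODEV;

		return ret ? -EIO : 0;
	}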
On 6/26/22 22:23, Kai Huang wrote:
> On Fri, 2022-06-24 at 11:38 -0700, Dave Hansen wrote:
>> On 6/22/22 04:16, Kai Huang wrote:
>>> SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when
>>> CPU is not in VMX operation. The TDX_MODULE_CALL macro doesn't handle
>>> SEAMCALL exceptions. Leave to the caller to guarantee those conditions
>>> before calling __seamcall().
>>
>> I was trying to make the argument earlier that you don't need *ANY*
>> detection for TDX, other than the ability to make a SEAMCALL.
>> Basically, patch 01/22 could go away.
...
>> So what does patch 01/22 buy us?  One EXTABLE entry?
>
> There are below pros if we can detect whether TDX is enabled by BIOS during boot
> before initializing the TDX Module:
>
> 1) There are requirements from customers to report whether platform supports TDX
> and the TDX keyID numbers before initializing the TDX module so the userspace
> cloud software can use this information to do something.  Sorry I cannot find
> the lore link now.

<sigh>

Never listen to customers literally.  It'll just lead you down the wrong
path.  They told you, "we need $FOO in dmesg" and you ran with it
without understanding why.  The fact that you even *need* to find the
lore link is because you didn't bother to realize what they really needed.

dmesg is not ABI.  It's for humans.  If you need data out of the kernel,
do it with a *REAL* ABI.  Not dmesg.

> 2) As you can see, it can be used to handle ACPI CPU/memory hotplug and driver
> managed memory hotplug.  Kexec() support patch also can use it.
>
> Particularly, in concept, ACPI CPU/memory hotplug is only related to whether TDX
> is enabled by BIOS, but not whether TDX module is loaded, or the result of
> initializing the TDX module.  So I think we should have some code to detect TDX
> during boot.

This is *EXACTLY* why our colleagues at Intel need to tell us about
what the OS and firmware should do when TDX is in varying states of decay.

Does the mere presence of the TDX module prevent hotplug?  Or, if a
system has the TDX module loaded but no intent to ever use TDX, why
can't it just use hotplug like a normal system which is not addled with
the TDX albatross around its neck?

> Also, it seems adding EXTABLE to TDX_MODULE_CALL doesn't have significantly less
> code comparing to detecting TDX during boot:

It depends on a bunch of things.  It might only be a line or two of
assembly.

If you actually went and tried it, you might be able to convince me it's
a bad idea.
On Mon, 2022-06-27 at 13:58 -0700, Dave Hansen wrote: > On 6/26/22 22:23, Kai Huang wrote: > > On Fri, 2022-06-24 at 11:38 -0700, Dave Hansen wrote: > > > On 6/22/22 04:16, Kai Huang wrote: > > > > SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when > > > > CPU is not in VMX operation. The TDX_MODULE_CALL macro doesn't handle > > > > SEAMCALL exceptions. Leave to the caller to guarantee those conditions > > > > before calling __seamcall(). > > > > > > I was trying to make the argument earlier that you don't need *ANY* > > > detection for TDX, other than the ability to make a SEAMCALL. > > > Basically, patch 01/22 could go away. > ... > > > So what does patch 01/22 buy us? One EXTABLE entry? > > > > There are below pros if we can detect whether TDX is enabled by BIOS during boot > > before initializing the TDX Module: > > > > 1) There are requirements from customers to report whether platform supports TDX > > and the TDX keyID numbers before initializing the TDX module so the userspace > > cloud software can use this information to do something. Sorry I cannot find > > the lore link now. > > <sigh> > > Never listen to customers literally. It'll just lead you down the wrong > path. They told you, "we need $FOO in dmesg" and you ran with it > without understanding why. The fact that you even *need* to find the > lore link is because you didn't bother to realize what they really needed. > > dmesg is not ABI. It's for humans. If you need data out of the kernel, > do it with a *REAL* ABI. Not dmesg. Showing in the dmesg is the first step, but later we have plan to expose keyID info via /sysfs. Of course, it's always arguable customer's such requirement is absolutely needed, but to me it's still a good thing to have code to detect TDX during boot. The code isn't complicated as you can see. > > > 2) As you can see, it can be used to handle ACPI CPU/memory hotplug and driver > > managed memory hotplug. Kexec() support patch also can use it. > > > > Particularly, in concept, ACPI CPU/memory hotplug is only related to whether TDX > > is enabled by BIOS, but not whether TDX module is loaded, or the result of > > initializing the TDX module. So I think we should have some code to detect TDX > > during boot. > > This is *EXACTLY* why our colleagues at Intel needs to tell us about > what the OS and firmware should do when TDX is in varying states of decay. Yes I am working on it to make it public. > > Does the mere presence of the TDX module prevent hotplug? > For ACPI CPU hotplug, yes. The TDX module even doesn't need to be loaded. Whether SEAMRR is enabled determines. For ACPI memory hotplug, in practice yes. For architectural behaviour, I'll work with others internally to get some public statement. > Or, if a > system has the TDX module loaded but no intent to ever use TDX, why > can't it just use hotplug like a normal system which is not addled with > the TDX albatross around its neck? I think if a machine has enabled TDX in the BIOS, the user of the machine very likely has intention to actually use TDX. Yes for driver-managed memory hotplug, it makes sense if user doesn't want to use TDX, it's better to not disable it. But to me it's also not a disaster if we just disable driver-managed memory hotplug if TDX is enabled by BIOS. For ACPI memory hotplug, I think in practice we can treat it as BIOS bug, but I'll get some public statement around this. 
>
> > Also, it seems adding EXTABLE to TDX_MODULE_CALL doesn't have significantly less
> > code comparing to detecting TDX during boot:
>
> It depends on a bunch of things.  It might only be a line or two of
> assembly.
>
> If you actually went and tried it, you might be able to convince me it's
> a bad idea.

The code I showed is basically the patch we need in order to call SEAMCALL
at runtime w/o detecting TDX first.  #GP must be handled, as it is what
SEAMCALL triggers if TDX is not enabled.  #UD happens when the CPU isn't in
VMX operation, and we should distinguish it from #GP if we already want to
handle #GP.
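The /sysfs plan mentioned above could eventually take the shape of a plain
read-only attribute.  A minimal sketch, assuming a hypothetical
tdx_nr_keyids variable filled in by the boot-time detection code (the
attribute name and location are also assumptions, not from this series):

	#include <linux/device.h>
	#include <linux/sysfs.h>

	static u32 tdx_nr_keyids;	/* assumed: set during boot-time detection */

	static ssize_t nr_keyids_show(struct device *dev,
				      struct device_attribute *attr, char *buf)
	{
		return sysfs_emit(buf, "%u\n", tdx_nr_keyids);
	}
	static DEVICE_ATTR_RO(nr_keyids);
	/* The attribute would then be added to some suitable device's group. */

Exposing the value this way gives userspace a stable file to read instead
of scraping dmesg, which is the point Dave is making.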
Kai Huang wrote: > On Mon, 2022-06-27 at 13:58 -0700, Dave Hansen wrote: > > On 6/26/22 22:23, Kai Huang wrote: > > > On Fri, 2022-06-24 at 11:38 -0700, Dave Hansen wrote: > > > > On 6/22/22 04:16, Kai Huang wrote: > > > > > SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when > > > > > CPU is not in VMX operation. The TDX_MODULE_CALL macro doesn't handle > > > > > SEAMCALL exceptions. Leave to the caller to guarantee those conditions > > > > > before calling __seamcall(). > > > > > > > > I was trying to make the argument earlier that you don't need *ANY* > > > > detection for TDX, other than the ability to make a SEAMCALL. > > > > Basically, patch 01/22 could go away. > > ... > > > > So what does patch 01/22 buy us? One EXTABLE entry? > > > > > > There are below pros if we can detect whether TDX is enabled by BIOS during boot > > > before initializing the TDX Module: > > > > > > 1) There are requirements from customers to report whether platform supports TDX > > > and the TDX keyID numbers before initializing the TDX module so the userspace > > > cloud software can use this information to do something. Sorry I cannot find > > > the lore link now. > > > > <sigh> > > > > Never listen to customers literally. It'll just lead you down the wrong > > path. They told you, "we need $FOO in dmesg" and you ran with it > > without understanding why. The fact that you even *need* to find the > > lore link is because you didn't bother to realize what they really needed. > > > > dmesg is not ABI. It's for humans. If you need data out of the kernel, > > do it with a *REAL* ABI. Not dmesg. > > Showing in the dmesg is the first step, but later we have plan to expose keyID > info via /sysfs. Of course, it's always arguable customer's such requirement is > absolutely needed, but to me it's still a good thing to have code to detect TDX > during boot. The code isn't complicated as you can see. > > > > > > 2) As you can see, it can be used to handle ACPI CPU/memory hotplug and driver > > > managed memory hotplug. Kexec() support patch also can use it. > > > > > > Particularly, in concept, ACPI CPU/memory hotplug is only related to whether TDX > > > is enabled by BIOS, but not whether TDX module is loaded, or the result of > > > initializing the TDX module. So I think we should have some code to detect TDX > > > during boot. > > > > This is *EXACTLY* why our colleagues at Intel needs to tell us about > > what the OS and firmware should do when TDX is in varying states of decay. > > Yes I am working on it to make it public. > > > > > Does the mere presence of the TDX module prevent hotplug? > > > > For ACPI CPU hotplug, yes. The TDX module even doesn't need to be loaded. > Whether SEAMRR is enabled determines. > > For ACPI memory hotplug, in practice yes. For architectural behaviour, I'll > work with others internally to get some public statement. > > > Or, if a > > system has the TDX module loaded but no intent to ever use TDX, why > > can't it just use hotplug like a normal system which is not addled with > > the TDX albatross around its neck? > > I think if a machine has enabled TDX in the BIOS, the user of the machine very > likely has intention to actually use TDX. > > Yes for driver-managed memory hotplug, it makes sense if user doesn't want to > use TDX, it's better to not disable it. But to me it's also not a disaster if > we just disable driver-managed memory hotplug if TDX is enabled by BIOS. No, driver-managed memory hotplug is how Linux handles "dedicated memory" management. 
The architecture needs to comprehend that end users may want to move address ranges into and out of Linux core-mm management independently of whether those address ranges are also covered by a SEAM range.
On Tue, 2022-07-19 at 12:39 -0700, Dan Williams wrote: > Kai Huang wrote: > > On Mon, 2022-06-27 at 13:58 -0700, Dave Hansen wrote: > > > On 6/26/22 22:23, Kai Huang wrote: > > > > On Fri, 2022-06-24 at 11:38 -0700, Dave Hansen wrote: > > > > > On 6/22/22 04:16, Kai Huang wrote: > > > > > > SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when > > > > > > CPU is not in VMX operation. The TDX_MODULE_CALL macro doesn't handle > > > > > > SEAMCALL exceptions. Leave to the caller to guarantee those conditions > > > > > > before calling __seamcall(). > > > > > > > > > > I was trying to make the argument earlier that you don't need *ANY* > > > > > detection for TDX, other than the ability to make a SEAMCALL. > > > > > Basically, patch 01/22 could go away. > > > ... > > > > > So what does patch 01/22 buy us? One EXTABLE entry? > > > > > > > > There are below pros if we can detect whether TDX is enabled by BIOS during boot > > > > before initializing the TDX Module: > > > > > > > > 1) There are requirements from customers to report whether platform supports TDX > > > > and the TDX keyID numbers before initializing the TDX module so the userspace > > > > cloud software can use this information to do something. Sorry I cannot find > > > > the lore link now. > > > > > > <sigh> > > > > > > Never listen to customers literally. It'll just lead you down the wrong > > > path. They told you, "we need $FOO in dmesg" and you ran with it > > > without understanding why. The fact that you even *need* to find the > > > lore link is because you didn't bother to realize what they really needed. > > > > > > dmesg is not ABI. It's for humans. If you need data out of the kernel, > > > do it with a *REAL* ABI. Not dmesg. > > > > Showing in the dmesg is the first step, but later we have plan to expose keyID > > info via /sysfs. Of course, it's always arguable customer's such requirement is > > absolutely needed, but to me it's still a good thing to have code to detect TDX > > during boot. The code isn't complicated as you can see. > > > > > > > > > 2) As you can see, it can be used to handle ACPI CPU/memory hotplug and driver > > > > managed memory hotplug. Kexec() support patch also can use it. > > > > > > > > Particularly, in concept, ACPI CPU/memory hotplug is only related to whether TDX > > > > is enabled by BIOS, but not whether TDX module is loaded, or the result of > > > > initializing the TDX module. So I think we should have some code to detect TDX > > > > during boot. > > > > > > This is *EXACTLY* why our colleagues at Intel needs to tell us about > > > what the OS and firmware should do when TDX is in varying states of decay. > > > > Yes I am working on it to make it public. > > > > > > > > Does the mere presence of the TDX module prevent hotplug? > > > > > > > For ACPI CPU hotplug, yes. The TDX module even doesn't need to be loaded. > > Whether SEAMRR is enabled determines. > > > > For ACPI memory hotplug, in practice yes. For architectural behaviour, I'll > > work with others internally to get some public statement. > > > > > Or, if a > > > system has the TDX module loaded but no intent to ever use TDX, why > > > can't it just use hotplug like a normal system which is not addled with > > > the TDX albatross around its neck? > > > > I think if a machine has enabled TDX in the BIOS, the user of the machine very > > likely has intention to actually use TDX. > > > > Yes for driver-managed memory hotplug, it makes sense if user doesn't want to > > use TDX, it's better to not disable it. 
> > But to me it's also not a disaster if
> > we just disable driver-managed memory hotplug if TDX is enabled by BIOS.
>
> No, driver-managed memory hotplug is how Linux handles "dedicated
> memory" management.  The architecture needs to comprehend that end users
> may want to move address ranges into and out of Linux core-mm management
> independently of whether those address ranges are also covered by a SEAM
> range.

But to avoid GFP_TDX (and ZONE_TDX) stuff, we need to guarantee that all
memory pages in the page allocator are TDX pages.  To me it's at least
quite fair that the user needs to *choose* between using driver-managed
memory hotplug and using TDX.

If automatically disabling driver-managed memory hotplug on a platform
with TDX enabled in the BIOS isn't desired, how about we introduce a
kernel command line (i.e. use_tdx={on|off}) to let the user choose?  If
the user specifies use_tdx=on, then driver-managed memory hotplug cannot
be used; if use_tdx=off, then TDX cannot be used even if it is enabled by
BIOS.
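The command-line switch itself would only be a few lines.  A minimal
sketch, assuming a use_tdx_enabled flag that the hotplug paths and the TDX
module initialization code would both consult (all names hypothetical):

	#include <linux/init.h>
	#include <linux/string.h>

	/* Hypothetical use_tdx={on|off} switch; the default is a policy choice. */
	static bool use_tdx_enabled = true;

	static int __init use_tdx_setup(char *s)
	{
		if (!strcmp(s, "off"))
			use_tdx_enabled = false;
		else if (!strcmp(s, "on"))
			use_tdx_enabled = true;
		return 1;	/* option consumed */
	}
	__setup("use_tdx=", use_tdx_setup);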
On Tue, 2022-06-28 at 10:10 +1200, Kai Huang wrote: > On Mon, 2022-06-27 at 13:58 -0700, Dave Hansen wrote: > > On 6/26/22 22:23, Kai Huang wrote: > > > On Fri, 2022-06-24 at 11:38 -0700, Dave Hansen wrote: > > > > On 6/22/22 04:16, Kai Huang wrote: > > > > > SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when > > > > > CPU is not in VMX operation. The TDX_MODULE_CALL macro doesn't handle > > > > > SEAMCALL exceptions. Leave to the caller to guarantee those conditions > > > > > before calling __seamcall(). > > > > > > > > I was trying to make the argument earlier that you don't need *ANY* > > > > detection for TDX, other than the ability to make a SEAMCALL. > > > > Basically, patch 01/22 could go away. > > ... > > > > So what does patch 01/22 buy us? One EXTABLE entry? > > > > > > There are below pros if we can detect whether TDX is enabled by BIOS during boot > > > before initializing the TDX Module: > > > > > > 1) There are requirements from customers to report whether platform supports TDX > > > and the TDX keyID numbers before initializing the TDX module so the userspace > > > cloud software can use this information to do something. Sorry I cannot find > > > the lore link now. > > > > <sigh> > > > > Never listen to customers literally. It'll just lead you down the wrong > > path. They told you, "we need $FOO in dmesg" and you ran with it > > without understanding why. The fact that you even *need* to find the > > lore link is because you didn't bother to realize what they really needed. > > > > dmesg is not ABI. It's for humans. If you need data out of the kernel, > > do it with a *REAL* ABI. Not dmesg. > > Showing in the dmesg is the first step, but later we have plan to expose keyID > info via /sysfs. Of course, it's always arguable customer's such requirement is > absolutely needed, but to me it's still a good thing to have code to detect TDX > during boot. The code isn't complicated as you can see. > > > > > > 2) As you can see, it can be used to handle ACPI CPU/memory hotplug and driver > > > managed memory hotplug. Kexec() support patch also can use it. > > > > > > Particularly, in concept, ACPI CPU/memory hotplug is only related to whether TDX > > > is enabled by BIOS, but not whether TDX module is loaded, or the result of > > > initializing the TDX module. So I think we should have some code to detect TDX > > > during boot. > > > > This is *EXACTLY* why our colleagues at Intel needs to tell us about > > what the OS and firmware should do when TDX is in varying states of decay. > > Yes I am working on it to make it public. > > > > > Does the mere presence of the TDX module prevent hotplug? > > > > For ACPI CPU hotplug, yes. The TDX module even doesn't need to be loaded. > Whether SEAMRR is enabled determines. > > For ACPI memory hotplug, in practice yes. For architectural behaviour, I'll > work with others internally to get some public statement. > > > Or, if a > > system has the TDX module loaded but no intent to ever use TDX, why > > can't it just use hotplug like a normal system which is not addled with > > the TDX albatross around its neck? > > I think if a machine has enabled TDX in the BIOS, the user of the machine very > likely has intention to actually use TDX. > > Yes for driver-managed memory hotplug, it makes sense if user doesn't want to > use TDX, it's better to not disable it. But to me it's also not a disaster if > we just disable driver-managed memory hotplug if TDX is enabled by BIOS. 
>
> For ACPI memory hotplug, I think in practice we can treat it as BIOS bug, but
> I'll get some public statement around this.
>

Hi Dave,

Try to close on how to handle memory hotplug.  After discussion, below
will be architectural behaviour of TDX in terms of ACPI memory hotplug:

1) During platform boot, CMRs must be physically present.  MCHECK verifies
all CMRs are physically present and are actually TDX convertible memory.

2) CMRs are static after platform boots and don't change at runtime.  TDX
architecture doesn't support hot-add or hot-removal of CMR memory.

3) TDX architecture doesn't forbid non-CMR memory hotplug.

Also, although TDX doesn't trust BIOS in terms of security, a non-buggy
BIOS should prevent CMR memory from being hot-removed.  If kernel ever
receives such event, it's a BIOS bug, or even worse, the BIOS is
compromised and under attack.

As a result, the kernel should also never receive event of hot-add CMR
memory.  It is very much likely TDX is under attack (physical attack) in
such case, i.e. someone is trying to physically replace any CMR memory.

In terms of how to handle ACPI memory hotplug, my thinking is -- ideally,
if the kernel can get the CMRs during kernel boot when detecting whether
TDX is enabled by BIOS, we can do below:

- For memory hot-removal, if the removed memory falls into any CMR, then
kernel can speak loudly it is a BIOS bug.  But when this happens, the
hot-removal has been handled by BIOS thus kernel cannot actually prevent,
so kernel can either BUG(), or just print error message.  If the removed
memory doesn't fall into CMR, we do nothing.

- For memory hot-add, if the new memory falls into any CMR, then kernel
should speak loudly it is a BIOS bug, or even say "TDX is under attack" as
this is only possible when CMR memory has been previously hot-removed.
And kernel should reject the new memory for security reason.  If the new
memory doesn't fall into any CMR, then we (also) just reject the new
memory, as we want to guarantee all memory in page allocator are TDX
pages.  But this is basically due to kernel policy but not due to TDX
architecture.

BUT, since as the first step, we cannot get the CMR during kernel boot (as
it requires additional code to put CPU into VMX operation), I think for
now we can handle ACPI memory hotplug in below way:

- For memory hot-removal, we do nothing.

- For memory hot-add, we simply reject the new memory when TDX is enabled
by BIOS (see the sketch after this mail).  This not only prevents the
potential "physical attack of replacing any CMR memory", but also makes
sure no non-CMR memory will be added to page allocator during runtime via
ACPI memory hot-add.

We can improve this in next stage when we can get CMRs during kernel boot.

For the concern that on a TDX BIOS enabled system, people may not want to
use TDX at all but just use it as normal system, as I replied to Dan
regarding to the driver-managed memory hotplug, we can provide a kernel
commandline, i.e. use_tdx={on|off}, to allow user to *choose* between TDX
and memory hotplug.  When use_tdx=off, we continue to allow memory hotplug
and driver-managed hotplug as normal but refuse to initialize TDX module.

Any comments?
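A minimal sketch of the hot-add rejection described above, using the
generic memory notifier chain.  platform_tdx_enabled() is an assumed
boot-time detection helper, and vetoing MEM_GOING_ONLINE is one plausible
enforcement point, not necessarily the mechanism the series would use:

	#include <linux/memory.h>
	#include <linux/notifier.h>

	static bool platform_tdx_enabled(void);	/* assumed boot-time helper */

	static int tdx_memory_notifier(struct notifier_block *nb,
				       unsigned long action, void *v)
	{
		/*
		 * Veto onlining of memory while TDX is enabled by BIOS.
		 * Real code would exempt memory that was present at boot
		 * and only reject ranges hot-added afterwards.
		 */
		if (action == MEM_GOING_ONLINE && platform_tdx_enabled())
			return NOTIFY_BAD;

		return NOTIFY_OK;
	}

	static struct notifier_block tdx_memory_nb = {
		.notifier_call = tdx_memory_notifier,
	};

	static int __init tdx_hotplug_init(void)
	{
		return register_memory_notifier(&tdx_memory_nb);
	}
	early_initcall(tdx_hotplug_init);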
On 7/20/22 03:18, Kai Huang wrote:
> Try to close on how to handle memory hotplug.  After discussion, below will be
> architectural behaviour of TDX in terms of ACPI memory hotplug:
>
> 1) During platform boot, CMRs must be physically present.  MCHECK verifies all
> CMRs are physically present and are actually TDX convertible memory.

I doubt this is strictly true.  This makes it sound like MCHECK is doing
*ACTUAL* verification that the memory is, in practice, convertible.
That would mean actually writing to it, which would take a long time for
a large system.

Does it *ACTUALLY* verify this?

Also, it's very odd to say that "CMRs must be physically present".  A
CMR itself is a logical construct.  The physical memory *backing* a CMR
is something else entirely.

> 2) CMRs are static after platform boots and don't change at runtime.  TDX
> architecture doesn't support hot-add or hot-removal of CMR memory.
> 3) TDX architecture doesn't forbid non-CMR memory hotplug.
>
> Also, although TDX doesn't trust BIOS in terms of security, a non-buggy BIOS
> should prevent CMR memory from being hot-removed.  If kernel ever receives such
> event, it's a BIOS bug, or even worse, the BIOS is compromised and under attack.
>
> As a result, the kernel should also never receive event of hot-add CMR memory.
> It is very much likely TDX is under attack (physical attack) in such case, i.e.
> someone is trying to physically replace any CMR memory.
>
> In terms of how to handle ACPI memory hotplug, my thinking is -- ideally, if the
> kernel can get the CMRs during kernel boot when detecting whether TDX is enabled
> by BIOS, we can do below:
>
> - For memory hot-removal, if the removed memory falls into any CMR, then kernel
> can speak loudly it is a BIOS bug.  But when this happens, the hot-removal has
> been handled by BIOS thus kernel cannot actually prevent, so kernel can either
> BUG(), or just print error message.  If the removed memory doesn't fall into
> CMR, we do nothing.

Hold on a sec.  Hot-removal is a two-step process.  The kernel *MUST*
know in advance that the removal is going to occur.  It follows that up
with evacuating the memory, giving the "all clear", then the actual
physical removal can occur.

I'm not sure what you're getting at with the "kernel cannot actually
prevent" bit.  No sane system actively destroys perfectly good memory
content and tells the kernel about it after the fact.

> - For memory hot-add, if the new memory falls into any CMR, then kernel should
> speak loudly it is a BIOS bug, or even say "TDX is under attack" as this is only
> possible when CMR memory has been previously hot-removed.

I don't think this is strictly true.  It's totally possible to get a
hot-add *event* for memory which is in a CMR.  It would be another BIOS
bug, of course, but hot-remove is not a prerequisite purely for an event.

> And kernel should
> reject the new memory for security reason.  If the new memory doesn't fall into
> any CMR, then we (also) just reject the new memory, as we want to guarantee all
> memory in page allocator are TDX pages.  But this is basically due to kernel
> policy but not due to TDX architecture.

Agreed.

> BUT, since as the first step, we cannot get the CMR during kernel boot (as it
> requires additional code to put CPU into VMX operation), I think for now we can
> handle ACPI memory hotplug in below way:
>
> - For memory hot-removal, we do nothing.

This doesn't seem right to me.  *If* we get a known-bogus hot-remove
event, we need to reject it.  Remember, removal is a two-step process.

> - For memory hot-add, we simply reject the new memory when TDX is enabled by
> BIOS.  This not only prevents the potential "physical attack of replacing any
> CMR memory",

I don't think there's *any* meaningful attack mitigation here.  Even if
someone managed to replace the physical address space that backed some
private memory, the integrity checksums won't match.  Memory integrity
mitigates physical replacement, not software.

> but also makes sure no non-CMR memory will be added to page
> allocator during runtime via ACPI memory hot-add.

Agreed.  This one _is_ important and since it supports an existing
policy, it makes sense to enforce this in the kernel.

> We can improve this in next stage when we can get CMRs during kernel boot.
>
> For the concern that on a TDX BIOS enabled system, people may not want to use
> TDX at all but just use it as normal system, as I replied to Dan regarding to
> the driver-managed memory hotplug, we can provide a kernel commandline, i.e.
> use_tdx={on|off}, to allow user to *choose* between TDX and memory hotplug.
> When use_tdx=off, we continue to allow memory hotplug and driver-managed hotplug
> as normal but refuse to initialize TDX module.

That doesn't sound like a good resolution to me.

It conflates pure "software" hotplug operations like transitioning
memory ownership from the core mm to a driver (like device DAX).

TDX should not have *ANY* impact on purely software operations.  Period.
On Wed, 2022-07-20 at 09:48 -0700, Dave Hansen wrote: > On 7/20/22 03:18, Kai Huang wrote: > > Try to close on how to handle memory hotplug. After discussion, below will be > > architectural behaviour of TDX in terms of ACPI memory hotplug: > > > > 1) During platform boot, CMRs must be physically present. MCHECK verifies all > > CMRs are physically present and are actually TDX convertible memory. > > I doubt this is strictly true. This makes it sound like MCHECK is doing > *ACTUAL* verification that the memory is, in practice, convertible. > That would mean actually writing to it, which would take a long time for > a large system. The "verify" is used in many places in the specs. In the public TDX module spec, it is also used: Table 1.1: Intel TDX Glossary CMR: A range of physical memory configured by BIOS and verified by MCHECK I guess to verify, MCHECK doesn't need to actually write something to memory. For example, when memory is present, it can know what type it is so it can determine. > > Does it *ACTUALLY* verify this? Yes. This is what spec says. And this is what Intel colleagues said. > > Also, it's very odd to say that "CMRs must be physically present". A > CMR itself is a logical construct. The physical memory *backing* a CMR > is, something else entirely. OK. But I think it is easy to interpret this actually means the physical memory *backing* a CMR. > > > 2) CMRs are static after platform boots and don't change at runtime. TDX > > architecture doesn't support hot-add or hot-removal of CMR memory. > > 3) TDX architecture doesn't forbid non-CMR memory hotplug. > > > > Also, although TDX doesn't trust BIOS in terms of security, a non-buggy BIOS > > should prevent CMR memory from being hot-removed. If kernel ever receives such > > event, it's a BIOS bug, or even worse, the BIOS is compromised and under attack. > > > > As a result, the kernel should also never receive event of hot-add CMR memory. > > It is very much likely TDX is under attack (physical attack) in such case, i.e. > > someone is trying to physically replace any CMR memory. > > > > In terms of how to handle ACPI memory hotplug, my thinking is -- ideally, if the > > kernel can get the CMRs during kernel boot when detecting whether TDX is enabled > > by BIOS, we can do below: > > > > - For memory hot-removal, if the removed memory falls into any CMR, then kernel > > can speak loudly it is a BIOS bug. But when this happens, the hot-removal has > > been handled by BIOS thus kernel cannot actually prevent, so kernel can either > > BUG(), or just print error message. If the removed memory doesn't fall into > > CMR, we do nothing. > > Hold on a sec. Hot-removal is a two-step process. The kernel *MUST* > know in advance that the removal is going to occur. It follows that up > with evacuating the memory, giving the "all clear", then the actual > physical removal can occur. After looking more, looks "the hot-removal has been handled by BIOS" is wrong. And you are right there's a previous step must be done (it is device offline). But the "kernel cannot actually prevent" means in the device removal callback, the kernel cannot prevent it from being removed. This is my understanding by reading the ACPI spec and the code: Firstly, the BIOS will send a "Eject Request" notification to the kernel. Upon receiving this event, the kernel will firstly try to offline the device (which can fail due to -EBUSY, etc). If offline is successful, the kernel will call device's remove callback to remove the device. 
But this remove callback doesn't return error code (which means it doesn't fail). Instead, after the remove callback is done, the kernel calls _EJ0 ACPI method to actually do the ejection. > > I'm not sure what you're getting at with the "kernel cannot actually > prevent" bit. No sane system actively destroys perfect good memory > content and tells the kernel about it after the fact. The kernel will offline the device first. This guarantees all good memory content has been migrated. > > > - For memory hot-add, if the new memory falls into any CMR, then kernel should > > speak loudly it is a BIOS bug, or even say "TDX is under attack" as this is only > > possible when CMR memory has been previously hot-removed. > > I don't think this is strictly true. It's totally possible to get a > hot-add *event* for memory which is in a CMR. It would be another BIOS > bug, of course, but hot-remove is not a prerequisite purely for an event. OK. > > > And kernel should > > reject the new memory for security reason. If the new memory doesn't fall into > > any CMR, then we (also) just reject the new memory, as we want to guarantee all > > memory in page allocator are TDX pages. But this is basically due to kernel > > policy but not due to TDX architecture. > > Agreed. > > > BUT, since as the first step, we cannot get the CMR during kernel boot (as it > > requires additional code to put CPU into VMX operation), I think for now we can > > handle ACPI memory hotplug in below way: > > > > - For memory hot-removal, we do nothing. > > This doesn't seem right to me. *If* we get a known-bogus hot-remove > event, we need to reject it. Remember, removal is a two-step process. If so, we need to reject the (CMR) memory offline. Or we just BUG() in the ACPI memory removal callback? But either way this will requires us to get the CMRs during kernel boot. Do you think we need to add this support in the first series? > > > - For memory hot-add, we simply reject the new memory when TDX is enabled by > > BIOS. This not only prevents the potential "physical attack of replacing any > > CMR memory", > > I don't think there's *any* meaningful attack mitigation here. Even if > someone managed to replace the physical address space that backed some > private memory, the integrity checksums won't match. Memory integrity > mitigates physical replacement, not software. My thinking is rejecting the new memory is a more aggressive defence than waiting until integrity checksum failure. Btw, the integrity checksum support isn't a mandatory requirement for TDX architecture. In fact, TDX also supports a mode which doesn't require integrity check (for instance, TDX on client machines). > > > but also makes sure no non-CMR memory will be added to page > > allocator during runtime via ACPI memory hot-add. > > Agreed. This one _is_ important and since it supports an existing > policy, it makes sense to enforce this in the kernel. > > > We can improve this in next stage when we can get CMRs during kernel boot. > > > > For the concern that on a TDX BIOS enabled system, people may not want to use > > TDX at all but just use it as normal system, as I replied to Dan regarding to > > the driver-managed memory hotplug, we can provide a kernel commandline, i.e. > > use_tdx={on|off}, to allow user to *choose* between TDX and memory hotplug. > > When use_tdx=off, we continue to allow memory hotplug and driver-managed hotplug > > as normal but refuse to initialize TDX module. > > That doesn't sound like a good resolution to me. 
> > It conflates pure "software" hotplug operations like transitioning > memory ownership from the core mm to a driver (like device DAX). > > TDX should not have *ANY* impact on purely software operations. Period. The hard requirement is: Once TDX module gets initialized, we cannot add any *new* memory to core-mm. But if some memory block is included to TDX memory when the module gets initialized, then we should be able to move it from core-mm to driver or vice versa. In this case, we can select all memory regions that the kernel wants to use as TDX memory at some point (during kernel boot I guess). Adding any non-selected-TDX memory regions to core-mm should always be rejected (therefore there's no removal of them from core-mm either), although it is "software" hotplug. If user wants this, he/she cannot use TDX. This is what I mean we can provide command line to allow user to *choose*. Also, if I understand correctly above, your suggestion is we want to prevent any CMR memory going offline so it won't be hot-removed (assuming we can get CMRs during boot). This looks contradicts to the requirement of being able to allow moving memory from core-mm to driver. When we offline the memory, we cannot know whether the memory will be used by driver, or later hot-removed.
On Thu, 2022-07-21 at 13:52 +1200, Kai Huang wrote:
> > > BUT, since as the first step, we cannot get the CMR during kernel boot (as it
> > > requires additional code to put CPU into VMX operation), I think for now we can
> > > handle ACPI memory hotplug in below way:
> > >
> > > - For memory hot-removal, we do nothing.
> >
> > This doesn't seem right to me.  *If* we get a known-bogus hot-remove
> > event, we need to reject it.  Remember, removal is a two-step process.
>
> If so, we need to reject the (CMR) memory offline.  Or we just BUG() in the
> ACPI memory removal callback?
>
> But either way this will requires us to get the CMRs during kernel boot.
>
> Do you think we need to add this support in the first series?

Hi Dave,

In terms of whether we should get CMRs during kernel boot (which requires
we do VMXON/VMXOFF during kernel boot around SEAMCALL), I forgot one thing:

Technically, ACPI memory hotplug is related to whether TDX is enabled in
BIOS, but not related to whether the TDX module is loaded or not.

By doing VMXON/VMXOFF, we can get CMRs during kernel boot by calling the
P-SEAMLDR's SEAMCALL.  But theoretically, from the TDX architecture's
point of view, the P-SEAMLDR may not be loaded even if TDX is enabled by
BIOS (in practice, the P-SEAMLDR is always loaded by BIOS when TDX is
enabled), in which case there's no way we can get CMRs.  But in this case,
I think we can just treat TDX as not enabled by BIOS, as the kernel should
never try to load the P-SEAMLDR itself.

Other advantages of being able to do VMXON/VMXOFF and getting CMRs during
kernel boot:

1) We can just shut down the TDX module in kexec();

2) We can choose to trim any non-CMR memory out of memblock.memory instead
of having to manually verify all memory regions in memblock are CMR memory
(see the sketch below).

Comments?
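For point 2), a rough sketch of the memblock trimming, assuming the CMRs
were already retrieved into a boot-time array (cmr_array, nr_cmrs and
range_within_cmr() are all hypothetical names):

	#include <linux/memblock.h>

	struct cmr {
		phys_addr_t base;
		phys_addr_t size;
	};

	/* Assumed: filled from the P-SEAMLDR's info during early boot. */
	static struct cmr cmr_array[32] __initdata;
	static int nr_cmrs __initdata;

	static bool __init range_within_cmr(phys_addr_t start, phys_addr_t end)
	{
		int i;

		for (i = 0; i < nr_cmrs; i++) {
			if (start >= cmr_array[i].base &&
			    end <= cmr_array[i].base + cmr_array[i].size)
				return true;
		}
		return false;
	}

	/*
	 * Trim any non-CMR memory out of memblock.memory.  Simplified:
	 * real code would have to be careful about mutating memblock
	 * while iterating over it.
	 */
	static void __init trim_non_cmr_memory(void)
	{
		phys_addr_t start, end;
		u64 i;

		for_each_mem_range(i, &start, &end)
			if (!range_within_cmr(start, end))
				memblock_remove(start, end - start);
	}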
On 7/26/22 17:34, Kai Huang wrote:
>> This doesn't seem right to me.  *If* we get a known-bogus
>> hot-remove event, we need to reject it.  Remember, removal is a
>> two-step process.
> If so, we need to reject the (CMR) memory offline.  Or we just BUG()
> in the ACPI memory removal callback?
>
> But either way this will requires us to get the CMRs during kernel boot.

I don't get the link there between CMRs at boot and handling hotplug.

We don't need to go to extreme measures just to get a message out of the
kernel that the BIOS is bad.  If we don't have the data to do it
already, then I don't really see the need to warn about it.

Think of a system that has TDX enabled in the BIOS, but is running an
old kernel.  It will have *ZERO* idea that hotplug doesn't work.  It'll
run blissfully along.  I don't see any reason that a kernel with TDX
support, but where TDX is disabled, should actively go out and try to be
better than those old pre-TDX kernels.

Further, there's nothing to stop non-CMR memory from being added to a
system with TDX enabled in the BIOS but where the kernel is not using
it.  If we actively go out and keep good old DRAM from being added, then
we unnecessarily addle those systems.
On Tue, 2022-07-26 at 17:50 -0700, Dave Hansen wrote:
> On 7/26/22 17:34, Kai Huang wrote:
> > > This doesn't seem right to me.  *If* we get a known-bogus
> > > hot-remove event, we need to reject it.  Remember, removal is a
> > > two-step process.
> > If so, we need to reject the (CMR) memory offline.  Or we just BUG()
> > in the ACPI memory removal callback?
> >
> > But either way this will requires us to get the CMRs during kernel boot.
>
> I don't get the link there between CMRs at boot and handling hotplug.
>
> We don't need to go to extreme measures just to get a message out of the
> kernel that the BIOS is bad.  If we don't have the data to do it
> already, then I don't really see the need to warn about it.
>
> Think of a system that has TDX enabled in the BIOS, but is running an
> old kernel.  It will have *ZERO* idea that hotplug doesn't work.  It'll
> run blissfully along.  I don't see any reason that a kernel with TDX
> support, but where TDX is disabled, should actively go out and try to be
> better than those old pre-TDX kernels.

Agreed, assuming by "where TDX is disabled" you mean TDX isn't usable
(i.e. when the TDX module isn't loaded, or won't be initialized at all).

> Further, there's nothing to stop non-CMR memory from being added to a
> system with TDX enabled in the BIOS but where the kernel is not using
> it.  If we actively go out and keep good old DRAM from being added, then
> we unnecessarily addle those systems.

OK.  Then for memory hot-add, perhaps we can just go with the
"winner-take-all" approach you mentioned before?

For memory hot-removal, as I replied previously, it looks like the kernel
cannot reject the removal if it allows the memory to go offline.  Any
suggestions on this?
On Thu, 2022-07-21 at 13:52 +1200, Kai Huang wrote:
> Also, if I understand correctly above, your suggestion is we want to prevent any
> CMR memory going offline so it won't be hot-removed (assuming we can get CMRs
> during boot).  This looks contradicts to the requirement of being able to allow
> moving memory from core-mm to driver.  When we offline the memory, we cannot
> know whether the memory will be used by driver, or later hot-removed.

Hi Dave,

The high level flow of device hot-removal is:

  acpi_scan_hot_remove()
    -> acpi_scan_try_to_offline()
      -> acpi_bus_offline()
        -> device_offline()
          -> memory_subsys_offline()
    -> acpi_bus_trim()
      -> acpi_memory_device_remove()

And memory_subsys_offline() can also be triggered via /sysfs:

	echo 0 > /sys/devices/system/memory/memory30/online

After the memory block is offline, my understanding is kernel can
theoretically move it to, i.e. ZONE_DEVICE via memremap_pages().

As you can see memory_subsys_offline() is the entry point of memory device
offline (before it the code is generic for all ACPI device), and it cannot
distinguish whether the removal is from ACPI event, or from /sysfs, so it
seems we are unable to refuse to offline memory in memory_subsys_offline()
when it is called from ACPI event.

Any comments?
On 8/2/22 19:37, Kai Huang wrote:
> On Thu, 2022-07-21 at 13:52 +1200, Kai Huang wrote:
>> Also, if I understand correctly above, your suggestion is we want to prevent any
>> CMR memory going offline so it won't be hot-removed (assuming we can get CMRs
>> during boot).  This looks contradicts to the requirement of being able to allow
>> moving memory from core-mm to driver.  When we offline the memory, we cannot
>> know whether the memory will be used by driver, or later hot-removed.
> Hi Dave,
>
> The high level flow of device hot-removal is:
>
>   acpi_scan_hot_remove()
>     -> acpi_scan_try_to_offline()
>       -> acpi_bus_offline()
>         -> device_offline()
>           -> memory_subsys_offline()
>     -> acpi_bus_trim()
>       -> acpi_memory_device_remove()
>
> And memory_subsys_offline() can also be triggered via /sysfs:
>
> 	echo 0 > /sys/devices/system/memory/memory30/online
>
> After the memory block is offline, my understanding is kernel can theoretically
> move it to, i.e. ZONE_DEVICE via memremap_pages().
>
> As you can see memory_subsys_offline() is the entry point of memory device
> offline (before it the code is generic for all ACPI device), and it cannot
> distinguish whether the removal is from ACPI event, or from /sysfs, so it seems
> we are unable to refuse to offline memory in memory_subsys_offline() when it is
> called from ACPI event.
>
> Any comments?

I suggest refactoring the code in a way that makes it possible to
distinguish the two cases.

It's not like you have some binary kernel.  You have the source code for
the whole thing and can propose changes *ANYWHERE* you need.  Even better:

$ grep -A2 ^ACPI\$ MAINTAINERS
ACPI
M:	"Rafael J. Wysocki" <rafael@kernel.org>
R:	Len Brown <lenb@kernel.org>

The maintainer of ACPI works for our employer.  Plus, he's a nice
helpful guy that you can go ask how you might refactor this or
approaches you might take.  Have you talked to Rafael about this issue?

Also, from a two-minute grepping session, I noticed this:

> static acpi_status acpi_bus_offline(acpi_handle handle, u32 lvl, void *data,
>                                     void **ret_p)
> {
...
>         if (device->handler && !device->handler->hotplug.enabled) {
>                 *ret_p = &device->dev;
>                 return AE_SUPPORT;
>         }

It looks to me like if you simply set:

	memory_device_handler->hotplug.enabled = false;

you'll get most of the behavior you want.  ACPI memory hotplug would not
work and the changes would be confined to the ACPI world.  The
"lower-level" bus-based hotplug would be unaffected.

Now, I don't know what kind of locking would be needed to muck with a
global structure like that.  But, it's a start.
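As a concrete (but hypothetical) rendering of that suggestion: assuming a
boot-time platform_tdx_enabled() helper, the flip could live next to
memory_device_handler in drivers/acpi/acpi_memhotplug.c.  This is a sketch
only; as Dave notes, the locking question is unresolved:

	/*
	 * Sketch: make acpi_bus_offline() return AE_SUPPORT for memory
	 * devices by clearing the handler's hotplug profile.  The
	 * device_hotplug lock here may or may not be sufficient.
	 */
	static void __init acpi_memhotplug_check_tdx(void)
	{
		if (!platform_tdx_enabled())	/* assumed helper */
			return;

		lock_device_hotplug();
		memory_device_handler.hotplug.enabled = false;
		unlock_device_hotplug();

		pr_info("ACPI memory hotplug disabled: TDX enabled by BIOS\n");
	}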
On Wed, 2022-08-03 at 07:20 -0700, Dave Hansen wrote: > On 8/2/22 19:37, Kai Huang wrote: > > On Thu, 2022-07-21 at 13:52 +1200, Kai Huang wrote: > > > Also, if I understand correctly above, your suggestion is we want to prevent any > > > CMR memory going offline so it won't be hot-removed (assuming we can get CMRs > > > during boot). This looks contradicts to the requirement of being able to allow > > > moving memory from core-mm to driver. When we offline the memory, we cannot > > > know whether the memory will be used by driver, or later hot-removed. > > Hi Dave, > > > > The high level flow of device hot-removal is: > > > > acpi_scan_hot_remove() > > -> acpi_scan_try_to_offline() > > -> acpi_bus_offline() > > -> device_offline() > > -> memory_subsys_offline() > > -> acpi_bus_trim() > > -> acpi_memory_device_remove() > > > > > > And memory_subsys_offline() can also be triggered via /sysfs: > > > > echo 0 > /sys/devices/system/memory/memory30/online > > > > After the memory block is offline, my understanding is kernel can theoretically > > move it to, i.e. ZONE_DEVICE via memremap_pages(). > > > > As you can see memory_subsys_offline() is the entry point of memory device > > offline (before it the code is generic for all ACPI device), and it cannot > > distinguish whether the removal is from ACPI event, or from /sysfs, so it seems > > we are unable to refuse to offline memory in memory_subsys_offline() when it is > > called from ACPI event. > > > > Any comments? > > I suggest refactoring the code in a way that makes it possible to > distinguish the two cases. > > It's not like you have some binary kernel. You have the source code for > the whole thing and can propose changes *ANYWHERE* you need. Even better: > > $ grep -A2 ^ACPI\$ MAINTAINERS > ACPI > M: "Rafael J. Wysocki" <rafael@kernel.org> > R: Len Brown <lenb@kernel.org> > > The maintainer of ACPI works for our employer. Plus, he's a nice > helpful guy that you can go ask how you might refactor this or > approaches you might take. Have you talked to Rafael about this issue? Rafael once also suggested to set hotplug.enabled to 0 as your code shows below, but we just got the TDX architecture behaviour of memory hotplug clarified from Intel TDX guys recently. > Also, from a two-minute grepping session, I noticed this: > > > static acpi_status acpi_bus_offline(acpi_handle handle, u32 lvl, void *data, > > void **ret_p) > > { > ... > > if (device->handler && !device->handler->hotplug.enabled) { > > *ret_p = &device->dev; > > return AE_SUPPORT; > > } > > It looks to me like if you simply set: > > memory_device_handler->hotplug.enabled = false; > > you'll get most of the behavior you want. ACPI memory hotplug would not > work and the changes would be confined to the ACPI world. The > "lower-level" bus-based hotplug would be unaffected. > > Now, I don't know what kind of locking would be needed to muck with a > global structure like that. But, it's a start. This has two problems: 1) This approach cannot distinguish non-CMR memory hotplug and CMR memory hotplug, as it disables ACPI memory hotplug for all. But this is fine as we want to reject non-CMR memory hotplug anyway. We just need to explain clearly in changelog. 2) This won't allow the kernel to speak out "BIOS bug" when CMR memory hotplug actually happens. Instead, we can only print out "hotplug is disabled due to TDX is enabled by BIOS." when we set hotplug.enable to false. Assuming above is OK, I'll explore this option. 
I'll also do some research to see if it's still possible to speak out
"BIOS bug" in this approach, but it's not a mandatory requirement to me
now.

Also, if printing out "BIOS bug" for CMR memory hotplug isn't mandatory,
then we can just detect TDX during kernel boot, and disable hotplug when
TDX is enabled by BIOS, but don't need to use the "winner-take-all"
approach.  The former is clearer and easier to implement.  I'll go with
the former approach if I don't hear an objection from you.

And ACPI CPU hotplug can also be handled the same way.

Please let me know any comments.  Thanks!
On Thu, 2022-08-04 at 10:35 +1200, Kai Huang wrote: > On Wed, 2022-08-03 at 07:20 -0700, Dave Hansen wrote: > > On 8/2/22 19:37, Kai Huang wrote: > > > On Thu, 2022-07-21 at 13:52 +1200, Kai Huang wrote: > > > > Also, if I understand correctly above, your suggestion is we want to prevent any > > > > CMR memory going offline so it won't be hot-removed (assuming we can get CMRs > > > > during boot). This looks contradicts to the requirement of being able to allow > > > > moving memory from core-mm to driver. When we offline the memory, we cannot > > > > know whether the memory will be used by driver, or later hot-removed. > > > Hi Dave, > > > > > > The high level flow of device hot-removal is: > > > > > > acpi_scan_hot_remove() > > > -> acpi_scan_try_to_offline() > > > -> acpi_bus_offline() > > > -> device_offline() > > > -> memory_subsys_offline() > > > -> acpi_bus_trim() > > > -> acpi_memory_device_remove() > > > > > > > > > And memory_subsys_offline() can also be triggered via /sysfs: > > > > > > echo 0 > /sys/devices/system/memory/memory30/online > > > > > > After the memory block is offline, my understanding is kernel can theoretically > > > move it to, i.e. ZONE_DEVICE via memremap_pages(). > > > > > > As you can see memory_subsys_offline() is the entry point of memory device > > > offline (before it the code is generic for all ACPI device), and it cannot > > > distinguish whether the removal is from ACPI event, or from /sysfs, so it seems > > > we are unable to refuse to offline memory in memory_subsys_offline() when it is > > > called from ACPI event. > > > > > > Any comments? > > > > I suggest refactoring the code in a way that makes it possible to > > distinguish the two cases. > > > > It's not like you have some binary kernel. You have the source code for > > the whole thing and can propose changes *ANYWHERE* you need. Even better: > > > > $ grep -A2 ^ACPI\$ MAINTAINERS > > ACPI > > M: "Rafael J. Wysocki" <rafael@kernel.org> > > R: Len Brown <lenb@kernel.org> > > > > The maintainer of ACPI works for our employer. Plus, he's a nice > > helpful guy that you can go ask how you might refactor this or > > approaches you might take. Have you talked to Rafael about this issue? > > Rafael once also suggested to set hotplug.enabled to 0 as your code shows below, > but we just got the TDX architecture behaviour of memory hotplug clarified from > Intel TDX guys recently. > > > Also, from a two-minute grepping session, I noticed this: > > > > > static acpi_status acpi_bus_offline(acpi_handle handle, u32 lvl, void *data, > > > void **ret_p) > > > { > > ... > > > if (device->handler && !device->handler->hotplug.enabled) { > > > *ret_p = &device->dev; > > > return AE_SUPPORT; > > > } > > > > It looks to me like if you simply set: > > > > memory_device_handler->hotplug.enabled = false; > > > > you'll get most of the behavior you want. ACPI memory hotplug would not > > work and the changes would be confined to the ACPI world. The > > "lower-level" bus-based hotplug would be unaffected. > > > > Now, I don't know what kind of locking would be needed to muck with a > > global structure like that. But, it's a start. > > This has two problems: > > 1) This approach cannot distinguish non-CMR memory hotplug and CMR memory > hotplug, as it disables ACPI memory hotplug for all. But this is fine as we > want to reject non-CMR memory hotplug anyway. We just need to explain clearly > in changelog. > > 2) This won't allow the kernel to speak out "BIOS bug" when CMR memory hotplug > actually happens. 
> Instead, we can only print out "hotplug is disabled due to
> TDX is enabled by BIOS." when we set hotplug.enable to false.
>
> Assuming above is OK, I'll explore this option.  I'll also do some research to
> see if it's still possible to speak out "BIOS bug" in this approach but it's not
> a mandatory requirement to me now.
>
> Also, if print out "BIOS bug" for CMR memory hotplug isn't mandatory, then we
> can just detect TDX during kernel boot, and disable hotplug when TDX is enabled
> by BIOS, but don't need to use "winner-take-all" approach.  The former is
> clearer and easier to implement.  I'll go with the former approach if I don't
> hear objection from you.
>
> And ACPI CPU hotplug can also use the same way.
>
> Please let me know any comments.  Thanks!

One more reason why the "winner-take-all" approach doesn't work:

If we allow ACPI memory hotplug to happen but choose to disable it in the
handler using "winner-take-all", then at the beginning the ACPI code will
actually create a /sysfs entry for hotplug.enabled to allow userspace to
change it:

	/sys/firmware/acpi/hotplug/memory/enabled

Which means even if we set hotplug.enabled to false at some point,
userspace can turn it on again.  The only way is to not create this /sysfs
entry at the beginning.  With the "winner-take-all" approach, I don't
think we should avoid creating the /sysfs entry.  Nor should we introduce
an arch-specific hook to, i.e., prevent the /sysfs entry from being
changed by userspace.

So instead of the "winner-take-all" approach, I'll introduce a new kernel
command line option to allow the user to choose between ACPI CPU/memory
hotplug and TDX.  This command line option should not impact "software"
CPU/memory hotplug even when the user chooses to use TDX.  In this case,
this is similar to "winner-take-all" anyway.
diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
index 1bd688684716..fd577619620e 100644
--- a/arch/x86/virt/vmx/tdx/Makefile
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -1,2 +1,2 @@
 # SPDX-License-Identifier: GPL-2.0-only
-obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx.o
+obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx.o seamcall.o
diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
new file mode 100644
index 000000000000..f322427e48c3
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/seamcall.S
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/linkage.h>
+#include <asm/frame.h>
+
+#include "tdxcall.S"
+
+/*
+ * __seamcall() - Host-side interface functions to SEAM software module
+ *		  (the P-SEAMLDR or the TDX module).
+ *
+ * Transform function call register arguments into the SEAMCALL register
+ * ABI.  Return TDX_SEAMCALL_VMFAILINVALID if the SEAMCALL itself fails,
+ * or the completion status of the SEAMCALL leaf function.  Additional
+ * output operands are saved in @out (if it is provided by caller).
+ *
+ *-------------------------------------------------------------------------
+ * SEAMCALL ABI:
+ *-------------------------------------------------------------------------
+ * Input Registers:
+ *
+ * RAX                 - SEAMCALL Leaf number.
+ * RCX,RDX,R8-R9       - SEAMCALL Leaf specific input registers.
+ *
+ * Output Registers:
+ *
+ * RAX                 - SEAMCALL completion status code.
+ * RCX,RDX,R8-R11      - SEAMCALL Leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __seamcall() function ABI:
+ *
+ * @fn  (RDI)          - SEAMCALL Leaf number, moved to RAX
+ * @rcx (RSI)          - Input parameter 1, moved to RCX
+ * @rdx (RDX)          - Input parameter 2, moved to RDX
+ * @r8  (RCX)          - Input parameter 3, moved to R8
+ * @r9  (R8)           - Input parameter 4, moved to R9
+ *
+ * @out (R9)           - struct tdx_module_output pointer
+ *                       stored temporarily in R12 (not
+ *                       used by the P-SEAMLDR or the TDX
+ *                       module).  It can be NULL.
+ *
+ * Return (via RAX) the completion status of the SEAMCALL, or
+ * TDX_SEAMCALL_VMFAILINVALID.
+ */
+SYM_FUNC_START(__seamcall)
+	FRAME_BEGIN
+	TDX_MODULE_CALL host=1
+	FRAME_END
+	RET
+SYM_FUNC_END(__seamcall)
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index f16055cc25f4..f1a2dfb978b1 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -2,6 +2,7 @@
 #ifndef _X86_VIRT_TDX_H
 #define _X86_VIRT_TDX_H
 
+#include <linux/types.h>
 #include <linux/bits.h>
 
 /*
@@ -44,4 +45,14 @@
 	((u32)(((_keyid_part) & 0xffffffffull) + 1))
 #define TDX_KEYID_NUM(_keyid_part)	((u32)((_keyid_part) >> 32))
 
+
+/*
+ * Do not put any hardware-defined TDX structure representations below this
+ * comment!
+ */
+
+struct tdx_module_output;
+u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+	       struct tdx_module_output *out);
+
 #endif
TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM).  This
mode runs only the TDX module itself or other code to load the TDX
module.

The host kernel communicates with SEAM software via a new SEAMCALL
instruction.  This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead.

The TDX module defines SEAMCALL leaf functions to allow the host to
initialize it, and to create and run protected VMs.  SEAMCALL leaf
functions use an ABI different from the x86-64 system-v ABI.  Instead,
they share the same ABI with the TDCALL leaf functions.

Implement a function __seamcall() to allow the host to make SEAMCALL
to SEAM software using the TDX_MODULE_CALL macro which is the common
assembly for both SEAMCALL and TDCALL.

SEAMCALL instruction causes #GP when SEAMRR isn't enabled, and #UD when
CPU is not in VMX operation.  The TDX_MODULE_CALL macro doesn't handle
SEAMCALL exceptions.  Leave to the caller to guarantee those conditions
before calling __seamcall().

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

- v3 -> v5 (no feedback on v4):
 - Explicitly tell TDX_SEAMCALL_VMFAILINVALID is returned if the SEAMCALL
   itself fails.
 - Improve the changelog.

---
 arch/x86/virt/vmx/tdx/Makefile   |  2 +-
 arch/x86/virt/vmx/tdx/seamcall.S | 52 ++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h      | 11 +++++++
 3 files changed, 64 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
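To make the documented ABI concrete, a hedged usage example follows; the
leaf number is a placeholder, since leaf definitions come from the TDX
module/P-SEAMLDR specifications and later patches in the series:

	#include "tdx.h"

	#define EXAMPLE_SEAMCALL_LEAF	0	/* hypothetical leaf number */

	static u64 seamcall_example(void)
	{
		struct tdx_module_output out;

		/*
		 * @fn lands in RAX, the four inputs in RCX/RDX/R8/R9; any
		 * leaf-specific outputs come back through @out.  Per the
		 * changelog, the caller must ensure SEAMRR is enabled and
		 * the CPU is in VMX operation before calling this.
		 */
		return __seamcall(EXAMPLE_SEAMCALL_LEAF, 0, 0, 0, 0, &out);
	}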