From patchwork Fri Dec 9 06:52:21 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang
X-Patchwork-Id: 13069294
From: Kai Huang
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: linux-mm@kvack.org, dave.hansen@intel.com, peterz@infradead.org,
 tglx@linutronix.de, seanjc@google.com, pbonzini@redhat.com,
 dan.j.williams@intel.com, rafael.j.wysocki@intel.com,
 kirill.shutemov@linux.intel.com, ying.huang@intel.com,
 reinette.chatre@intel.com, len.brown@intel.com, tony.luck@intel.com,
 ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com,
 sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com,
 sagis@google.com, imammedo@redhat.com, kai.huang@intel.com
Subject: [PATCH v8 00/16] TDX host kernel support
Date: Fri, 9 Dec 2022 19:52:21 +1300
X-Mailer: git-send-email 2.38.1

Intel Trust Domain Extensions (TDX) protects guest VMs from a malicious
host and certain physical attacks.  TDX specs are available in [1].

This series is the initial support to enable TDX with minimal code to
allow KVM to create and run TDX guests.  KVM support for TDX is being
developed separately[2].  A new "userspace inaccessible memfd" approach
to support TDX private memory is also being developed[3].  KVM will only
support the new "userspace inaccessible memfd" as TDX guest memory.

This series doesn't aim to support all functionalities, and doesn't aim
to resolve all things perfectly.  For example, memory hotplug is handled
in a simple way (please refer to the "Kernel policy on TDX memory" and
"Memory hotplug" sections below).  Huang, Ying is working on a series to
improve this and will send it out separately.

(For memory hotplug, sorry for broadcasting widely, but I cc'ed
linux-mm@kvack.org following Kirill's suggestion so MM experts can also
help to provide comments.)

Also, TDX module metadata allocation just uses alloc_contig_pages() to
allocate a large chunk at runtime, so it can fail.  It is imperfect now
but _will_ be improved in the future.

Also, the patch to add the new kernel command line tdx="force" isn't
included in this initial version, as Dave suggested it isn't mandatory.
But I _will_ add one once this initial version gets merged.

All other optimizations will be posted as follow-ups once this initial
TDX support is upstreamed.

Hi Dave, Peter, Thomas, Dan (and Intel reviewers),

From v7 -> v8, there's a big change: we are pushing to move TDH.SYS.INIT
and TDH.SYS.LP.INIT (per-LP initialization) out of the kernel.  I'm
assuming that the TDX spec and module will be changed to remove the
TDH.SYS.INIT and TDH.SYS.LP.INIT SEAMCALLs.  As a result, I've removed
those patches from this series.
But the current TDX module that I'm testing with still requires those
SEAMCALLs.  So I'm applying them at the end and testing with them in
place.

I would appreciate it if folks could review this presumptive series
anyway.  And I would appreciate Reviewed-by or Acked-by tags if the
patches look good to you.

----- Changelog history: ------

- v7 -> v8:

  - 200+ LOC removed (from 1800+ -> 1600+).
  - Removed patches to do TDH.SYS.INIT and TDH.SYS.LP.INIT
    (Dave/Peter/Thomas).
  - Removed patch to shut down TDX module (Sean).
  - For memory hotplug, moved the rejection of non-TDX memory from
    arch_add_memory() to memory_notifier (Dan/David).
  - Simplified the "skeleton patch" as a result of removing the
    TDH.SYS.LP.INIT patch.
  - Refined changelog/comments for most of the patches (to tell a better
    story, remove silly comments, etc) (Dave).
  - Added new 'struct tdmr_info_list' struct, and changed all TDMR
    related patches to use it (Dave).
  - Effectively merged patch "Reserve TDX module global KeyID" and
    "Configure TDX module with TDMRs and global KeyID", and removed the
    static variable 'tdx_global_keyid', following Dave's suggestion on
    making tdx_sysinfo a local variable.
  - For detailed changes please see individual patch changelog history.

v7: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- v6 -> v7:

  - Added memory hotplug support.
  - Changed when the list of "TDX-usable" memory regions is chosen, from
    kernel boot time to TDX module initialization time.
  - Addressed comments received in previous versions (Andi/Dave).
  - Improved the commit message and the comments of the kexec() support
    patch.  The patch now also handles returning PAMTs back to the
    kernel when TDX module initialization fails.  Please also see the
    "kexec()" section below.
  - Changed the documentation patch accordingly.
  - For all others please see individual patch changelog history.

v6: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- v5 -> v6:

  - Removed ACPI CPU/memory hotplug patches (Intel internal discussion).
  - Removed patch to disable driver-managed memory hotplug (Intel
    internal discussion).
  - Added one patch to introduce an enum type for the TDX supported page
    size level to replace the hard-coded values in TDX guest code (Dave).
  - Added one patch to make TDX depend on X2APIC being enabled (Dave).
  - Added one patch to build all boot-time present memory regions as TDX
    memory during kernel boot.
  - Added Reviewed-by from others to some patches.
  - For all others please see individual patch changelog history.

v5: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- v4 -> v5:

  This is essentially a resend of v4.  Sorry, I forgot to consult
  get_maintainer.pl when sending out v4, so I forgot to add the
  linux-acpi and linux-mm mailing lists and the relevant people for 4
  new patches.

  There are also very minor code and commit message updates from v4:

  - Rebased to latest tip/x86/tdx.
  - Fixed a checkpatch issue that I missed in v4.
  - Removed an obsolete comment that I missed in patch 6.
  - Very minor update to the commit message of patch 12.

  For other changes to individual patches since v3, please refer to the
  changelog history of individual patches (I just used v3 -> v5 since
  there's basically no code change to v4).

v4: https://lore.kernel.org/lkml/98c84c31d8f062a0b50a69ef4d3188bc259f2af2.1654025431.git.kai.huang@intel.com/T/

- v3 -> v4 (addressed Dave's comments, and other comments from others):

  - Simplified SEAMRR and TDX KeyID detection.
  - Added patches to handle ACPI CPU hotplug.
  - Added patches to handle ACPI memory hotplug and driver-managed
    memory hotplug.
  - Removed tdx_detect() and only use a single tdx_init().
  - Removed detecting the TDX module via P-SEAMLDR.
  - Changed from using e820 to using memblock to convert system RAM to
    TDX memory.
  - Excluded legacy PMEM from TDX memory.
  - Removed the patch that added a boot-time command line to disable TDX.
  - Addressed comments for other individual patches (please see
    individual patches).
  - Improved the documentation patch based on the new implementation.

v3: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- v2 -> v3:

  - Addressed comments from Isaku.
    - Fixed a memory leak and an unnecessary function argument in the
      patch to configure the key for the global KeyID (patch 17).
    - Slightly enhanced the patch to get TDX module and CMR information
      (patch 09).
    - Fixed an unintended change in the patch to allocate PAMT
      (patch 13).
  - Addressed comments from Kevin:
    - Slight improvement to the commit message of patch 03.
  - Removed WARN_ON_ONCE() in the check of cpus_booted_once_mask in
    seamrr_enabled() (patch 04).
  - Changed the documentation patch to add TDX host kernel support
    materials to Documentation/x86/tdx.rst together with the TDX guest
    stuff, instead of a standalone file (patch 21).
  - Very minor improvement in commit messages.

v2: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

- RFC (v1) -> v2:

  - Rebased to Kirill's latest TDX guest code.
  - Fixed two issues that are related to finding all RAM memory regions
    based on e820.
  - Minor improvement on comments and commit messages.

v1: https://lore.kernel.org/lkml/529a22d05e21b9218dc3f29c17ac5a176334cac1.camel@intel.com/T/

== Background ==

TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM) and
a new isolated range pointed to by the SEAM Range Register (SEAMRR).  A
CPU-attested software module called 'the TDX module' runs in the new
isolated region as a trusted hypervisor to create/run protected VMs.

TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
provide crypto-protection to the VMs.  TDX reserves part of the MKTME
KeyIDs as TDX private KeyIDs, which are only accessible within SEAM
mode.
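
As a rough illustration of the KeyID split described above (this is not
code from this series): my understanding is that the split is reported by
the IA32_MKTME_KEYID_PARTITIONING MSR, with the number of MKTME KeyIDs in
bits 31:0 and the number of TDX private KeyIDs in bits 63:32, and that TDX
private KeyIDs start right after the last MKTME KeyID.  Please treat that
layout as an assumption to be checked against the spec.  The struct and
function names below are hypothetical.

```c
#include <stdint.h>

/* Hypothetical holder for the decoded KeyID layout (illustration only). */
struct keyid_layout_sketch {
	uint32_t nr_mktme_keyids;	/* KeyIDs left for MKTME */
	uint32_t nr_tdx_keyids;		/* KeyIDs reserved as TDX private */
	uint32_t tdx_keyid_start;	/* first TDX private KeyID */
};

/*
 * Decode a raw 64-bit partitioning value.
 *
 * Assumed layout: bits 31:0 = number of MKTME KeyIDs,
 * bits 63:32 = number of TDX private KeyIDs.  KeyID 0 is not counted
 * in either field, so the first TDX private KeyID is assumed to be
 * nr_mktme_keyids + 1.
 */
static struct keyid_layout_sketch decode_keyid_partitioning(uint64_t val)
{
	struct keyid_layout_sketch l;

	l.nr_mktme_keyids = (uint32_t)(val & 0xffffffffu);
	l.nr_tdx_keyids = (uint32_t)(val >> 32);
	l.tdx_keyid_start = l.nr_mktme_keyids + 1;

	return l;
}
```

With 31 MKTME KeyIDs and 32 TDX KeyIDs, for instance, the sketch yields
KeyIDs 32..63 as TDX private KeyIDs.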

TDX is different from AMD SEV/SEV-ES/SEV-SNP, which uses a dedicated
secure processor to provide crypto-protection.  The firmware running on
the secure processor plays a role similar to that of the TDX module.

The host kernel communicates with SEAM software via a new SEAMCALL
instruction.  This is conceptually similar to a guest->host hypercall,
except it is made from the host to SEAM software instead.

Before being able to manage TD guests, the TDX module must be loaded and
properly initialized.  This series assumes the TDX module is loaded by
the BIOS before the kernel boots.

How to initialize the TDX module is described in the TDX module 1.0
specification, chapter "13. Intel TDX Module Lifecycle: Enumeration,
Initialization and Shutdown".

== Design Considerations ==

1. Initialize the TDX module at runtime

There are basically two ways the TDX module could be initialized: either
in early boot, or at runtime before the first TDX guest is run.  This
series implements the runtime initialization.

This series adds a function tdx_enable() to allow the caller to
initialize TDX at runtime:

	if (tdx_enable())
		goto no_tdx;
	// TDX is ready to create TD guests.

This approach has the following advantages:

1) Initializing the TDX module requires reserving ~1/256th of system RAM
as metadata.  Enabling TDX on demand means this memory is consumed only
when TDX is truly needed (i.e. when KVM wants to create TD guests).

2) SEAMCALL requires the CPU to already be in VMX operation (VMXON has
been done).  So far, KVM is the only user of TDX, and it already handles
VMXON.  Letting KVM initialize TDX avoids handling VMXON in the core
kernel.

3) It is more flexible for supporting "TDX module runtime update" (not
in this series).  After updating to the new module at runtime, the
kernel needs to go through the initialization process again.

2. CPU hotplug

TDX doesn't support physical (ACPI) CPU hotplug.  A non-buggy BIOS should
never support hotpluggable CPU devices and/or deliver ACPI CPU hotplug
events to the kernel.
This series doesn't handle physical (ACPI) CPU hotplug at all but
depends on the BIOS to behave correctly.

Note TDX works with logical CPU online/offline, thus this series still
allows logical CPU online/offline.

3. Kernel policy on TDX memory

The TDX module reports a list of "Convertible Memory Regions" (CMRs) to
indicate which memory regions are TDX-capable.  The TDX architecture
allows the VMM to designate specific convertible memory regions as
usable for TDX private memory.

The initial support of TDX guests will only allocate TDX private memory
from the global page allocator.  This series chooses to designate _all_
system RAM in the core-mm at the time of initializing the TDX module as
TDX memory, to guarantee all pages in the page allocator are TDX pages.

4. Memory Hotplug

After the kernel passes all "TDX-usable" memory regions to the TDX
module, the set of "TDX-usable" memory regions is fixed for the module's
runtime.  No more "TDX-usable" memory can be added to the TDX module
after that.

To achieve the above guarantee that all pages in the page allocator are
TDX pages, this series simply chooses to reject any non-TDX-usable
memory in memory hotplug.

This _will_ be enhanced in the future after the first submission.  A
better solution, suggested by Kirill, is similar to the per-node memory
encryption flag in this series [4].  We can allow adding/onlining
non-TDX memory to separate NUMA nodes so that both "TDX-capable" nodes
and "non-TDX-capable" nodes can co-exist.  The new TDX flag can be
exposed to userspace via sysfs so userspace can bind TDX guests to
"TDX-capable" nodes via NUMA ABIs.

5. Physical Memory Hotplug

Note TDX assumes convertible memory is always physically present during
the machine's runtime.  A non-buggy BIOS should never support
hot-removal of any convertible memory.  This implementation doesn't
handle ACPI memory removal but depends on the BIOS to behave correctly.

6. Kexec()

There are two problems with using kexec() to boot to a new kernel when
the old kernel has enabled TDX: 1) Part of the memory pages are still
TDX private pages (i.e. metadata used by the TDX module, and any TDX
guest memory if kexec() happens while any TDX guest is alive).  2) There
might be dirty cachelines associated with TDX private pages.

Just like SME, TDX hosts require special cache flushing before kexec().
Similar to the SME handling, the kernel uses wbinvd() to flush the cache
in stop_this_cpu() when TDX is enabled.

This series doesn't convert all TDX private pages back to normal, for
the following reasons:

1) Neither the TDX module nor the kernel has existing infrastructure to
   track which pages are TDX private pages.
2) The number of TDX private pages can be large, and converting all of
   them (cache flush + using MOVDIR64B to clear the page) in kexec() can
   be time consuming.
3) The new kernel will almost exclusively use KeyID 0 to access memory.
   KeyID 0 doesn't support integrity checking, so it's OK.
4) The kernel doesn't (and may never) support MKTME.  If any 3rd party
   kernel ever supports MKTME, it can/should use MOVDIR64B to clear the
   page with the new MKTME KeyID (just like TDX does) before using it.
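
To make the memory hotplug policy from "4. Memory Hotplug" a bit more
concrete, here is a minimal userspace sketch of the kind of check it
implies: a hot-added range is accepted only if it falls entirely inside
one of the "TDX-usable" regions recorded when the TDX module was
initialized.  This is an illustration under my own assumptions, not the
code in the patches; the struct, the function name, and the use of page
frame number ranges are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one "TDX-usable" range: [start_pfn, end_pfn). */
struct tdx_region_sketch {
	uint64_t start_pfn;
	uint64_t end_pfn;
};

/*
 * Return true if [start_pfn, end_pfn) lies entirely inside a single
 * recorded region.  A range that straddles two regions is rejected,
 * even if the regions happen to be adjacent; that is a simplification
 * of this sketch.
 */
static bool range_is_tdx_usable(const struct tdx_region_sketch *regions,
				size_t nr_regions,
				uint64_t start_pfn, uint64_t end_pfn)
{
	for (size_t i = 0; i < nr_regions; i++) {
		if (start_pfn >= regions[i].start_pfn &&
		    end_pfn <= regions[i].end_pfn)
			return true;
	}

	return false;
}
```

A hotplug hook built on such a check would refuse the new memory whenever
the function returns false, which keeps every page in the page allocator
a TDX page.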

Kai Huang (16):
  x86/tdx: Define TDX supported page sizes as macros
  x86/virt/tdx: Detect TDX during kernel boot
  x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC
  x86/virt/tdx: Add skeleton to initialize TDX on demand
  x86/virt/tdx: Implement functions to make SEAMCALL
  x86/virt/tdx: Get information about TDX module and TDX-capable memory
  x86/virt/tdx: Use all system memory when initializing TDX module as
    TDX memory
  x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX
    memory regions
  x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
  x86/virt/tdx: Allocate and set up PAMTs for TDMRs
  x86/virt/tdx: Designate reserved areas for all TDMRs
  x86/virt/tdx: Designate the global KeyID and configure the TDX module
  x86/virt/tdx: Configure global KeyID on all packages
  x86/virt/tdx: Initialize all TDMRs
  x86/virt/tdx: Flush cache in kexec() when TDX is enabled
  Documentation/x86: Add documentation for TDX host support

 Documentation/x86/tdx.rst        |  169 +++-
 arch/x86/Kconfig                 |   15 +
 arch/x86/Makefile                |    2 +
 arch/x86/coco/tdx/tdx.c          |    6 +-
 arch/x86/include/asm/tdx.h       |   23 +
 arch/x86/kernel/process.c        |    8 +-
 arch/x86/kernel/setup.c          |    2 +
 arch/x86/virt/Makefile           |    2 +
 arch/x86/virt/vmx/Makefile       |    2 +
 arch/x86/virt/vmx/tdx/Makefile   |    3 +
 arch/x86/virt/vmx/tdx/seamcall.S |   52 ++
 arch/x86/virt/vmx/tdx/tdx.c      | 1229 ++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h      |  128 ++++
 arch/x86/virt/vmx/tdx/tdxcall.S  |   19 +-
 14 files changed, 1643 insertions(+), 17 deletions(-)
 create mode 100644 arch/x86/virt/Makefile
 create mode 100644 arch/x86/virt/vmx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.c
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.h