From patchwork Mon Jun 26 14:12:31 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13292992 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C1FB3EB64DC for ; Mon, 26 Jun 2023 14:14:01 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231199AbjFZOOA (ORCPT ); Mon, 26 Jun 2023 10:14:00 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34134 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231165AbjFZONx (ORCPT ); Mon, 26 Jun 2023 10:13:53 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 168EF10C8; Mon, 26 Jun 2023 07:13:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687788825; x=1719324825; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=IazzBZEPaJdhzuSVqXgULVuwNqeSBeuaUYlXXc8k87w=; b=hYhJ+CgCC0mxotu6UMu5yIczG/1YdvqozjIM9/yvx/mKG5QKTjdxVK3B wyaqz7yD9WUSWPPcmfBBr5Bj6Hn/lPtIGxwIwt72EdltBnBjfbYeHX0Uc S1d3I6z74lnnXx89qnAAyQm40zWYuZweGnRRe4v89orqLDyUIw0SbNIQ2 oitQ3aeqj1OIl2wE2nRj9N31gdReeoKgkjpcS1yQBC7wplwho6iaHB7Fw +tTrlfAclv6hITnAZqlHInDiY67lDEDRzeQBF27/mcwN4oz748iFUQdcB 8BerEMFnXGIuiG2gifga/g9soe9fhr4TV3p3hP1VWdplMsYG5bFup/zLa A==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346033488" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346033488" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:13:38 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292212" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292212" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:13:31 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 01/22] x86/tdx: Define TDX supported page sizes as macros Date: Tue, 27 Jun 2023 02:12:31 +1200 Message-Id: <05adfd0b3f59b1f3eb10a1a1d045b01b9bba4379.1687784645.git.kai.huang@intel.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org TDX supports 4K, 2M and 1G page sizes. The corresponding values are defined by the TDX module spec and used as TDX module ABI. Currently, they are used in try_accept_one() when the TDX guest tries to accept a page. However currently try_accept_one() uses hard-coded magic values. Define TDX supported page sizes as macros and get rid of the hard-coded values in try_accept_one(). TDX host support will need to use them too. Signed-off-by: Kai Huang Reviewed-by: Kirill A. Shutemov Reviewed-by: Dave Hansen Reviewed-by: David Hildenbrand --- v11 -> v12: - No change. v10 -> v11: - Added David's Reviewed-by. v9 -> v10: - No change. v8 -> v9: - Added Dave's Reviewed-by v7 -> v8: - Improved the comment of TDX supported page sizes macros (Dave) v6 -> v7: - Removed the helper to convert kernel page level to TDX page level. - Changed to use macro to define TDX supported page sizes. --- arch/x86/coco/tdx/tdx.c | 6 +++--- arch/x86/include/asm/tdx.h | 5 +++++ 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c index 5b8056f6c83f..b34851297ae5 100644 --- a/arch/x86/coco/tdx/tdx.c +++ b/arch/x86/coco/tdx/tdx.c @@ -755,13 +755,13 @@ static bool try_accept_one(phys_addr_t *start, unsigned long len, */ switch (pg_level) { case PG_LEVEL_4K: - page_size = 0; + page_size = TDX_PS_4K; break; case PG_LEVEL_2M: - page_size = 1; + page_size = TDX_PS_2M; break; case PG_LEVEL_1G: - page_size = 2; + page_size = TDX_PS_1G; break; default: return false; diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 28d889c9aa16..25fd6070dc0b 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -20,6 +20,11 @@ #ifndef __ASSEMBLY__ +/* TDX supported page sizes from the TDX module ABI. */ +#define TDX_PS_4K 0 +#define TDX_PS_2M 1 +#define TDX_PS_1G 2 + /* * Used to gather the output registers values of the TDCALL and SEAMCALL * instructions when requesting services from the TDX module. From patchwork Mon Jun 26 14:12:32 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13292993 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id DB4B6EB64DA for ; Mon, 26 Jun 2023 14:14:04 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231219AbjFZOOD (ORCPT ); Mon, 26 Jun 2023 10:14:03 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34260 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231180AbjFZON7 (ORCPT ); Mon, 26 Jun 2023 10:13:59 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8E72E10DB; Mon, 26 Jun 2023 07:13:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687788826; x=1719324826; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=AleeWME7o6+0K/8YXlexP0eyV6hKggwVTUBDGmDHQc8=; b=WoPbpwITvN/DBBzpVwp0NLjc3IV8uEsQDftZ3iXlCYXVElJ8A1+sIahI Jnf5+/GIG2NoX6n6RfoOe6ZpaUMRbvI0AUD+gg4Ku9TpQDYj5TpSPJ54y 58/QFSo5SqRlpR7j5p67GpNgyINhPJM2LrLjFfGZ2OAaWjkHvRyilkvRw lnVkLpOmT3H+bnUkXv9uvyW87GliuZ+nvymwAJHexdeuBpBtVz3C2BsmT xyVH3WSDaeIaSY/FbkHOCIa+C9DAuNZbm3hbY+1d1C8wCCqljUXBJOUNL hF5rbsd9TC8wCGpBOXNgs3yU0h7OC0oEb5bwoy3K3OKAhR08R6Ma13FiQ Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346033529" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346033529" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:13:45 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292238" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292238" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:13:38 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 02/22] x86/virt/tdx: Detect TDX during kernel boot Date: Tue, 27 Jun 2023 02:12:32 +1200 Message-Id: <648189408c827bde1003344294c9222861d3315f.1687784645.git.kai.huang@intel.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Intel Trust Domain Extensions (TDX) protects guest VMs from malicious host and certain physical attacks. A CPU-attested software module called 'the TDX module' runs inside a new isolated memory range as a trusted hypervisor to manage and run protected VMs. Pre-TDX Intel hardware has support for a memory encryption architecture called MKTME. The memory encryption hardware underpinning MKTME is also used for Intel TDX. TDX ends up "stealing" some of the physical address space from the MKTME architecture for crypto-protection to VMs. The BIOS is responsible for partitioning the "KeyID" space between legacy MKTME and TDX. The KeyIDs reserved for TDX are called 'TDX private KeyIDs' or 'TDX KeyIDs' for short. During machine boot, TDX microcode verifies that the BIOS programmed TDX private KeyIDs consistently and correctly programmed across all CPU packages. The MSRs are locked in this state after verification. This is why MSR_IA32_MKTME_KEYID_PARTITIONING gets used for TDX enumeration: it indicates not just that the hardware supports TDX, but that all the boot-time security checks passed. The TDX module is expected to be loaded by the BIOS when it enables TDX, but the kernel needs to properly initialize it before it can be used to create and run any TDX guests. The TDX module will be initialized by the KVM subsystem when KVM wants to use TDX. Add a new early_initcall(tdx_init) to detect the TDX by detecting TDX private KeyIDs. Also add a function to report whether TDX is enabled by the BIOS. Similar to AMD SME, kexec() will use it to determine whether cache flush is needed. The TDX module itself requires one TDX KeyID as the 'TDX global KeyID' to protect its metadata. Each TDX guest also needs a TDX KeyID for its own protection. Just use the first TDX KeyID as the global KeyID and leave the rest for TDX guests. If no TDX KeyID is left for TDX guests, disable TDX as initializing the TDX module alone is useless. To start to support TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for TDX host kernel support. Add a new Kconfig option CONFIG_INTEL_TDX_HOST to opt-in TDX host kernel support (to distinguish with TDX guest kernel support). So far only KVM uses TDX. Make the new config option depend on KVM_INTEL. Signed-off-by: Kai Huang Reviewed-by: Kirill A. Shutemov Reviewed-by: Isaku Yamahata Reviewed-by: David Hildenbrand --- v11 -> v12: - Improve setting up guest's TDX keyID range (David) - ++tdx_keyid_start -> tdx_keyid_start + 1 - --nr_tdx_keyids -> nr_tdx_keyids - 1 - 'return -ENODEV' instead of 'goto no_tdx' (Sathy) - pr_info() -> pr_err() (Isaku) - Added tags from Isaku/David v10 -> v11 (David): - "host kernel" -> "the host kernel" - "protected VM" -> "confidential VM". - Moved setting tdx_global_keyid to the end of tdx_init(). v9 -> v10: - No change. v8 -> v9: - Moved MSR macro from local tdx.h to (Dave). - Moved reserving the TDX global KeyID from later patch to here. - Changed 'tdx_keyid_start' and 'nr_tdx_keyids' to 'tdx_guest_keyid_start' and 'tdx_nr_guest_keyids' to represent KeyIDs can be used by guest. (Dave) - Slight changelog update according to above changes. v7 -> v8: (address Dave's comments) - Improved changelog: - "KVM user" -> "The TDX module will be initialized by KVM when ..." - Changed "tdx_int" part to "Just say what this patch is doing" - Fixed the last sentence of "kexec()" paragraph - detect_tdx() -> record_keyid_partitioning() - Improved how to calculate tdx_keyid_start. - tdx_keyid_num -> nr_tdx_keyids. - Improved dmesg printing. - Add comment to clear_tdx(). v6 -> v7: - No change. v5 -> v6: - Removed SEAMRR detection to make code simpler. - Removed the 'default N' in the KVM_TDX_HOST Kconfig (Kirill). - Changed to use 'obj-y' in arch/x86/virt/vmx/tdx/Makefile (Kirill). --- arch/x86/Kconfig | 12 +++++ arch/x86/Makefile | 2 + arch/x86/include/asm/msr-index.h | 3 ++ arch/x86/include/asm/tdx.h | 7 +++ arch/x86/virt/Makefile | 2 + arch/x86/virt/vmx/Makefile | 2 + arch/x86/virt/vmx/tdx/Makefile | 2 + arch/x86/virt/vmx/tdx/tdx.c | 90 ++++++++++++++++++++++++++++++++ 8 files changed, 120 insertions(+) create mode 100644 arch/x86/virt/Makefile create mode 100644 arch/x86/virt/vmx/Makefile create mode 100644 arch/x86/virt/vmx/tdx/Makefile create mode 100644 arch/x86/virt/vmx/tdx/tdx.c diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 53bab123a8ee..191587f75810 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1952,6 +1952,18 @@ config X86_SGX If unsure, say N. +config INTEL_TDX_HOST + bool "Intel Trust Domain Extensions (TDX) host support" + depends on CPU_SUP_INTEL + depends on X86_64 + depends on KVM_INTEL + help + Intel Trust Domain Extensions (TDX) protects guest VMs from malicious + host and certain physical attacks. This option enables necessary TDX + support in the host kernel to run confidential VMs. + + If unsure, say N. + config EFI bool "EFI runtime service support" depends on ACPI diff --git a/arch/x86/Makefile b/arch/x86/Makefile index b39975977c03..ec0e71d8fa30 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -252,6 +252,8 @@ archheaders: libs-y += arch/x86/lib/ +core-y += arch/x86/virt/ + # drivers-y are linked after core-y drivers-$(CONFIG_MATH_EMULATION) += arch/x86/math-emu/ drivers-$(CONFIG_PCI) += arch/x86/pci/ diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 3aedae61af4f..6d8f15b1552c 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -523,6 +523,9 @@ #define MSR_RELOAD_PMC0 0x000014c1 #define MSR_RELOAD_FIXED_CTR0 0x00001309 +/* KeyID partitioning between MKTME and TDX */ +#define MSR_IA32_MKTME_KEYID_PARTITIONING 0x00000087 + /* * AMD64 MSRs. Not complete. See the architecture manual for a more * complete list. diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 25fd6070dc0b..4dfe2e794411 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -94,5 +94,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, return -ENODEV; } #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */ + +#ifdef CONFIG_INTEL_TDX_HOST +bool platform_tdx_enabled(void); +#else /* !CONFIG_INTEL_TDX_HOST */ +static inline bool platform_tdx_enabled(void) { return false; } +#endif /* CONFIG_INTEL_TDX_HOST */ + #endif /* !__ASSEMBLY__ */ #endif /* _ASM_X86_TDX_H */ diff --git a/arch/x86/virt/Makefile b/arch/x86/virt/Makefile new file mode 100644 index 000000000000..1e36502cd738 --- /dev/null +++ b/arch/x86/virt/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only +obj-y += vmx/ diff --git a/arch/x86/virt/vmx/Makefile b/arch/x86/virt/vmx/Makefile new file mode 100644 index 000000000000..feebda21d793 --- /dev/null +++ b/arch/x86/virt/vmx/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only +obj-$(CONFIG_INTEL_TDX_HOST) += tdx/ diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile new file mode 100644 index 000000000000..93ca8b73e1f1 --- /dev/null +++ b/arch/x86/virt/vmx/tdx/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only +obj-y += tdx.o diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c new file mode 100644 index 000000000000..908590e85749 --- /dev/null +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -0,0 +1,90 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright(c) 2023 Intel Corporation. + * + * Intel Trusted Domain Extensions (TDX) support + */ + +#define pr_fmt(fmt) "tdx: " fmt + +#include +#include +#include +#include +#include +#include +#include +#include + +static u32 tdx_global_keyid __ro_after_init; +static u32 tdx_guest_keyid_start __ro_after_init; +static u32 tdx_nr_guest_keyids __ro_after_init; + +static int __init record_keyid_partitioning(u32 *tdx_keyid_start, + u32 *nr_tdx_keyids) +{ + u32 _nr_mktme_keyids, _tdx_keyid_start, _nr_tdx_keyids; + int ret; + + /* + * IA32_MKTME_KEYID_PARTIONING: + * Bit [31:0]: Number of MKTME KeyIDs. + * Bit [63:32]: Number of TDX private KeyIDs. + */ + ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &_nr_mktme_keyids, + &_nr_tdx_keyids); + if (ret) + return -ENODEV; + + if (!_nr_tdx_keyids) + return -ENODEV; + + /* TDX KeyIDs start after the last MKTME KeyID. */ + _tdx_keyid_start = _nr_mktme_keyids + 1; + + *tdx_keyid_start = _tdx_keyid_start; + *nr_tdx_keyids = _nr_tdx_keyids; + + return 0; +} + +static int __init tdx_init(void) +{ + u32 tdx_keyid_start, nr_tdx_keyids; + int err; + + err = record_keyid_partitioning(&tdx_keyid_start, &nr_tdx_keyids); + if (err) + return err; + + pr_info("BIOS enabled: private KeyID range [%u, %u)\n", + tdx_keyid_start, tdx_keyid_start + nr_tdx_keyids); + + /* + * The TDX module itself requires one 'global KeyID' to protect + * its metadata. If there's only one TDX KeyID, there won't be + * any left for TDX guests thus there's no point to enable TDX + * at all. + */ + if (nr_tdx_keyids < 2) { + pr_err("initialization failed: too few private KeyIDs available.\n"); + return -ENODEV; + } + + /* + * Just use the first TDX KeyID as the 'global KeyID' and + * leave the rest for TDX guests. + */ + tdx_global_keyid = tdx_keyid_start; + tdx_guest_keyid_start = tdx_keyid_start + 1; + tdx_nr_guest_keyids = nr_tdx_keyids - 1; + + return 0; +} +early_initcall(tdx_init); + +/* Return whether the BIOS has enabled TDX */ +bool platform_tdx_enabled(void) +{ + return !!tdx_global_keyid; +} From patchwork Mon Jun 26 14:12:33 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13292994 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 36599EB64DA for ; Mon, 26 Jun 2023 14:14:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231288AbjFZOOR (ORCPT ); Mon, 26 Jun 2023 10:14:17 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34202 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231174AbjFZOOB (ORCPT ); Mon, 26 Jun 2023 10:14:01 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 26C241701; Mon, 26 Jun 2023 07:13:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687788833; x=1719324833; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=UcselnuuZdl/x5uBa9vVdTh1BZxikIkr1yVlrBvuRMQ=; b=cwkOaqLAaWo0x1aGyDRWFE5J8WhbfXXGRYdDhhUFeezgyilzFqF/rQev ciBG9PaMcP2AYzcjnVb8yfNYqq9WpiiJYojHYbGiUXawebCY24Q4NUKB7 dU33y2aZxeA7X0TUSXFirQAd0AM7TMuU2rTy5HwiZ2CVNUHNULHeSwMoh vCjbsxO2tSjVMPV12p7xFjr6WxkBeP9vgLPleFuLQgMD799gCnDlQC6k8 GpVe9vUlHk8/TJE5TQxG51+lhOo3Crqrx3IdaMd354NeZy1dwsX/6uQUu /iVt3Y0yNUX4xMuOvjowyVnX6jvRqV6p8W6xQoy8uwmrrMBv0TPXS6N1p A==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346033578" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346033578" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:13:52 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292251" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292251" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:13:45 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 03/22] x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC Date: Tue, 27 Jun 2023 02:12:33 +1200 Message-Id: X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org TDX capable platforms are locked to X2APIC mode and cannot fall back to the legacy xAPIC mode when TDX is enabled by the BIOS. TDX host support requires x2APIC. Make INTEL_TDX_HOST depend on X86_X2APIC. Link: https://lore.kernel.org/lkml/ba80b303-31bf-d44a-b05d-5c0f83038798@intel.com/ Signed-off-by: Kai Huang Reviewed-by: Dave Hansen Reviewed-by: David Hildenbrand Reviewed-by: Kirill A. Shutemov --- v11 -> v12: - Added Kirill's Reviewd-by. v10 -> v11: - Added David's Reviewed-by. v9 -> v10: - No change. v8 -> v9: - Added Dave's Reviewed-by. v7 -> v8: (Dave) - Only make INTEL_TDX_HOST depend on X86_X2APIC but removed other code - Rewrote the changelog. v6 -> v7: - Changed to use "Link" for the two lore links to get rid of checkpatch warning. --- arch/x86/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 191587f75810..f0f3f1a2c8e0 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1957,6 +1957,7 @@ config INTEL_TDX_HOST depends on CPU_SUP_INTEL depends on X86_64 depends on KVM_INTEL + depends on X86_X2APIC help Intel Trust Domain Extensions (TDX) protects guest VMs from malicious host and certain physical attacks. This option enables necessary TDX From patchwork Mon Jun 26 14:12:34 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13292995 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 93BD2EB64DA for ; Mon, 26 Jun 2023 14:14:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231294AbjFZOO3 (ORCPT ); Mon, 26 Jun 2023 10:14:29 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34204 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231249AbjFZOOQ (ORCPT ); Mon, 26 Jun 2023 10:14:16 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 065CC10C8; Mon, 26 Jun 2023 07:13:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687788840; x=1719324840; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=AT1tL9/WjIAyxJSzeIJnm4LmfMyuPmLuqOcY2UkVzlU=; b=ND2unvI+xOygO7VWdNAgTSS6tzfKcv/HWHI7MO/wKQqRyn8SW+ma8XB8 ix93gpnlBj7O0PAhVcttvkAxpomAP+PYnL9F9bTHmBpq4PyhvcTl2JQtJ XGD/vxdtcnALbE8FivOHd2ZVPFiLyOOOlkmkGwNWqeqKvGZ2v7iIUPGa5 +2VznLxY+IgSnnt0/VIb3dxZRECnkchwuHKmfLPbwZG27alXCUIScftay m8bRwop2SUCzmB7fjSUBlaxS4+5125L2ka31dqriaoZW6WLbjlk2kMwDA dQ2RauBVn75BsyoRdRqW7BpFuNwqhNBav7Osjp+nx8dnTvPzoIvyFmJRs A==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346033608" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346033608" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:13:59 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292271" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292271" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:13:52 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 04/22] x86/cpu: Detect TDX partial write machine check erratum Date: Tue, 27 Jun 2023 02:12:34 +1200 Message-Id: <0f701502157029989617bcb3f5940ff48e19a2b2.1687784645.git.kai.huang@intel.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org TDX memory has integrity and confidentiality protections. Violations of this integrity protection are supposed to only affect TDX operations and are never supposed to affect the host kernel itself. In other words, the host kernel should never, itself, see machine checks induced by the TDX integrity hardware. Alas, the first few generations of TDX hardware have an erratum. A partial write to a TDX private memory cacheline will silently "poison" the line. Subsequent reads will consume the poison and generate a machine check. According to the TDX hardware spec, neither of these things should have happened. Virtually all kernel memory accesses operations happen in full cachelines. In practice, writing a "byte" of memory usually reads a 64 byte cacheline of memory, modifies it, then writes the whole line back. Those operations do not trigger this problem. This problem is triggered by "partial" writes where a write transaction of less than cacheline lands at the memory controller. The CPU does these via non-temporal write instructions (like MOVNTI), or through UC/WC memory mappings. The issue can also be triggered away from the CPU by devices doing partial writes via DMA. With this erratum, there are additional things need to be done. Similar to other CPU bugs, use a CPU bug bit to indicate this erratum, and detect this erratum during early boot. Note this bug reflects the hardware thus it is detected regardless of whether the kernel is built with TDX support or not. Signed-off-by: Kai Huang Reviewed-by: Kirill A. Shutemov Reviewed-by: David Hildenbrand --- v11 -> v12: - Added Kirill's tag - Changed to detect the erratum in early_init_intel() (Kirill) v10 -> v11: - New patch --- arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/kernel/cpu/intel.c | 17 +++++++++++++++++ 2 files changed, 18 insertions(+) diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index cb8ca46213be..dc8701f8d88b 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -483,5 +483,6 @@ #define X86_BUG_RETBLEED X86_BUG(27) /* CPU is affected by RETBleed */ #define X86_BUG_EIBRS_PBRSB X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */ #define X86_BUG_SMT_RSB X86_BUG(29) /* CPU is vulnerable to Cross-Thread Return Address Predictions */ +#define X86_BUG_TDX_PW_MCE X86_BUG(30) /* CPU may incur #MC if non-TD software does partial write to TDX private memory */ #endif /* _ASM_X86_CPUFEATURES_H */ diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index 1c4639588ff9..e6c3107adc15 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -358,6 +358,21 @@ int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type) } EXPORT_SYMBOL_GPL(intel_microcode_sanity_check); +static void check_tdx_erratum(struct cpuinfo_x86 *c) +{ + /* + * These CPUs have an erratum. A partial write from non-TD + * software (e.g. via MOVNTI variants or UC/WC mapping) to TDX + * private memory poisons that memory, and a subsequent read of + * that memory triggers #MC. + */ + switch (c->x86_model) { + case INTEL_FAM6_SAPPHIRERAPIDS_X: + case INTEL_FAM6_EMERALDRAPIDS_X: + setup_force_cpu_bug(X86_BUG_TDX_PW_MCE); + } +} + static void early_init_intel(struct cpuinfo_x86 *c) { u64 misc_enable; @@ -509,6 +524,8 @@ static void early_init_intel(struct cpuinfo_x86 *c) */ if (detect_extended_topology_early(c) < 0) detect_ht_early(c); + + check_tdx_erratum(c); } static void bsp_init_intel(struct cpuinfo_x86 *c) From patchwork Mon Jun 26 14:12:35 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13292996 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 35396EB64D7 for ; Mon, 26 Jun 2023 14:14:39 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231240AbjFZOOi (ORCPT ); Mon, 26 Jun 2023 10:14:38 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34610 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231236AbjFZOO1 (ORCPT ); Mon, 26 Jun 2023 10:14:27 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 888C21993; Mon, 26 Jun 2023 07:14:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687788846; x=1719324846; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=t6DeTcVVz69NU4jP/RLKnV8J3tDyrmh8ZA/L9s72880=; b=SYpXkxy3zRO0hX8Q9yGFg3NW9UuR8ZXeBWf6jx0DxK+zOcUT61y6V2CN jw6KVAvd+8RNSDyQ1UC7eADu3HzTbCnEpFRbkwbPJPMj32twPcY6drKDi fmW7R+lyqARqD3rg57ikVV6NgrEAFqu5xXuVNmczErDRg5Z/oljt/DAmF oXnWUBmN13gA5t+zgOnpsXBcZ6nZvfEcWoJJdDfeFFWL5KasFpcLG4oi2 JAAlerYflSiBwqLG0u2G2ozRqHDbi037O3ttz0cQxigW86td74/5MRABI 4V6VaQI2NtKKgMMK+EgDCQ/gpoLeIe8+txIcEn+h80fErIGsYJb+CWjVr w==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346033636" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346033636" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:14:06 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292288" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292288" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:13:59 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 05/22] x86/virt/tdx: Add SEAMCALL infrastructure Date: Tue, 27 Jun 2023 02:12:35 +1200 Message-Id: X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). This mode runs only the TDX module itself or other code to load the TDX module. The host kernel communicates with SEAM software via a new SEAMCALL instruction. This is conceptually similar to a guest->host hypercall, except it is made from the host to SEAM software instead. The TDX module establishes a new SEAMCALL ABI which allows the host to initialize the module and to manage VMs. Add infrastructure to make SEAMCALLs. The SEAMCALL ABI is very similar to the TDCALL ABI and leverages much TDCALL infrastructure. Also add a wrapper function of SEAMCALL to convert SEAMCALL error code to the kernel error code, and print out SEAMCALL error code to help the user to understand what went wrong. Signed-off-by: Kai Huang Reviewed-by: Isaku Yamahata --- v11 -> v12: - Moved _ASM_EXT_TABLE() for #UD/#GP to a later patch for better patch review, and removed related part from changelog. - Minor code changes in seamcall() (David) - Added Isaku's tag v10 -> v11: - No update v9 -> v10: - Make the TDX_SEAMCALL_{GP|UD} error codes unconditional but doesn't define them when INTEL_TDX_HOST is enabled. (Dave) - Slightly improved changelog to explain why add assembly code to handle #UD and #GP. v8 -> v9: - Changed patch title (Dave). - Enhanced seamcall() to include the cpu id to the error message when SEAMCALL fails. v7 -> v8: - Improved changelog (Dave): - Trim down some sentences (Dave). - Removed __seamcall() and seamcall() function name and changed accordingly (Dave). - Improved the sentence explaining why to handle #GP (Dave). - Added code to print out error message in seamcall(), following the idea that tdx_enable() to return universal error and print out error message to make clear what's going wrong (Dave). Also mention this in changelog. v6 -> v7: - No change. v5 -> v6: - Added code to handle #UD and #GP (Dave). - Moved the seamcall() wrapper function to this patch, and used a temporary __always_unused to avoid compile warning (Dave). - v3 -> v5 (no feedback on v4): - Explicitly tell TDX_SEAMCALL_VMFAILINVALID is returned if the SEAMCALL itself fails. - Improve the changelog. --- arch/x86/virt/vmx/tdx/Makefile | 2 +- arch/x86/virt/vmx/tdx/seamcall.S | 52 ++++++++++++++++++++++++++++++++ arch/x86/virt/vmx/tdx/tdx.c | 42 ++++++++++++++++++++++++++ arch/x86/virt/vmx/tdx/tdx.h | 10 ++++++ 4 files changed, 105 insertions(+), 1 deletion(-) create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S create mode 100644 arch/x86/virt/vmx/tdx/tdx.h diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile index 93ca8b73e1f1..38d534f2c113 100644 --- a/arch/x86/virt/vmx/tdx/Makefile +++ b/arch/x86/virt/vmx/tdx/Makefile @@ -1,2 +1,2 @@ # SPDX-License-Identifier: GPL-2.0-only -obj-y += tdx.o +obj-y += tdx.o seamcall.o diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S new file mode 100644 index 000000000000..f81be6b9c133 --- /dev/null +++ b/arch/x86/virt/vmx/tdx/seamcall.S @@ -0,0 +1,52 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#include +#include + +#include "tdxcall.S" + +/* + * __seamcall() - Host-side interface functions to SEAM software module + * (the P-SEAMLDR or the TDX module). + * + * Transform function call register arguments into the SEAMCALL register + * ABI. Return TDX_SEAMCALL_VMFAILINVALID if the SEAMCALL itself fails, + * or the completion status of the SEAMCALL leaf function. Additional + * output operands are saved in @out (if it is provided by the caller). + * + *------------------------------------------------------------------------- + * SEAMCALL ABI: + *------------------------------------------------------------------------- + * Input Registers: + * + * RAX - SEAMCALL Leaf number. + * RCX,RDX,R8-R9 - SEAMCALL Leaf specific input registers. + * + * Output Registers: + * + * RAX - SEAMCALL completion status code. + * RCX,RDX,R8-R11 - SEAMCALL Leaf specific output registers. + * + *------------------------------------------------------------------------- + * + * __seamcall() function ABI: + * + * @fn (RDI) - SEAMCALL Leaf number, moved to RAX + * @rcx (RSI) - Input parameter 1, moved to RCX + * @rdx (RDX) - Input parameter 2, moved to RDX + * @r8 (RCX) - Input parameter 3, moved to R8 + * @r9 (R8) - Input parameter 4, moved to R9 + * + * @out (R9) - struct tdx_module_output pointer + * stored temporarily in R12 (not + * used by the P-SEAMLDR or the TDX + * module). It can be NULL. + * + * Return (via RAX) the completion status of the SEAMCALL, or + * TDX_SEAMCALL_VMFAILINVALID. + */ +SYM_FUNC_START(__seamcall) + FRAME_BEGIN + TDX_MODULE_CALL host=1 + FRAME_END + RET +SYM_FUNC_END(__seamcall) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 908590e85749..f8233cba5931 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -12,14 +12,56 @@ #include #include #include +#include #include #include #include +#include "tdx.h" static u32 tdx_global_keyid __ro_after_init; static u32 tdx_guest_keyid_start __ro_after_init; static u32 tdx_nr_guest_keyids __ro_after_init; +/* + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code + * to kernel error code. @seamcall_ret and @out contain the SEAMCALL + * leaf function return code and the additional output respectively if + * not NULL. + */ +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, + u64 *seamcall_ret, + struct tdx_module_output *out) +{ + u64 sret; + int cpu; + + /* Need a stable CPU id for printing error message */ + cpu = get_cpu(); + sret = __seamcall(fn, rcx, rdx, r8, r9, out); + put_cpu(); + + /* Save SEAMCALL return code if the caller wants it */ + if (seamcall_ret) + *seamcall_ret = sret; + + switch (sret) { + case 0: + /* SEAMCALL was successful */ + return 0; + case TDX_SEAMCALL_VMFAILINVALID: + pr_err_once("module is not loaded.\n"); + return -ENODEV; + default: + pr_err_once("SEAMCALL failed: CPU %d: leaf %llu, error 0x%llx.\n", + cpu, fn, sret); + if (out) + pr_err_once("additional output: rcx 0x%llx, rdx 0x%llx, r8 0x%llx, r9 0x%llx, r10 0x%llx, r11 0x%llx.\n", + out->rcx, out->rdx, out->r8, + out->r9, out->r10, out->r11); + return -EIO; + } +} + static int __init record_keyid_partitioning(u32 *tdx_keyid_start, u32 *nr_tdx_keyids) { diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h new file mode 100644 index 000000000000..48ad1a1ba737 --- /dev/null +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -0,0 +1,10 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _X86_VIRT_TDX_H +#define _X86_VIRT_TDX_H + +#include + +struct tdx_module_output; +u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, + struct tdx_module_output *out); +#endif From patchwork Mon Jun 26 14:12:36 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13292997 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 95D51EB64D7 for ; Mon, 26 Jun 2023 14:15:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229825AbjFZOPH (ORCPT ); Mon, 26 Jun 2023 10:15:07 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34204 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231312AbjFZOOj (ORCPT ); Mon, 26 Jun 2023 10:14:39 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 048392136; Mon, 26 Jun 2023 07:14:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687788854; x=1719324854; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=0Ki7Z26euf8kOefnzsmE9r5pIZDZOVNrEdzPzosdKV8=; b=TDS1PXNQnMhvBxbuf+z787MPjPgRQXXpPvjgaJFXHdHKf9s5O++AE4Wa +0iCMHJThmtn1bSwaZScyPhTXurcDOt2S+sGON3lBYMq4WmwBQo/ZX4Kh 1ZCV6xREz6TuaSU5UIWwNgBnprNBjFz+TZsbub6JhvbSk8+Xwjriqpr2q PtEp3cYcVZ4sO1/AAzxlqDEGKzEEiFmZc6SnhVaj90GaYtlm26Xco1Va2 4fNePkwZ98o5pcgn45dCEntcxW7yXzgGv7S57W0LGmDy5NimaNoyJILx+ REPKr8xqbhfNiueQI9RAYM0sSTexaosPMz63rQ+sgnIxJLUL7qjwkiCU/ Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346033662" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346033662" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:14:12 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292303" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292303" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:14:06 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 06/22] x86/virt/tdx: Handle SEAMCALL running out of entropy error Date: Tue, 27 Jun 2023 02:12:36 +1200 Message-Id: X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Some SEAMCALLs use the RDRAND hardware and can fail for the same reasons as RDRAND. Use the kernel RDRAND retry logic for them. Signed-off-by: Kai Huang Reviewed-by: Dave Hansen Reviewed-by: Kirill A. Shutemov Reviewed-by: David Hildenbrand --- v11 -> v12: - Added tags from Dave/Kirill/David. - Improved changelog (Dave). - Slight code improvement (David) - Initialize retry directly when declaring it. - Simplify comment around mimic rdrand_long(). v10 -> v11: - New patch --- arch/x86/virt/vmx/tdx/tdx.c | 16 ++++++++++++++-- arch/x86/virt/vmx/tdx/tdx.h | 17 +++++++++++++++++ 2 files changed, 31 insertions(+), 2 deletions(-) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index f8233cba5931..141d12376c4d 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include "tdx.h" @@ -32,12 +33,23 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, u64 *seamcall_ret, struct tdx_module_output *out) { + int cpu, retry = RDRAND_RETRY_LOOPS; u64 sret; - int cpu; /* Need a stable CPU id for printing error message */ cpu = get_cpu(); - sret = __seamcall(fn, rcx, rdx, r8, r9, out); + + /* + * Certain SEAMCALL leaf functions may return error due to + * running out of entropy, in which case the SEAMCALL should + * be retried. Handle this in SEAMCALL common function. + * + * Mimic rdrand_long() retry behavior. + */ + do { + sret = __seamcall(fn, rcx, rdx, r8, r9, out); + } while (sret == TDX_RND_NO_ENTROPY && --retry); + put_cpu(); /* Save SEAMCALL return code if the caller wants it */ diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h index 48ad1a1ba737..55dbb1b8c971 100644 --- a/arch/x86/virt/vmx/tdx/tdx.h +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -4,6 +4,23 @@ #include +/* + * This file contains both macros and data structures defined by the TDX + * architecture and Linux defined software data structures and functions. + * The two should not be mixed together for better readability. The + * architectural definitions come first. + */ + +/* + * TDX SEAMCALL error codes + */ +#define TDX_RND_NO_ENTROPY 0x8000020300000000ULL + +/* + * Do not put any hardware-defined TDX structure representations below + * this comment! + */ + struct tdx_module_output; u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, struct tdx_module_output *out); From patchwork Mon Jun 26 14:12:37 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13292998 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 65AC5EB64DC for ; Mon, 26 Jun 2023 14:15:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231233AbjFZOPL (ORCPT ); Mon, 26 Jun 2023 10:15:11 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34608 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231249AbjFZOOs (ORCPT ); Mon, 26 Jun 2023 10:14:48 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 659ED170F; Mon, 26 Jun 2023 07:14:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687788860; x=1719324860; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=djNSftBcYLpF7pO3r41BBoH2hINTuguq/oJSItu9PeQ=; b=HiVFAgpcYGofIxZoCuVQW36dU83EYKMtQF9UE8eZwQsrhKqS9maVj1cw 43H4+s3kn/H27vugWeHmJVWXzUUmdCv42VM72lmY4Toy8MBnK21OOSQeD OSgUKr9LU5nyYeHjSubDqNxjcOFNmcnQ+QRHoWSW1Ifr0Z/yqL4+mDe5v rFXbZabnqklYnHQjyg5fetRxBGuoWLtzBrnQVSTWCWZKaLK2qgfvPpbKx 7xNjy9qkxx7a88Ru2EPKXtWT6cK6DYsNhGYMwypdLuGAsFozI8MBIIIcz 3DNEDhVCwMhmIzo9NeHL5gHLPwMGPhGBp+91tGx3kEUTZquOCNEGI81O7 g==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346033698" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346033698" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:14:20 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292322" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292322" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:14:13 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 07/22] x86/virt/tdx: Add skeleton to enable TDX on demand Date: Tue, 27 Jun 2023 02:12:37 +1200 Message-Id: <104d324cd68b12e14722ee5d85a660cccccd8892.1687784645.git.kai.huang@intel.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org To enable TDX the kernel needs to initialize TDX from two perspectives: 1) Do a set of SEAMCALLs to initialize the TDX module to make it ready to create and run TDX guests; 2) Do the per-cpu initialization SEAMCALL on one logical cpu before the kernel wants to make any other SEAMCALLs on that cpu (including those involved during module initialization and running TDX guests). The TDX module can be initialized only once in its lifetime. Instead of always initializing it at boot time, this implementation chooses an "on demand" approach to initialize TDX until there is a real need (e.g when requested by KVM). This approach has below pros: 1) It avoids consuming the memory that must be allocated by kernel and given to the TDX module as metadata (~1/256th of the TDX-usable memory), and also saves the CPU cycles of initializing the TDX module (and the metadata) when TDX is not used at all. 2) The TDX module design allows it to be updated while the system is running. The update procedure shares quite a few steps with this "on demand" initialization mechanism. The hope is that much of "on demand" mechanism can be shared with a future "update" mechanism. A boot-time TDX module implementation would not be able to share much code with the update mechanism. 3) Making SEAMCALL requires VMX to be enabled. Currently, only the KVM code mucks with VMX enabling. If the TDX module were to be initialized separately from KVM (like at boot), the boot code would need to be taught how to muck with VMX enabling and KVM would need to be taught how to cope with that. Making KVM itself responsible for TDX initialization lets the rest of the kernel stay blissfully unaware of VMX. Similar to module initialization, also make the per-cpu initialization "on demand" as it also depends on VMX being enabled. Add two functions, tdx_enable() and tdx_cpu_enable(), to enable the TDX module and enable TDX on local cpu respectively. For now tdx_enable() is a placeholder. The TODO list will be pared down as functionality is added. Export both tdx_cpu_enable() and tdx_enable() for KVM use. In tdx_enable() use a state machine protected by mutex to make sure the initialization will only be done once, as tdx_enable() can be called multiple times (i.e. KVM module can be reloaded) and may be called concurrently by other kernel components in the future. The per-cpu initialization on each cpu can only be done once during the module's life time. Use a per-cpu variable to track its status to make sure it is only done once in tdx_cpu_enable(). Also, a SEAMCALL to do TDX module global initialization must be done once on any logical cpu before any per-cpu initialization SEAMCALL. Do it inside tdx_cpu_enable() too (if hasn't been done). tdx_enable() can potentially invoke SEAMCALLs on any online cpus. The per-cpu initialization must be done before those SEAMCALLs are invoked on some cpu. To keep things simple, in tdx_cpu_enable(), always do the per-cpu initialization regardless of whether the TDX module has been initialized or not. And in tdx_enable(), don't call tdx_cpu_enable() but assume the caller has disabled CPU hotplug, done VMXON and tdx_cpu_enable() on all online cpus before calling tdx_enable(). Signed-off-by: Kai Huang --- v11 -> v12: - Simplified TDX module global init and lp init status tracking (David). - Added comment around try_init_module_global() for using raw_spin_lock() (Dave). - Added one sentence to changelog to explain why to expose tdx_enable() and tdx_cpu_enable() (Dave). - Simplifed comments around tdx_enable() and tdx_cpu_enable() to use lockdep_assert_*() instead. (Dave) - Removed redundent "TDX" in error message (Dave). v10 -> v11: - Return -NODEV instead of -EINVAL when CONFIG_INTEL_TDX_HOST is off. - Return the actual error code for tdx_enable() instead of -EINVAL. - Added Isaku's Reviewed-by. v9 -> v10: - Merged the patch to handle per-cpu initialization to this patch to tell the story better. - Changed how to handle the per-cpu initialization to only provide a tdx_cpu_enable() function to let the user of TDX to do it when the user wants to run TDX code on a certain cpu. - Changed tdx_enable() to not call cpus_read_lock() explicitly, but call lockdep_assert_cpus_held() to assume the caller has done that. - Improved comments around tdx_enable() and tdx_cpu_enable(). - Improved changelog to tell the story better accordingly. v8 -> v9: - Removed detailed TODO list in the changelog (Dave). - Added back steps to do module global initialization and per-cpu initialization in the TODO list comment. - Moved the 'enum tdx_module_status_t' from tdx.c to local tdx.h v7 -> v8: - Refined changelog (Dave). - Removed "all BIOS-enabled cpus" related code (Peter/Thomas/Dave). - Add a "TODO list" comment in init_tdx_module() to list all steps of initializing the TDX Module to tell the story (Dave). - Made tdx_enable() unverisally return -EINVAL, and removed nonsense comments (Dave). - Simplified __tdx_enable() to only handle success or failure. - TDX_MODULE_SHUTDOWN -> TDX_MODULE_ERROR - Removed TDX_MODULE_NONE (not loaded) as it is not necessary. - Improved comments (Dave). - Pointed out 'tdx_module_status' is software thing (Dave). v6 -> v7: - No change. v5 -> v6: - Added code to set status to TDX_MODULE_NONE if TDX module is not loaded (Chao) - Added Chao's Reviewed-by. - Improved comments around cpus_read_lock(). - v3->v5 (no feedback on v4): - Removed the check that SEAMRR and TDX KeyID have been detected on all present cpus. - Removed tdx_detect(). - Added num_online_cpus() to MADT-enabled CPUs check within the CPU hotplug lock and return early with error message. - Improved dmesg printing for TDX module detection and initialization. --- arch/x86/include/asm/tdx.h | 4 + arch/x86/virt/vmx/tdx/tdx.c | 162 ++++++++++++++++++++++++++++++++++++ arch/x86/virt/vmx/tdx/tdx.h | 13 +++ 3 files changed, 179 insertions(+) diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 4dfe2e794411..d8226a50c58c 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -97,8 +97,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, #ifdef CONFIG_INTEL_TDX_HOST bool platform_tdx_enabled(void); +int tdx_cpu_enable(void); +int tdx_enable(void); #else /* !CONFIG_INTEL_TDX_HOST */ static inline bool platform_tdx_enabled(void) { return false; } +static inline int tdx_cpu_enable(void) { return -ENODEV; } +static inline int tdx_enable(void) { return -ENODEV; } #endif /* CONFIG_INTEL_TDX_HOST */ #endif /* !__ASSEMBLY__ */ diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 141d12376c4d..29ca18f66d61 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -13,6 +13,10 @@ #include #include #include +#include +#include +#include +#include #include #include #include @@ -23,6 +27,13 @@ static u32 tdx_global_keyid __ro_after_init; static u32 tdx_guest_keyid_start __ro_after_init; static u32 tdx_nr_guest_keyids __ro_after_init; +static bool tdx_global_initialized; +static DEFINE_RAW_SPINLOCK(tdx_global_init_lock); +static DEFINE_PER_CPU(bool, tdx_lp_initialized); + +static enum tdx_module_status_t tdx_module_status; +static DEFINE_MUTEX(tdx_module_lock); + /* * Wrapper of __seamcall() to convert SEAMCALL leaf function error code * to kernel error code. @seamcall_ret and @out contain the SEAMCALL @@ -74,6 +85,157 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, } } +/* + * Do the module global initialization if not done yet. + * It's always called with interrupts and preemption disabled. + */ +static int try_init_module_global(void) +{ + unsigned long flags; + int ret; + + /* + * The TDX module global initialization only needs to be done + * once on any cpu. + */ + raw_spin_lock_irqsave(&tdx_global_init_lock, flags); + + if (tdx_global_initialized) { + ret = 0; + goto out; + } + + /* All '0's are just unused parameters. */ + ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL); + if (!ret) + tdx_global_initialized = true; +out: + raw_spin_unlock_irqrestore(&tdx_global_init_lock, flags); + + return ret; +} + +/** + * tdx_cpu_enable - Enable TDX on local cpu + * + * Do one-time TDX module per-cpu initialization SEAMCALL (and TDX module + * global initialization SEAMCALL if not done) on local cpu to make this + * cpu be ready to run any other SEAMCALLs. + * + * Call this function with preemption disabled. + * + * Return 0 on success, otherwise errors. + */ +int tdx_cpu_enable(void) +{ + int ret; + + if (!platform_tdx_enabled()) + return -ENODEV; + + lockdep_assert_preemption_disabled(); + + /* Already done */ + if (__this_cpu_read(tdx_lp_initialized)) + return 0; + + /* + * The TDX module global initialization is the very first step + * to enable TDX. Need to do it first (if hasn't been done) + * before the per-cpu initialization. + */ + ret = try_init_module_global(); + if (ret) + return ret; + + /* All '0's are just unused parameters */ + ret = seamcall(TDH_SYS_LP_INIT, 0, 0, 0, 0, NULL, NULL); + if (ret) + return ret; + + __this_cpu_write(tdx_lp_initialized, true); + + return 0; +} +EXPORT_SYMBOL_GPL(tdx_cpu_enable); + +static int init_tdx_module(void) +{ + /* + * TODO: + * + * - Get TDX module information and TDX-capable memory regions. + * - Build the list of TDX-usable memory regions. + * - Construct a list of "TD Memory Regions" (TDMRs) to cover + * all TDX-usable memory regions. + * - Configure the TDMRs and the global KeyID to the TDX module. + * - Configure the global KeyID on all packages. + * - Initialize all TDMRs. + * + * Return error before all steps are done. + */ + return -EINVAL; +} + +static int __tdx_enable(void) +{ + int ret; + + ret = init_tdx_module(); + if (ret) { + pr_err("module initialization failed (%d)\n", ret); + tdx_module_status = TDX_MODULE_ERROR; + return ret; + } + + pr_info("module initialized.\n"); + tdx_module_status = TDX_MODULE_INITIALIZED; + + return 0; +} + +/** + * tdx_enable - Enable TDX module to make it ready to run TDX guests + * + * This function assumes the caller has: 1) held read lock of CPU hotplug + * lock to prevent any new cpu from becoming online; 2) done both VMXON + * and tdx_cpu_enable() on all online cpus. + * + * This function can be called in parallel by multiple callers. + * + * Return 0 if TDX is enabled successfully, otherwise error. + */ +int tdx_enable(void) +{ + int ret; + + if (!platform_tdx_enabled()) + return -ENODEV; + + lockdep_assert_cpus_held(); + + mutex_lock(&tdx_module_lock); + + switch (tdx_module_status) { + case TDX_MODULE_UNKNOWN: + ret = __tdx_enable(); + break; + case TDX_MODULE_INITIALIZED: + /* Already initialized, great, tell the caller. */ + ret = 0; + break; + default: + /* Failed to initialize in the previous attempts */ + ret = -EINVAL; + break; + } + + mutex_unlock(&tdx_module_lock); + + return ret; +} +EXPORT_SYMBOL_GPL(tdx_enable); + static int __init record_keyid_partitioning(u32 *tdx_keyid_start, u32 *nr_tdx_keyids) { diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h index 55dbb1b8c971..9fb46033c852 100644 --- a/arch/x86/virt/vmx/tdx/tdx.h +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -16,11 +16,24 @@ */ #define TDX_RND_NO_ENTROPY 0x8000020300000000ULL +/* + * TDX module SEAMCALL leaf functions + */ +#define TDH_SYS_INIT 33 +#define TDH_SYS_LP_INIT 35 + /* * Do not put any hardware-defined TDX structure representations below * this comment! */ +/* Kernel defined TDX module status during module initialization. */ +enum tdx_module_status_t { + TDX_MODULE_UNKNOWN, + TDX_MODULE_INITIALIZED, + TDX_MODULE_ERROR +}; + struct tdx_module_output; u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, struct tdx_module_output *out); From patchwork Mon Jun 26 14:12:38 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13292999 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C35AAEB64DA for ; Mon, 26 Jun 2023 14:15:15 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231315AbjFZOPN (ORCPT ); Mon, 26 Jun 2023 10:15:13 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35222 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231372AbjFZOOz (ORCPT ); Mon, 26 Jun 2023 10:14:55 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 25DF210C6; Mon, 26 Jun 2023 07:14:27 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687788867; x=1719324867; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Ixw5+FCRyQHe1x1K29AuSwyiCMpy80rpmMrG2obNERo=; b=XRrLSCijnxsq4FI22RHIIAcZMIu+eYh26ZQO8Z4yAnfsq3T5HLRh6ylR W6I0BUrjPo0zo3S0Gczkb45OWkNtQM9VgfJhNopyg5kQTrfW5Pk/HeerL BHahJAWf7oC4SO/FJpBAzStzCP8MTFnmNMrKkR3+PQV4u5/8tffOyWlpS ZNNaGIFoOMcVJxSQstmvzt2+ddr/6uvKbImUzZ/nF9wYc0xJSEj/4d1Zv Ltk5f8hSf34b6S+MQ8yawnYhdKlihUrDBm7OTUu5jWT7qw3niv3ocRTlz Q9PUIebiHbOXz8EYK4Vnj1+Xfs5u9UriakryLChJAVBjDFRJIR0EvhJ4H w==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346033732" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346033732" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:14:26 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292329" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292329" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:14:20 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 08/22] x86/virt/tdx: Get information about TDX module and TDX-capable memory Date: Tue, 27 Jun 2023 02:12:38 +1200 Message-Id: X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Start to transit out the "multi-steps" to initialize the TDX module. TDX provides increased levels of memory confidentiality and integrity. This requires special hardware support for features like memory encryption and storage of memory integrity checksums. Not all memory satisfies these requirements. As a result, TDX introduced the concept of a "Convertible Memory Region" (CMR). During boot, the firmware builds a list of all of the memory ranges which can provide the TDX security guarantees. CMRs tell the kernel which memory is TDX compatible. The kernel takes CMRs (plus a little more metadata) and constructs "TD Memory Regions" (TDMRs). TDMRs let the kernel grant TDX protections to some or all of the CMR areas. The TDX module also reports necessary information to let the kernel build TDMRs and run TDX guests in structure 'tdsysinfo_struct'. The list of CMRs, along with the TDX module information, is available to the kernel by querying the TDX module. As a preparation to construct TDMRs, get the TDX module information and the list of CMRs. Print out CMRs to help user to decode which memory regions are TDX convertible. The 'tdsysinfo_struct' is fairly large (1024 bytes) and contains a lot of info about the TDX module. Fully define the entire structure, but only use the fields necessary to build the TDMRs and pr_info() some basics about the module. The rest of the fields will get used by KVM. Signed-off-by: Kai Huang Reviewed-by: Isaku Yamahata --- v11 -> v12: - Changed to use dynamic allocation for TDSYSINFO_STRUCT and CMR array (Kirill). - Keep SEAMCALL leaf macro definitions in order (Kirill) - Removed is_cmr_empty() but open code directly (David) - 'atribute' -> 'attribute' (David) v10 -> v11: - No change. v9 -> v10: - Added back "start to transit out..." as now per-cpu init has been moved out from tdx_enable(). v8 -> v9: - Removed "start to trransit out ..." part in changelog since this patch is no longer the first step anymore. - Changed to declare 'tdsysinfo' and 'cmr_array' as local static, and changed changelog accordingly (Dave). - Improved changelog to explain why to declare 'tdsysinfo_struct' in full but only use a few members of them (Dave). v7 -> v8: (Dave) - Improved changelog to tell this is the first patch to transit out the "multi-steps" init_tdx_module(). - Removed all CMR check/trim code but to depend on later SEAMCALL. - Variable 'vertical alignment' in print TDX module information. - Added DECLARE_PADDED_STRUCT() for padded structure. - Made tdx_sysinfo and tdx_cmr_array[] to be function local variable (and rename them accordingly), and added -Wframe-larger-than=4096 flag to silence the build warning. v6 -> v7: - Simplified the check of CMRs due to the fact that TDX actually verifies CMRs (that are passed by the BIOS) before enabling TDX. - Changed the function name from check_cmrs() -> trim_empty_cmrs(). - Added CMR page aligned check so that later patch can just get the PFN using ">> PAGE_SHIFT". v5 -> v6: - Added to also print TDX module's attribute (Isaku). - Removed all arguments in tdx_gete_sysinfo() to use static variables of 'tdx_sysinfo' and 'tdx_cmr_array' directly as they are all used directly in other functions in later patches. - Added Isaku's Reviewed-by. - v3 -> v5 (no feedback on v4): - Renamed sanitize_cmrs() to check_cmrs(). - Removed unnecessary sanity check against tdx_sysinfo and tdx_cmr_array actual size returned by TDH.SYS.INFO. - Changed -EFAULT to -EINVAL in couple places. - Added comments around tdx_sysinfo and tdx_cmr_array saying they are used by TDH.SYS.INFO ABI. - Changed to pass 'tdx_sysinfo' and 'tdx_cmr_array' as function arguments in tdx_get_sysinfo(). - Changed to only print BIOS-CMR when check_cmrs() fails. --- arch/x86/virt/vmx/tdx/tdx.c | 79 ++++++++++++++++++++++++++++++++++++- arch/x86/virt/vmx/tdx/tdx.h | 60 ++++++++++++++++++++++++++++ 2 files changed, 137 insertions(+), 2 deletions(-) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 29ca18f66d61..a2129cbe056e 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include "tdx.h" @@ -159,12 +160,79 @@ int tdx_cpu_enable(void) } EXPORT_SYMBOL_GPL(tdx_cpu_enable); +static void print_cmrs(struct cmr_info *cmr_array, int nr_cmrs) +{ + int i; + + for (i = 0; i < nr_cmrs; i++) { + struct cmr_info *cmr = &cmr_array[i]; + + /* + * The array of CMRs reported via TDH.SYS.INFO can + * contain tail empty CMRs. Don't print them. + */ + if (!cmr->size) + break; + + pr_info("CMR: [0x%llx, 0x%llx)\n", cmr->base, + cmr->base + cmr->size); + } +} + +static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo, + struct cmr_info *cmr_array) +{ + struct tdx_module_output out; + u64 sysinfo_pa, cmr_array_pa; + int ret; + + sysinfo_pa = __pa(sysinfo); + cmr_array_pa = __pa(cmr_array); + ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE, + cmr_array_pa, MAX_CMRS, NULL, &out); + if (ret) + return ret; + + pr_info("TDX module: attributes 0x%x, vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u", + sysinfo->attributes, sysinfo->vendor_id, + sysinfo->major_version, sysinfo->minor_version, + sysinfo->build_date, sysinfo->build_num); + + /* R9 contains the actual entries written to the CMR array. */ + print_cmrs(cmr_array, out.r9); + + return 0; +} + static int init_tdx_module(void) { + struct tdsysinfo_struct *sysinfo; + struct cmr_info *cmr_array; + int ret; + + /* + * Get the TDSYSINFO_STRUCT and CMRs from the TDX module. + * + * The buffers of the TDSYSINFO_STRUCT and the CMR array passed + * to the TDX module must be 1024-bytes and 512-bytes aligned + * respectively. Allocate one page to accommodate them both and + * also meet those alignment requirements. + */ + sysinfo = (struct tdsysinfo_struct *)__get_free_page(GFP_KERNEL); + if (!sysinfo) + return -ENOMEM; + cmr_array = (struct cmr_info *)((unsigned long)sysinfo + PAGE_SIZE / 2); + + BUILD_BUG_ON(PAGE_SIZE / 2 < TDSYSINFO_STRUCT_SIZE); + BUILD_BUG_ON(PAGE_SIZE / 2 < sizeof(struct cmr_info) * MAX_CMRS); + + ret = tdx_get_sysinfo(sysinfo, cmr_array); + if (ret) + goto out; + /* * TODO: * - * - Get TDX module information and TDX-capable memory regions. * - Build the list of TDX-usable memory regions. * - Construct a list of "TD Memory Regions" (TDMRs) to cover * all TDX-usable memory regions. @@ -174,7 +242,14 @@ static int init_tdx_module(void) * * Return error before all steps are done. */ - return -EINVAL; + ret = -EINVAL; +out: + /* + * For now both @sysinfo and @cmr_array are only used during + * module initialization, so always free them. + */ + free_page((unsigned long)sysinfo); + return ret; } static int __tdx_enable(void) diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h index 9fb46033c852..8ab2d40971ea 100644 --- a/arch/x86/virt/vmx/tdx/tdx.h +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -3,6 +3,8 @@ #define _X86_VIRT_TDX_H #include +#include +#include /* * This file contains both macros and data structures defined by the TDX @@ -19,9 +21,67 @@ /* * TDX module SEAMCALL leaf functions */ +#define TDH_SYS_INFO 32 #define TDH_SYS_INIT 33 #define TDH_SYS_LP_INIT 35 +struct cmr_info { + u64 base; + u64 size; +} __packed; + +#define MAX_CMRS 32 + +struct cpuid_config { + u32 leaf; + u32 sub_leaf; + u32 eax; + u32 ebx; + u32 ecx; + u32 edx; +} __packed; + +#define TDSYSINFO_STRUCT_SIZE 1024 + +/* + * The size of this structure itself is flexible. The actual structure + * passed to TDH.SYS.INFO must be padded to 1024 bytes and be 1204-bytes + * aligned. + */ +struct tdsysinfo_struct { + /* TDX-SEAM Module Info */ + u32 attributes; + u32 vendor_id; + u32 build_date; + u16 build_num; + u16 minor_version; + u16 major_version; + u8 reserved0[14]; + /* Memory Info */ + u16 max_tdmrs; + u16 max_reserved_per_tdmr; + u16 pamt_entry_size; + u8 reserved1[10]; + /* Control Struct Info */ + u16 tdcs_base_size; + u8 reserved2[2]; + u16 tdvps_base_size; + u8 tdvps_xfam_dependent_size; + u8 reserved3[9]; + /* TD Capabilities */ + u64 attributes_fixed0; + u64 attributes_fixed1; + u64 xfam_fixed0; + u64 xfam_fixed1; + u8 reserved4[32]; + u32 num_cpuid_config; + /* + * The actual number of CPUID_CONFIG depends on above + * 'num_cpuid_config'. + */ + DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs); +} __packed; + /* * Do not put any hardware-defined TDX structure representations below * this comment! From patchwork Mon Jun 26 14:12:39 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13293000 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A9885EB64D7 for ; Mon, 26 Jun 2023 14:15:33 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231210AbjFZOPc (ORCPT ); Mon, 26 Jun 2023 10:15:32 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35416 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230274AbjFZOPI (ORCPT ); Mon, 26 Jun 2023 10:15:08 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B7AE41993; Mon, 26 Jun 2023 07:14:39 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687788879; x=1719324879; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=tr7/JhLkJTmM0KxdS6UE+PrfI2lhwEOuCowE3zg/i+k=; b=F6Ec4f4ASH6R8Ulv89jH3B3gxHOBCFbJr0jieWyn2gamWlxzTWDAo+CC oKe6OGukWUN45q7kf3CnlFSqUvgbPh6nwANzo4S3ShfRCFE3ZPxKQbWqw uO/9n9BO4KpYmJmLssSaWEqOw9Kj34mDkCBAaMNYVa5x2kwJH8d4W8b6q gqOTmgRjagEp+88sxE17d55xar/gOFaYtVP1HU2XgWVNMKadyAyJhgYzc ieg85MKLn5cox/76bZlSgtn8dut7fD5yDKpVuZT/JwteQCMdV3of8gMrz wx1JGJcFKsUzBlgbldv1hlNZNWKu5APyNxCZhbKMZyeiUp29A+jb5KxD9 g==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346033768" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346033768" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:14:33 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292340" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292340" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:14:26 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 09/22] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Date: Tue, 27 Jun 2023 02:12:39 +1200 Message-Id: <999b47f30fbe2535c37a5e8d602c6c27ac6212dd.1687784645.git.kai.huang@intel.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org As a step of initializing the TDX module, the kernel needs to tell the TDX module which memory regions can be used by the TDX module as TDX guest memory. TDX reports a list of "Convertible Memory Region" (CMR) to tell the kernel which memory is TDX compatible. The kernel needs to build a list of memory regions (out of CMRs) as "TDX-usable" memory and pass them to the TDX module. Once this is done, those "TDX-usable" memory regions are fixed during module's lifetime. To keep things simple, assume that all TDX-protected memory will come from the page allocator. Make sure all pages in the page allocator *are* TDX-usable memory. As TDX-usable memory is a fixed configuration, take a snapshot of the memory configuration from memblocks at the time of module initialization (memblocks are modified on memory hotplug). This snapshot is used to enable TDX support for *this* memory configuration only. Use a memory hotplug notifier to ensure that no other RAM can be added outside of this configuration. This approach requires all memblock memory regions at the time of module initialization to be TDX convertible memory to work, otherwise module initialization will fail in a later SEAMCALL when passing those regions to the module. This approach works when all boot-time "system RAM" is TDX convertible memory, and no non-TDX-convertible memory is hot-added to the core-mm before module initialization. For instance, on the first generation of TDX machines, both CXL memory and NVDIMM are not TDX convertible memory. Using kmem driver to hot-add any CXL memory or NVDIMM to the core-mm before module initialization will result in failure to initialize the module. The SEAMCALL error code will be available in the dmesg to help user to understand the failure. Signed-off-by: Kai Huang Reviewed-by: "Huang, Ying" Reviewed-by: Isaku Yamahata Reviewed-by: Dave Hansen Reviewed-by: Kirill A. Shutemov --- v11 -> v12: - Added tags from Dave/Kirill. v10 -> v11: - Added Isaku's Reviewed-by. v9 -> v10: - Moved empty @tdx_memlist check out of is_tdx_memory() to make the logic better. - Added Ying's Reviewed-by. v8 -> v9: - Replace "The initial support ..." with timeless sentence in both changelog and comments(Dave). - Fix run-on sentence in changelog, and senstence to explain why to stash off memblock (Dave). - Tried to improve why to choose this approach and how it work in changelog based on Dave's suggestion. - Many other comments enhancement (Dave). v7 -> v8: - Trimed down changelog (Dave). - Changed to use PHYS_PFN() and PFN_PHYS() throughout this series (Ying). - Moved memory hotplug handling from add_arch_memory() to memory_notifier (Dan/David). - Removed 'nid' from 'struct tdx_memblock' to later patch (Dave). - {build|free}_tdx_memory() -> {build|}free_tdx_memlist() (Dave). - Removed pfn_covered_by_cmr() check as no code to trim CMRs now. - Improve the comment around first 1MB (Dave). - Added a comment around reserve_real_mode() to point out TDX code relies on first 1MB being reserved (Ying). - Added comment to explain why the new online memory range cannot cross multiple TDX memory blocks (Dave). - Improved other comments (Dave). --- arch/x86/Kconfig | 1 + arch/x86/kernel/setup.c | 2 + arch/x86/virt/vmx/tdx/tdx.c | 161 +++++++++++++++++++++++++++++++++++- arch/x86/virt/vmx/tdx/tdx.h | 6 ++ 4 files changed, 169 insertions(+), 1 deletion(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index f0f3f1a2c8e0..2226d8a4c749 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1958,6 +1958,7 @@ config INTEL_TDX_HOST depends on X86_64 depends on KVM_INTEL depends on X86_X2APIC + select ARCH_KEEP_MEMBLOCK help Intel Trust Domain Extensions (TDX) protects guest VMs from malicious host and certain physical attacks. This option enables necessary TDX diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 16babff771bd..fd94f8186b9c 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -1159,6 +1159,8 @@ void __init setup_arch(char **cmdline_p) * * Moreover, on machines with SandyBridge graphics or in setups that use * crashkernel the entire 1M is reserved anyway. + * + * Note the host kernel TDX also requires the first 1MB being reserved. */ x86_platform.realmode_reserve(); diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index a2129cbe056e..127036f06752 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -17,6 +17,13 @@ #include #include #include +#include +#include +#include +#include +#include +#include +#include #include #include #include @@ -35,6 +42,9 @@ static DEFINE_PER_CPU(bool, tdx_lp_initialized); static enum tdx_module_status_t tdx_module_status; static DEFINE_MUTEX(tdx_module_lock); +/* All TDX-usable memory regions. Protected by mem_hotplug_lock. */ +static LIST_HEAD(tdx_memlist); + /* * Wrapper of __seamcall() to convert SEAMCALL leaf function error code * to kernel error code. @seamcall_ret and @out contain the SEAMCALL @@ -204,6 +214,79 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo, return 0; } +/* + * Add a memory region as a TDX memory block. The caller must make sure + * all memory regions are added in address ascending order and don't + * overlap. + */ +static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn, + unsigned long end_pfn) +{ + struct tdx_memblock *tmb; + + tmb = kmalloc(sizeof(*tmb), GFP_KERNEL); + if (!tmb) + return -ENOMEM; + + INIT_LIST_HEAD(&tmb->list); + tmb->start_pfn = start_pfn; + tmb->end_pfn = end_pfn; + + /* @tmb_list is protected by mem_hotplug_lock */ + list_add_tail(&tmb->list, tmb_list); + return 0; +} + +static void free_tdx_memlist(struct list_head *tmb_list) +{ + /* @tmb_list is protected by mem_hotplug_lock */ + while (!list_empty(tmb_list)) { + struct tdx_memblock *tmb = list_first_entry(tmb_list, + struct tdx_memblock, list); + + list_del(&tmb->list); + kfree(tmb); + } +} + +/* + * Ensure that all memblock memory regions are convertible to TDX + * memory. Once this has been established, stash the memblock + * ranges off in a secondary structure because memblock is modified + * in memory hotplug while TDX memory regions are fixed. + */ +static int build_tdx_memlist(struct list_head *tmb_list) +{ + unsigned long start_pfn, end_pfn; + int i, ret; + + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) { + /* + * The first 1MB is not reported as TDX convertible memory. + * Although the first 1MB is always reserved and won't end up + * to the page allocator, it is still in memblock's memory + * regions. Skip them manually to exclude them as TDX memory. + */ + start_pfn = max(start_pfn, PHYS_PFN(SZ_1M)); + if (start_pfn >= end_pfn) + continue; + + /* + * Add the memory regions as TDX memory. The regions in + * memblock has already guaranteed they are in address + * ascending order and don't overlap. + */ + ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn); + if (ret) + goto err; + } + + return 0; +err: + free_tdx_memlist(tmb_list); + return ret; +} + static int init_tdx_module(void) { struct tdsysinfo_struct *sysinfo; @@ -230,10 +313,25 @@ static int init_tdx_module(void) if (ret) goto out; + /* + * To keep things simple, assume that all TDX-protected memory + * will come from the page allocator. Make sure all pages in the + * page allocator are TDX-usable memory. + * + * Build the list of "TDX-usable" memory regions which cover all + * pages in the page allocator to guarantee that. Do it while + * holding mem_hotplug_lock read-lock as the memory hotplug code + * path reads the @tdx_memlist to reject any new memory. + */ + get_online_mems(); + + ret = build_tdx_memlist(&tdx_memlist); + if (ret) + goto out_put_tdxmem; + /* * TODO: * - * - Build the list of TDX-usable memory regions. * - Construct a list of "TD Memory Regions" (TDMRs) to cover * all TDX-usable memory regions. * - Configure the TDMRs and the global KeyID to the TDX module. @@ -243,6 +341,12 @@ static int init_tdx_module(void) * Return error before all steps are done. */ ret = -EINVAL; +out_put_tdxmem: + /* + * @tdx_memlist is written here and read at memory hotplug time. + * Lock out memory hotplug code while building it. + */ + put_online_mems(); out: /* * For now both @sysinfo and @cmr_array are only used during @@ -339,6 +443,54 @@ static int __init record_keyid_partitioning(u32 *tdx_keyid_start, return 0; } +static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn) +{ + struct tdx_memblock *tmb; + + /* + * This check assumes that the start_pfn<->end_pfn range does not + * cross multiple @tdx_memlist entries. A single memory online + * event across multiple memblocks (from which @tdx_memlist + * entries are derived at the time of module initialization) is + * not possible. This is because memory offline/online is done + * on granularity of 'struct memory_block', and the hotpluggable + * memory region (one memblock) must be multiple of memory_block. + */ + list_for_each_entry(tmb, &tdx_memlist, list) { + if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn) + return true; + } + return false; +} + +static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action, + void *v) +{ + struct memory_notify *mn = v; + + if (action != MEM_GOING_ONLINE) + return NOTIFY_OK; + + /* + * Empty list means TDX isn't enabled. Allow any memory + * to go online. + */ + if (list_empty(&tdx_memlist)) + return NOTIFY_OK; + + /* + * The TDX memory configuration is static and can not be + * changed. Reject onlining any memory which is outside of + * the static configuration whether it supports TDX or not. + */ + return is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages) ? + NOTIFY_OK : NOTIFY_BAD; +} + +static struct notifier_block tdx_memory_nb = { + .notifier_call = tdx_memory_notifier, +}; + static int __init tdx_init(void) { u32 tdx_keyid_start, nr_tdx_keyids; @@ -362,6 +514,13 @@ static int __init tdx_init(void) return -ENODEV; } + err = register_memory_notifier(&tdx_memory_nb); + if (err) { + pr_info("initialization failed: register_memory_notifier() failed (%d)\n", + err); + return -ENODEV; + } + /* * Just use the first TDX KeyID as the 'global KeyID' and * leave the rest for TDX guests. diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h index 8ab2d40971ea..37ee7c5dce1c 100644 --- a/arch/x86/virt/vmx/tdx/tdx.h +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -94,6 +94,12 @@ enum tdx_module_status_t { TDX_MODULE_ERROR }; +struct tdx_memblock { + struct list_head list; + unsigned long start_pfn; + unsigned long end_pfn; +}; + struct tdx_module_output; u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, struct tdx_module_output *out); From patchwork Mon Jun 26 14:12:40 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13293001 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 42D5EEB64DC for ; Mon, 26 Jun 2023 14:15:45 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231304AbjFZOPo (ORCPT ); Mon, 26 Jun 2023 10:15:44 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35038 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231317AbjFZOPO (ORCPT ); Mon, 26 Jun 2023 10:15:14 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5C2721BEB; Mon, 26 Jun 2023 07:14:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687788888; x=1719324888; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=wvJSBN1kOn+AYyboyGj5Bhc8rXHBHFIzE6dnntQRy8s=; b=U+96Ot/yUAP8UudKJNtPSglTN8Mn2ezMYJlOmM6GQznMVUqBkxfFTHYs kGDhwZIV/Kwk4BEVXdWBDSpwnqaAxcRzc2u0QTPicKwz0MYMMNDA8QOMA alwiKJI2sjcuanhqNsF/LohBb7NK3XrUQU7K8crGm6BjHcsQXebHVBzuH fUlLKhGPKx5wHZiibiJR2DSkk/GIN9trbyrJX0ZpAoACr7rebqSIEMOay PdtGLP3P537MXfKsR6op+SfWJ69XvWMtP/E9VGlGWay91KsAY65/MmXFm 8Z43qj7eoE6hx5d37lJ63lQvfK9RP1ca/CH5gASiFifbzA8Qg42nbxO4i w==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346033799" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346033799" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:14:40 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292348" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292348" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:14:33 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 10/22] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions Date: Tue, 27 Jun 2023 02:12:40 +1200 Message-Id: <9029b7497505c861c4a0d47b8f6e442df376bd32.1687784645.git.kai.huang@intel.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org After the kernel selects all TDX-usable memory regions, the kernel needs to pass those regions to the TDX module via data structure "TD Memory Region" (TDMR). Add a placeholder to construct a list of TDMRs (in multiple steps) to cover all TDX-usable memory regions. === Long Version === TDX provides increased levels of memory confidentiality and integrity. This requires special hardware support for features like memory encryption and storage of memory integrity checksums. Not all memory satisfies these requirements. As a result, TDX introduced the concept of a "Convertible Memory Region" (CMR). During boot, the firmware builds a list of all of the memory ranges which can provide the TDX security guarantees. The list of these ranges is available to the kernel by querying the TDX module. The TDX architecture needs additional metadata to record things like which TD guest "owns" a given page of memory. This metadata essentially serves as the 'struct page' for the TDX module. The space for this metadata is not reserved by the hardware up front and must be allocated by the kernel and given to the TDX module. Since this metadata consumes space, the VMM can choose whether or not to allocate it for a given area of convertible memory. If it chooses not to, the memory cannot receive TDX protections and can not be used by TDX guests as private memory. For every memory region that the VMM wants to use as TDX memory, it sets up a "TD Memory Region" (TDMR). Each TDMR represents a physically contiguous convertible range and must also have its own physically contiguous metadata table, referred to as a Physical Address Metadata Table (PAMT), to track status for each page in the TDMR range. Unlike a CMR, each TDMR requires 1G granularity and alignment. To support physical RAM areas that don't meet those strict requirements, each TDMR permits a number of internal "reserved areas" which can be placed over memory holes. If PAMT metadata is placed within a TDMR it must be covered by one of these reserved areas. Let's summarize the concepts: CMR - Firmware-enumerated physical ranges that support TDX. CMRs are 4K aligned. TDMR - Physical address range which is chosen by the kernel to support TDX. 1G granularity and alignment required. Each TDMR has reserved areas where TDX memory holes and overlapping PAMTs can be represented. PAMT - Physically contiguous TDX metadata. One table for each page size per TDMR. Roughly 1/256th of TDMR in size. 256G TDMR = ~1G PAMT. As one step of initializing the TDX module, the kernel configures TDX-usable memory regions by passing a list of TDMRs to the TDX module. Constructing the list of TDMRs consists below steps: 1) Fill out TDMRs to cover all memory regions that the TDX module will use for TD memory. 2) Allocate and set up PAMT for each TDMR. 3) Designate reserved areas for each TDMR. Add a placeholder to construct TDMRs to do the above steps. To keep things simple, just allocate enough space to hold maximum number of TDMRs up front. Always free the buffer of TDMRs since they are only used during module initialization. Signed-off-by: Kai Huang Reviewed-by: Isaku Yamahata Reviewed-by: Dave Hansen Reviewed-by: Kirill A. Shutemov --- v11 -> v12: - Added tags from Dave/Kirill. v10 -> v11: - Changed to keep TDMRs after module initialization to deal with TDX erratum in future patches. v9 -> v10: - Changed the TDMR list from static variable back to local variable as now TDX module isn't disabled when tdx_cpu_enable() fails. v8 -> v9: - Changes around 'struct tdmr_info_list' (Dave): - Moved the declaration from tdx.c to tdx.h. - Renamed 'first_tdmr' to 'tdmrs'. - 'nr_tdmrs' -> 'nr_consumed_tdmrs'. - Changed 'tdmrs' to 'void *'. - Improved comments for all structure members. - Added a missing empty line in alloc_tdmr_list() (Dave). v7 -> v8: - Improved changelog to tell this is one step of "TODO list" in init_tdx_module(). - Other changelog improvement suggested by Dave (with "Create TDMRs" to "Fill out TDMRs" to align with the code). - Added a "TODO list" comment to lay out the steps to construct TDMRs, following the same idea of "TODO list" in tdx_module_init(). - Introduced 'struct tdmr_info_list' (Dave) - Further added additional members (tdmr_sz/max_tdmrs/nr_tdmrs) to simplify getting TDMR by given index, and reduce passing arguments around functions. - Added alloc_tdmr_list()/free_tdmr_list() accordingly, which internally uses tdmr_size_single() (Dave). - tdmr_num -> nr_tdmrs (Dave). v6 -> v7: - Improved commit message to explain 'int' overflow cannot happen in cal_tdmr_size() and alloc_tdmr_array(). -- Andy/Dave. v5 -> v6: - construct_tdmrs_memblock() -> construct_tdmrs() as 'tdx_memblock' is used instead of memblock. - Added Isaku's Reviewed-by. - v3 -> v5 (no feedback on v4): - Moved calculating TDMR size to this patch. - Changed to use alloc_pages_exact() to allocate buffer for all TDMRs once, instead of allocating each TDMR individually. - Removed "crypto protection" in the changelog. - -EFAULT -> -EINVAL in couple of places. --- arch/x86/virt/vmx/tdx/tdx.c | 97 ++++++++++++++++++++++++++++++++++++- arch/x86/virt/vmx/tdx/tdx.h | 32 ++++++++++++ 2 files changed, 127 insertions(+), 2 deletions(-) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 127036f06752..e28615b60f9b 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include #include @@ -287,9 +288,84 @@ static int build_tdx_memlist(struct list_head *tmb_list) return ret; } +/* Calculate the actual TDMR size */ +static int tdmr_size_single(u16 max_reserved_per_tdmr) +{ + int tdmr_sz; + + /* + * The actual size of TDMR depends on the maximum + * number of reserved areas. + */ + tdmr_sz = sizeof(struct tdmr_info); + tdmr_sz += sizeof(struct tdmr_reserved_area) * max_reserved_per_tdmr; + + return ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT); +} + +static int alloc_tdmr_list(struct tdmr_info_list *tdmr_list, + struct tdsysinfo_struct *sysinfo) +{ + size_t tdmr_sz, tdmr_array_sz; + void *tdmr_array; + + tdmr_sz = tdmr_size_single(sysinfo->max_reserved_per_tdmr); + tdmr_array_sz = tdmr_sz * sysinfo->max_tdmrs; + + /* + * To keep things simple, allocate all TDMRs together. + * The buffer needs to be physically contiguous to make + * sure each TDMR is physically contiguous. + */ + tdmr_array = alloc_pages_exact(tdmr_array_sz, + GFP_KERNEL | __GFP_ZERO); + if (!tdmr_array) + return -ENOMEM; + + tdmr_list->tdmrs = tdmr_array; + + /* + * Keep the size of TDMR to find the target TDMR + * at a given index in the TDMR list. + */ + tdmr_list->tdmr_sz = tdmr_sz; + tdmr_list->max_tdmrs = sysinfo->max_tdmrs; + tdmr_list->nr_consumed_tdmrs = 0; + + return 0; +} + +static void free_tdmr_list(struct tdmr_info_list *tdmr_list) +{ + free_pages_exact(tdmr_list->tdmrs, + tdmr_list->max_tdmrs * tdmr_list->tdmr_sz); +} + +/* + * Construct a list of TDMRs on the preallocated space in @tdmr_list + * to cover all TDX memory regions in @tmb_list based on the TDX module + * information in @sysinfo. + */ +static int construct_tdmrs(struct list_head *tmb_list, + struct tdmr_info_list *tdmr_list, + struct tdsysinfo_struct *sysinfo) +{ + /* + * TODO: + * + * - Fill out TDMRs to cover all TDX memory regions. + * - Allocate and set up PAMTs for each TDMR. + * - Designate reserved areas for each TDMR. + * + * Return -EINVAL until constructing TDMRs is done + */ + return -EINVAL; +} + static int init_tdx_module(void) { struct tdsysinfo_struct *sysinfo; + struct tdmr_info_list tdmr_list; struct cmr_info *cmr_array; int ret; @@ -329,11 +405,19 @@ static int init_tdx_module(void) if (ret) goto out_put_tdxmem; + /* Allocate enough space for constructing TDMRs */ + ret = alloc_tdmr_list(&tdmr_list, sysinfo); + if (ret) + goto out_free_tdxmem; + + /* Cover all TDX-usable memory regions in TDMRs */ + ret = construct_tdmrs(&tdx_memlist, &tdmr_list, sysinfo); + if (ret) + goto out_free_tdmrs; + /* * TODO: * - * - Construct a list of "TD Memory Regions" (TDMRs) to cover - * all TDX-usable memory regions. * - Configure the TDMRs and the global KeyID to the TDX module. * - Configure the global KeyID on all packages. * - Initialize all TDMRs. @@ -341,6 +425,15 @@ static int init_tdx_module(void) * Return error before all steps are done. */ ret = -EINVAL; +out_free_tdmrs: + /* + * Always free the buffer of TDMRs as they are only used during + * module initialization. + */ + free_tdmr_list(&tdmr_list); +out_free_tdxmem: + if (ret) + free_tdx_memlist(&tdx_memlist); out_put_tdxmem: /* * @tdx_memlist is written here and read at memory hotplug time. diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h index 37ee7c5dce1c..193764afc602 100644 --- a/arch/x86/virt/vmx/tdx/tdx.h +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -82,6 +82,29 @@ struct tdsysinfo_struct { DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs); } __packed; +struct tdmr_reserved_area { + u64 offset; + u64 size; +} __packed; + +#define TDMR_INFO_ALIGNMENT 512 + +struct tdmr_info { + u64 base; + u64 size; + u64 pamt_1g_base; + u64 pamt_1g_size; + u64 pamt_2m_base; + u64 pamt_2m_size; + u64 pamt_4k_base; + u64 pamt_4k_size; + /* + * Actual number of reserved areas depends on + * 'struct tdsysinfo_struct'::max_reserved_per_tdmr. + */ + DECLARE_FLEX_ARRAY(struct tdmr_reserved_area, reserved_areas); +} __packed __aligned(TDMR_INFO_ALIGNMENT); + /* * Do not put any hardware-defined TDX structure representations below * this comment! @@ -100,6 +123,15 @@ struct tdx_memblock { unsigned long end_pfn; }; +struct tdmr_info_list { + void *tdmrs; /* Flexible array to hold 'tdmr_info's */ + int nr_consumed_tdmrs; /* How many 'tdmr_info's are in use */ + + /* Metadata for finding target 'tdmr_info' and freeing @tdmrs */ + int tdmr_sz; /* Size of one 'tdmr_info' */ + int max_tdmrs; /* How many 'tdmr_info's are allocated */ +}; + struct tdx_module_output; u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, struct tdx_module_output *out); From patchwork Mon Jun 26 14:12:41 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13293002 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4D322EB64DC for ; Mon, 26 Jun 2023 14:15:55 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229681AbjFZOPx (ORCPT ); Mon, 26 Jun 2023 10:15:53 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35186 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231397AbjFZOPa (ORCPT ); Mon, 26 Jun 2023 10:15:30 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 827ED212B; Mon, 26 Jun 2023 07:14:55 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687788895; x=1719324895; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=euqo82fz3Bf0isL6tyk0IDxgpH21RqTQiuaH2+Qt2xg=; b=NvB+Dko+QAot2ysKpDjaxlrp7XPcIfNOdtqpU+llMwJrF3AHX4GKpheo R7GAIbpcXs+pGMC+Mz7vW7ZcelAIAMzDEIb09zh1mrMO/UdMEWq0dndBx NVeTrP3ZoxW0lrNdeNPPAu5vEWqtPc4FV/QHyOcFOZVSIhNw/tLHpJXAO /2e5NdiUaJRtn/dJO5GlKm/G9UZZf3Qai96QwuVWNVPbMMAXkYY6kTP4Y Z2eKP90goYb1Hg0hQB9WrzOpTrRGU/mutG/P5/wV9yUlcyZwTz7IJgkv5 3jcMu9Rxf9y6Ii+XoC6eqpOJz4+d8iU5wHvwPHjoinWWh+05KtIF0OPQ7 A==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346033836" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346033836" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:14:47 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292354" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292354" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:14:40 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 11/22] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions Date: Tue, 27 Jun 2023 02:12:41 +1200 Message-Id: X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Start to transit out the "multi-steps" to construct a list of "TD Memory Regions" (TDMRs) to cover all TDX-usable memory regions. The kernel configures TDX-usable memory regions by passing a list of TDMRs "TD Memory Regions" (TDMRs) to the TDX module. Each TDMR contains the information of the base/size of a memory region, the base/size of the associated Physical Address Metadata Table (PAMT) and a list of reserved areas in the region. Do the first step to fill out a number of TDMRs to cover all TDX memory regions. To keep it simple, always try to use one TDMR for each memory region. As the first step only set up the base/size for each TDMR. Each TDMR must be 1G aligned and the size must be in 1G granularity. This implies that one TDMR could cover multiple memory regions. If a memory region spans the 1GB boundary and the former part is already covered by the previous TDMR, just use a new TDMR for the remaining part. TDX only supports a limited number of TDMRs. Disable TDX if all TDMRs are consumed but there is more memory region to cover. There are fancier things that could be done like trying to merge adjacent TDMRs. This would allow more pathological memory layouts to be supported. But, current systems are not even close to exhausting the existing TDMR resources in practice. For now, keep it simple. Signed-off-by: Kai Huang Reviewed-by: Kirill A. Shutemov Reviewed-by: Kuppuswamy Sathyanarayanan Reviewed-by: Yuan Yao --- v11 -> v12: - Improved comments around looping over TDX memblock to create TDMRs. (Dave). - Added code to pr_warn() when consumed TDMRs reaching maximum TDMRs (Dave). - BIT_ULL(30) -> SZ_1G (Kirill) - Removed unused TDMR_PFN_ALIGNMENT (Sathy) - Added tags from Kirill/Sathy v10 -> v11: - No update v9 -> v10: - No change. v8 -> v9: - Added the last paragraph in the changelog (Dave). - Removed unnecessary type cast in tdmr_entry() (Dave). --- arch/x86/virt/vmx/tdx/tdx.c | 103 +++++++++++++++++++++++++++++++++++- arch/x86/virt/vmx/tdx/tdx.h | 3 ++ 2 files changed, 105 insertions(+), 1 deletion(-) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index e28615b60f9b..2ffc1517a93b 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -341,6 +341,102 @@ static void free_tdmr_list(struct tdmr_info_list *tdmr_list) tdmr_list->max_tdmrs * tdmr_list->tdmr_sz); } +/* Get the TDMR from the list at the given index. */ +static struct tdmr_info *tdmr_entry(struct tdmr_info_list *tdmr_list, + int idx) +{ + int tdmr_info_offset = tdmr_list->tdmr_sz * idx; + + return (void *)tdmr_list->tdmrs + tdmr_info_offset; +} + +#define TDMR_ALIGNMENT SZ_1G +#define TDMR_ALIGN_DOWN(_addr) ALIGN_DOWN((_addr), TDMR_ALIGNMENT) +#define TDMR_ALIGN_UP(_addr) ALIGN((_addr), TDMR_ALIGNMENT) + +static inline u64 tdmr_end(struct tdmr_info *tdmr) +{ + return tdmr->base + tdmr->size; +} + +/* + * Take the memory referenced in @tmb_list and populate the + * preallocated @tdmr_list, following all the special alignment + * and size rules for TDMR. + */ +static int fill_out_tdmrs(struct list_head *tmb_list, + struct tdmr_info_list *tdmr_list) +{ + struct tdx_memblock *tmb; + int tdmr_idx = 0; + + /* + * Loop over TDX memory regions and fill out TDMRs to cover them. + * To keep it simple, always try to use one TDMR to cover one + * memory region. + * + * In practice TDX supports at least 64 TDMRs. A 2-socket system + * typically only consumes less than 10 of those. This code is + * dumb and simple and may use more TMDRs than is strictly + * required. + */ + list_for_each_entry(tmb, tmb_list, list) { + struct tdmr_info *tdmr = tdmr_entry(tdmr_list, tdmr_idx); + u64 start, end; + + start = TDMR_ALIGN_DOWN(PFN_PHYS(tmb->start_pfn)); + end = TDMR_ALIGN_UP(PFN_PHYS(tmb->end_pfn)); + + /* + * A valid size indicates the current TDMR has already + * been filled out to cover the previous memory region(s). + */ + if (tdmr->size) { + /* + * Loop to the next if the current memory region + * has already been fully covered. + */ + if (end <= tdmr_end(tdmr)) + continue; + + /* Otherwise, skip the already covered part. */ + if (start < tdmr_end(tdmr)) + start = tdmr_end(tdmr); + + /* + * Create a new TDMR to cover the current memory + * region, or the remaining part of it. + */ + tdmr_idx++; + if (tdmr_idx >= tdmr_list->max_tdmrs) { + pr_warn("initialization failed: TDMRs exhausted.\n"); + return -ENOSPC; + } + + tdmr = tdmr_entry(tdmr_list, tdmr_idx); + } + + tdmr->base = start; + tdmr->size = end - start; + } + + /* @tdmr_idx is always the index of the last valid TDMR. */ + tdmr_list->nr_consumed_tdmrs = tdmr_idx + 1; + + /* + * Warn early that kernel is about to run out of TDMRs. + * + * This is an indication that TDMR allocation has to be + * reworked to be smarter to not run into an issue. + */ + if (tdmr_list->max_tdmrs - tdmr_list->nr_consumed_tdmrs < TDMR_NR_WARN) + pr_warn("consumed TDMRs reaching limit: %d used out of %d\n", + tdmr_list->nr_consumed_tdmrs, + tdmr_list->max_tdmrs); + + return 0; +} + /* * Construct a list of TDMRs on the preallocated space in @tdmr_list * to cover all TDX memory regions in @tmb_list based on the TDX module @@ -350,10 +446,15 @@ static int construct_tdmrs(struct list_head *tmb_list, struct tdmr_info_list *tdmr_list, struct tdsysinfo_struct *sysinfo) { + int ret; + + ret = fill_out_tdmrs(tmb_list, tdmr_list); + if (ret) + return ret; + /* * TODO: * - * - Fill out TDMRs to cover all TDX memory regions. * - Allocate and set up PAMTs for each TDMR. * - Designate reserved areas for each TDMR. * diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h index 193764afc602..3086f7ad0522 100644 --- a/arch/x86/virt/vmx/tdx/tdx.h +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -123,6 +123,9 @@ struct tdx_memblock { unsigned long end_pfn; }; +/* Warn if kernel has less than TDMR_NR_WARN TDMRs after allocation */ +#define TDMR_NR_WARN 4 + struct tdmr_info_list { void *tdmrs; /* Flexible array to hold 'tdmr_info's */ int nr_consumed_tdmrs; /* How many 'tdmr_info's are in use */ From patchwork Mon Jun 26 14:12:42 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13293003 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1000EC0015E for ; Mon, 26 Jun 2023 14:16:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231432AbjFZOQE (ORCPT ); Mon, 26 Jun 2023 10:16:04 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35886 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230159AbjFZOPg (ORCPT ); Mon, 26 Jun 2023 10:15:36 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 88DE51727; Mon, 26 Jun 2023 07:15:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687788908; x=1719324908; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=CJCYWPgwh1KbGUfRPHdaovtkQz/KdxfPG8N6GA0833Y=; b=LpDcgoIEt6D0yRigWdCNnULc6vq9x9wagkdxWqJJNYzPc9gV375XJfOo /rjvzv+M/UkcXwvOAgTVZs/ifX5ePe2v9ZbQVMKTsGFvyZG0/WKMK02MO Gm1bsnF5uT8dWf1LhXBTDBhP2CombMZBk+jNsCyA1RECEfRHa9oCgV0K5 bec0nkrW/iKYrMcEIj2KO+gej1eRRIZPVtGmMiylUVFqphkW0/qT3Xqcd 3TFNF027Z+dNH/bXzJ6cayWi8ucmCrr0UHZZSRMeen9Pr/y66k+Xw1Bt2 l53erz4g7wsn0AbSfvuI930jZ2Bz3ORNrKagpdffcI2L6RKiaIpf1Xdpb w==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346033873" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346033873" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:14:56 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292365" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292365" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:14:47 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 12/22] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Date: Tue, 27 Jun 2023 02:12:42 +1200 Message-Id: <85ea233226ec7a05e8c5627a499e97ea4cbd6950.1687784645.git.kai.huang@intel.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org The TDX module uses additional metadata to record things like which guest "owns" a given page of memory. This metadata, referred as Physical Address Metadata Table (PAMT), essentially serves as the 'struct page' for the TDX module. PAMTs are not reserved by hardware up front. They must be allocated by the kernel and then given to the TDX module during module initialization. TDX supports 3 page sizes: 4K, 2M, and 1G. Each "TD Memory Region" (TDMR) has 3 PAMTs to track the 3 supported page sizes. Each PAMT must be a physically contiguous area from a Convertible Memory Region (CMR). However, the PAMTs which track pages in one TDMR do not need to reside within that TDMR but can be anywhere in CMRs. If one PAMT overlaps with any TDMR, the overlapping part must be reported as a reserved area in that particular TDMR. Use alloc_contig_pages() since PAMT must be a physically contiguous area and it may be potentially large (~1/256th of the size of the given TDMR). The downside is alloc_contig_pages() may fail at runtime. One (bad) mitigation is to launch a TDX guest early during system boot to get those PAMTs allocated at early time, but the only way to fix is to add a boot option to allocate or reserve PAMTs during kernel boot. It is imperfect but will be improved on later. TDX only supports a limited number of reserved areas per TDMR to cover both PAMTs and memory holes within the given TDMR. If many PAMTs are allocated within a single TDMR, the reserved areas may not be sufficient to cover all of them. Adopt the following policies when allocating PAMTs for a given TDMR: - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize the total number of reserved areas consumed for PAMTs. - Try to first allocate PAMT from the local node of the TDMR for better NUMA locality. Also dump out how many pages are allocated for PAMTs when the TDX module is initialized successfully. This helps answer the eternal "where did all my memory go?" questions. Signed-off-by: Kai Huang Reviewed-by: Isaku Yamahata Reviewed-by: Dave Hansen Reviewed-by: Kirill A. Shutemov Reviewed-by: Yuan Yao --- v11 -> v12: - Moved TDX_PS_NUM from tdx.c to (Kirill) - "<= TDX_PS_1G" -> "< TDX_PS_NUM" (Kirill) - Changed tdmr_get_pamt() to return base and size instead of base_pfn and npages and related code directly (Dave). - Simplified PAMT kb counting. (Dave) - tdmrs_count_pamt_pages() -> tdmr_count_pamt_kb() (Kirill/Dave) v10 -> v11: - No update v9 -> v10: - Removed code change in disable_tdx_module() as it doesn't exist anymore. v8 -> v9: - Added TDX_PS_NR macro instead of open-coding (Dave). - Better alignment of 'pamt_entry_size' in tdmr_set_up_pamt() (Dave). - Changed to print out PAMTs in "KBs" instead of "pages" (Dave). - Added Dave's Reviewed-by. v7 -> v8: (Dave) - Changelog: - Added a sentence to state PAMT allocation will be improved. - Others suggested by Dave. - Moved 'nid' of 'struct tdx_memblock' to this patch. - Improved comments around tdmr_get_nid(). - WARN_ON_ONCE() -> pr_warn() in tdmr_get_nid(). - Other changes due to 'struct tdmr_info_list'. v6 -> v7: - Changes due to using macros instead of 'enum' for TDX supported page sizes. v5 -> v6: - Rebase due to using 'tdx_memblock' instead of memblock. - 'int pamt_entry_nr' -> 'unsigned long nr_pamt_entries' (Dave/Sagis). - Improved comment around tdmr_get_nid() (Dave). - Improved comment in tdmr_set_up_pamt() around breaking the PAMT into PAMTs for 4K/2M/1G (Dave). - tdmrs_get_pamt_pages() -> tdmrs_count_pamt_pages() (Dave). - v3 -> v5 (no feedback on v4): - Used memblock to get the NUMA node for given TDMR. - Removed tdmr_get_pamt_sz() helper but use open-code instead. - Changed to use 'switch .. case..' for each TDX supported page size in tdmr_get_pamt_sz() (the original __tdmr_get_pamt_sz()). - Added printing out memory used for PAMT allocation when TDX module is initialized successfully. - Explained downside of alloc_contig_pages() in changelog. - Addressed other minor comments. --- arch/x86/Kconfig | 1 + arch/x86/include/asm/tdx.h | 1 + arch/x86/virt/vmx/tdx/tdx.c | 215 +++++++++++++++++++++++++++++++++++- arch/x86/virt/vmx/tdx/tdx.h | 1 + 4 files changed, 213 insertions(+), 5 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 2226d8a4c749..ad364f01de33 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST depends on KVM_INTEL depends on X86_X2APIC select ARCH_KEEP_MEMBLOCK + depends on CONTIG_ALLOC help Intel Trust Domain Extensions (TDX) protects guest VMs from malicious host and certain physical attacks. This option enables necessary TDX diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index d8226a50c58c..91416fd600cd 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -24,6 +24,7 @@ #define TDX_PS_4K 0 #define TDX_PS_2M 1 #define TDX_PS_1G 2 +#define TDX_PS_NR (TDX_PS_1G + 1) /* * Used to gather the output registers values of the TDCALL and SEAMCALL diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 2ffc1517a93b..fd5417577f26 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -221,7 +221,7 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo, * overlap. */ static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn, - unsigned long end_pfn) + unsigned long end_pfn, int nid) { struct tdx_memblock *tmb; @@ -232,6 +232,7 @@ static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn, INIT_LIST_HEAD(&tmb->list); tmb->start_pfn = start_pfn; tmb->end_pfn = end_pfn; + tmb->nid = nid; /* @tmb_list is protected by mem_hotplug_lock */ list_add_tail(&tmb->list, tmb_list); @@ -259,9 +260,9 @@ static void free_tdx_memlist(struct list_head *tmb_list) static int build_tdx_memlist(struct list_head *tmb_list) { unsigned long start_pfn, end_pfn; - int i, ret; + int i, nid, ret; - for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) { + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) { /* * The first 1MB is not reported as TDX convertible memory. * Although the first 1MB is always reserved and won't end up @@ -277,7 +278,7 @@ static int build_tdx_memlist(struct list_head *tmb_list) * memblock has already guaranteed they are in address * ascending order and don't overlap. */ - ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn); + ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn, nid); if (ret) goto err; } @@ -437,6 +438,202 @@ static int fill_out_tdmrs(struct list_head *tmb_list, return 0; } +/* + * Calculate PAMT size given a TDMR and a page size. The returned + * PAMT size is always aligned up to 4K page boundary. + */ +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz, + u16 pamt_entry_size) +{ + unsigned long pamt_sz, nr_pamt_entries; + + switch (pgsz) { + case TDX_PS_4K: + nr_pamt_entries = tdmr->size >> PAGE_SHIFT; + break; + case TDX_PS_2M: + nr_pamt_entries = tdmr->size >> PMD_SHIFT; + break; + case TDX_PS_1G: + nr_pamt_entries = tdmr->size >> PUD_SHIFT; + break; + default: + WARN_ON_ONCE(1); + return 0; + } + + pamt_sz = nr_pamt_entries * pamt_entry_size; + /* TDX requires PAMT size must be 4K aligned */ + pamt_sz = ALIGN(pamt_sz, PAGE_SIZE); + + return pamt_sz; +} + +/* + * Locate a NUMA node which should hold the allocation of the @tdmr + * PAMT. This node will have some memory covered by the TDMR. The + * relative amount of memory covered is not considered. + */ +static int tdmr_get_nid(struct tdmr_info *tdmr, struct list_head *tmb_list) +{ + struct tdx_memblock *tmb; + + /* + * A TDMR must cover at least part of one TMB. That TMB will end + * after the TDMR begins. But, that TMB may have started before + * the TDMR. Find the next 'tmb' that _ends_ after this TDMR + * begins. Ignore 'tmb' start addresses. They are irrelevant. + */ + list_for_each_entry(tmb, tmb_list, list) { + if (tmb->end_pfn > PHYS_PFN(tdmr->base)) + return tmb->nid; + } + + /* + * Fall back to allocating the TDMR's metadata from node 0 when + * no TDX memory block can be found. This should never happen + * since TDMRs originate from TDX memory blocks. + */ + pr_warn("TDMR [0x%llx, 0x%llx): unable to find local NUMA node for PAMT allocation, fallback to use node 0.\n", + tdmr->base, tdmr_end(tdmr)); + return 0; +} + +/* + * Allocate PAMTs from the local NUMA node of some memory in @tmb_list + * within @tdmr, and set up PAMTs for @tdmr. + */ +static int tdmr_set_up_pamt(struct tdmr_info *tdmr, + struct list_head *tmb_list, + u16 pamt_entry_size) +{ + unsigned long pamt_base[TDX_PS_NR]; + unsigned long pamt_size[TDX_PS_NR]; + unsigned long tdmr_pamt_base; + unsigned long tdmr_pamt_size; + struct page *pamt; + int pgsz, nid; + + nid = tdmr_get_nid(tdmr, tmb_list); + + /* + * Calculate the PAMT size for each TDX supported page size + * and the total PAMT size. + */ + tdmr_pamt_size = 0; + for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR ; pgsz++) { + pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz, + pamt_entry_size); + tdmr_pamt_size += pamt_size[pgsz]; + } + + /* + * Allocate one chunk of physically contiguous memory for all + * PAMTs. This helps minimize the PAMT's use of reserved areas + * in overlapped TDMRs. + */ + pamt = alloc_contig_pages(tdmr_pamt_size >> PAGE_SHIFT, GFP_KERNEL, + nid, &node_online_map); + if (!pamt) + return -ENOMEM; + + /* + * Break the contiguous allocation back up into the + * individual PAMTs for each page size. + */ + tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT; + for (pgsz = TDX_PS_4K; pgsz < TDX_PS_NR; pgsz++) { + pamt_base[pgsz] = tdmr_pamt_base; + tdmr_pamt_base += pamt_size[pgsz]; + } + + tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; + tdmr->pamt_4k_size = pamt_size[TDX_PS_4K]; + tdmr->pamt_2m_base = pamt_base[TDX_PS_2M]; + tdmr->pamt_2m_size = pamt_size[TDX_PS_2M]; + tdmr->pamt_1g_base = pamt_base[TDX_PS_1G]; + tdmr->pamt_1g_size = pamt_size[TDX_PS_1G]; + + return 0; +} + +static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_base, + unsigned long *pamt_size) +{ + unsigned long pamt_bs, pamt_sz; + + /* + * The PAMT was allocated in one contiguous unit. The 4K PAMT + * should always point to the beginning of that allocation. + */ + pamt_bs = tdmr->pamt_4k_base; + pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size; + + WARN_ON_ONCE((pamt_bs & ~PAGE_MASK) || (pamt_sz & ~PAGE_MASK)); + + *pamt_base = pamt_bs; + *pamt_size = pamt_sz; +} + +static void tdmr_free_pamt(struct tdmr_info *tdmr) +{ + unsigned long pamt_base, pamt_size; + + tdmr_get_pamt(tdmr, &pamt_base, &pamt_size); + + /* Do nothing if PAMT hasn't been allocated for this TDMR */ + if (!pamt_size) + return; + + if (WARN_ON_ONCE(!pamt_base)) + return; + + free_contig_range(pamt_base >> PAGE_SHIFT, pamt_size >> PAGE_SHIFT); +} + +static void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list) +{ + int i; + + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) + tdmr_free_pamt(tdmr_entry(tdmr_list, i)); +} + +/* Allocate and set up PAMTs for all TDMRs */ +static int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list, + struct list_head *tmb_list, + u16 pamt_entry_size) +{ + int i, ret = 0; + + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) { + ret = tdmr_set_up_pamt(tdmr_entry(tdmr_list, i), tmb_list, + pamt_entry_size); + if (ret) + goto err; + } + + return 0; +err: + tdmrs_free_pamt_all(tdmr_list); + return ret; +} + +static unsigned long tdmrs_count_pamt_kb(struct tdmr_info_list *tdmr_list) +{ + unsigned long pamt_size = 0; + int i; + + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) { + unsigned long base, size; + + tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size); + pamt_size += size; + } + + return pamt_size / 1024; +} + /* * Construct a list of TDMRs on the preallocated space in @tdmr_list * to cover all TDX memory regions in @tmb_list based on the TDX module @@ -452,10 +649,13 @@ static int construct_tdmrs(struct list_head *tmb_list, if (ret) return ret; + ret = tdmrs_set_up_pamt_all(tdmr_list, tmb_list, + sysinfo->pamt_entry_size); + if (ret) + return ret; /* * TODO: * - * - Allocate and set up PAMTs for each TDMR. * - Designate reserved areas for each TDMR. * * Return -EINVAL until constructing TDMRs is done @@ -526,6 +726,11 @@ static int init_tdx_module(void) * Return error before all steps are done. */ ret = -EINVAL; + if (ret) + tdmrs_free_pamt_all(&tdmr_list); + else + pr_info("%lu KBs allocated for PAMT.\n", + tdmrs_count_pamt_kb(&tdmr_list)); out_free_tdmrs: /* * Always free the buffer of TDMRs as they are only used during diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h index 3086f7ad0522..9b5a65f37e8b 100644 --- a/arch/x86/virt/vmx/tdx/tdx.h +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -121,6 +121,7 @@ struct tdx_memblock { struct list_head list; unsigned long start_pfn; unsigned long end_pfn; + int nid; }; /* Warn if kernel has less than TDMR_NR_WARN TDMRs after allocation */ From patchwork Mon Jun 26 14:12:43 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13293004 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A1ED5EB64D7 for ; Mon, 26 Jun 2023 14:16:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229805AbjFZOQK (ORCPT ); Mon, 26 Jun 2023 10:16:10 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35638 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230192AbjFZOPr (ORCPT ); Mon, 26 Jun 2023 10:15:47 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 741551734; Mon, 26 Jun 2023 07:15:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687788914; x=1719324914; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=8oV4KREwo5PEDhmXx7KbXfkbLGkGBGLVGetPY+xR228=; b=kF1n26RsQ1aUstlH35TNGpMEtXHzrysNRVKsi18yKH5w55cMmCOpTLLO 7Z8R3XHTt2nJS4G93jcxT6AKHyDkyJtaxQ/tIW52xVPKHTRjaECXIw/3j vd2WZvbSIZoJ+ZCgRsnV+vUlXupf0/tXcSh9CgiaGt8ApcOLSGUEnor0d kRIq636t8HMXLXt2bg0xOLhSsiOErcF2kmB7BW7G8Dr4TK9XchO768MPq 0e1/5irEEKO6HC6xIl6qNy5btBZK52sVyVFMw+rKWEVdTb51Hgy1tkUhx 8bDAZFmcWlN1xKzyRohl2HH/5ZDr6CJpvlBcAt5JuZ1GcMjheOaj1jb9K Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346033901" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346033901" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:15:02 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292378" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292378" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:14:55 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 13/22] x86/virt/tdx: Designate reserved areas for all TDMRs Date: Tue, 27 Jun 2023 02:12:43 +1200 Message-Id: <932971243b1b842a59d3fb2b6506823bd732db18.1687784645.git.kai.huang@intel.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org As the last step of constructing TDMRs, populate reserved areas for all TDMRs. For each TDMR, put all memory holes within this TDMR to the reserved areas. And for all PAMTs which overlap with this TDMR, put all the overlapping parts to reserved areas too. Signed-off-by: Kai Huang Reviewed-by: Isaku Yamahata Reviewed-by: Kirill A. Shutemov Reviewed-by: Yuan Yao --- v11 -> v12: - Code change due to tdmr_get_pamt() change from returning pfn/npages to base/size - Added Kirill's tag v10 -> v11: - No update v9 -> v10: - No change. v8 -> v9: - Added comment around 'tdmr_add_rsvd_area()' to point out it doesn't do optimization to save reserved areas. (Dave). v7 -> v8: (Dave) - "set_up" -> "populate" in function name change (Dave). - Improved comment suggested by Dave. - Other changes due to 'struct tdmr_info_list'. v6 -> v7: - No change. v5 -> v6: - Rebase due to using 'tdx_memblock' instead of memblock. - Split tdmr_set_up_rsvd_areas() into two functions to handle memory hole and PAMT respectively. - Added Isaku's Reviewed-by. --- arch/x86/virt/vmx/tdx/tdx.c | 217 ++++++++++++++++++++++++++++++++++-- 1 file changed, 209 insertions(+), 8 deletions(-) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index fd5417577f26..2bcace5cb25c 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include #include @@ -634,6 +635,207 @@ static unsigned long tdmrs_count_pamt_kb(struct tdmr_info_list *tdmr_list) return pamt_size / 1024; } +static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr, + u64 size, u16 max_reserved_per_tdmr) +{ + struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas; + int idx = *p_idx; + + /* Reserved area must be 4K aligned in offset and size */ + if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK)) + return -EINVAL; + + if (idx >= max_reserved_per_tdmr) { + pr_warn("initialization failed: TDMR [0x%llx, 0x%llx): reserved areas exhausted.\n", + tdmr->base, tdmr_end(tdmr)); + return -ENOSPC; + } + + /* + * Consume one reserved area per call. Make no effort to + * optimize or reduce the number of reserved areas which are + * consumed by contiguous reserved areas, for instance. + */ + rsvd_areas[idx].offset = addr - tdmr->base; + rsvd_areas[idx].size = size; + + *p_idx = idx + 1; + + return 0; +} + +/* + * Go through @tmb_list to find holes between memory areas. If any of + * those holes fall within @tdmr, set up a TDMR reserved area to cover + * the hole. + */ +static int tdmr_populate_rsvd_holes(struct list_head *tmb_list, + struct tdmr_info *tdmr, + int *rsvd_idx, + u16 max_reserved_per_tdmr) +{ + struct tdx_memblock *tmb; + u64 prev_end; + int ret; + + /* + * Start looking for reserved blocks at the + * beginning of the TDMR. + */ + prev_end = tdmr->base; + list_for_each_entry(tmb, tmb_list, list) { + u64 start, end; + + start = PFN_PHYS(tmb->start_pfn); + end = PFN_PHYS(tmb->end_pfn); + + /* Break if this region is after the TDMR */ + if (start >= tdmr_end(tdmr)) + break; + + /* Exclude regions before this TDMR */ + if (end < tdmr->base) + continue; + + /* + * Skip over memory areas that + * have already been dealt with. + */ + if (start <= prev_end) { + prev_end = end; + continue; + } + + /* Add the hole before this region */ + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end, + start - prev_end, + max_reserved_per_tdmr); + if (ret) + return ret; + + prev_end = end; + } + + /* Add the hole after the last region if it exists. */ + if (prev_end < tdmr_end(tdmr)) { + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end, + tdmr_end(tdmr) - prev_end, + max_reserved_per_tdmr); + if (ret) + return ret; + } + + return 0; +} + +/* + * Go through @tdmr_list to find all PAMTs. If any of those PAMTs + * overlaps with @tdmr, set up a TDMR reserved area to cover the + * overlapping part. + */ +static int tdmr_populate_rsvd_pamts(struct tdmr_info_list *tdmr_list, + struct tdmr_info *tdmr, + int *rsvd_idx, + u16 max_reserved_per_tdmr) +{ + int i, ret; + + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) { + struct tdmr_info *tmp = tdmr_entry(tdmr_list, i); + unsigned long pamt_base, pamt_size, pamt_end; + + tdmr_get_pamt(tmp, &pamt_base, &pamt_size); + /* Each TDMR must already have PAMT allocated */ + WARN_ON_ONCE(!pamt_size|| !pamt_base); + + pamt_end = pamt_base + pamt_size; + /* Skip PAMTs outside of the given TDMR */ + if ((pamt_end <= tdmr->base) || + (pamt_base >= tdmr_end(tdmr))) + continue; + + /* Only mark the part within the TDMR as reserved */ + if (pamt_base < tdmr->base) + pamt_base = tdmr->base; + if (pamt_end > tdmr_end(tdmr)) + pamt_end = tdmr_end(tdmr); + + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, pamt_base, + pamt_end - pamt_base, + max_reserved_per_tdmr); + if (ret) + return ret; + } + + return 0; +} + +/* Compare function called by sort() for TDMR reserved areas */ +static int rsvd_area_cmp_func(const void *a, const void *b) +{ + struct tdmr_reserved_area *r1 = (struct tdmr_reserved_area *)a; + struct tdmr_reserved_area *r2 = (struct tdmr_reserved_area *)b; + + if (r1->offset + r1->size <= r2->offset) + return -1; + if (r1->offset >= r2->offset + r2->size) + return 1; + + /* Reserved areas cannot overlap. The caller must guarantee. */ + WARN_ON_ONCE(1); + return -1; +} + +/* + * Populate reserved areas for the given @tdmr, including memory holes + * (via @tmb_list) and PAMTs (via @tdmr_list). + */ +static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr, + struct list_head *tmb_list, + struct tdmr_info_list *tdmr_list, + u16 max_reserved_per_tdmr) +{ + int ret, rsvd_idx = 0; + + ret = tdmr_populate_rsvd_holes(tmb_list, tdmr, &rsvd_idx, + max_reserved_per_tdmr); + if (ret) + return ret; + + ret = tdmr_populate_rsvd_pamts(tdmr_list, tdmr, &rsvd_idx, + max_reserved_per_tdmr); + if (ret) + return ret; + + /* TDX requires reserved areas listed in address ascending order */ + sort(tdmr->reserved_areas, rsvd_idx, sizeof(struct tdmr_reserved_area), + rsvd_area_cmp_func, NULL); + + return 0; +} + +/* + * Populate reserved areas for all TDMRs in @tdmr_list, including memory + * holes (via @tmb_list) and PAMTs. + */ +static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list, + struct list_head *tmb_list, + u16 max_reserved_per_tdmr) +{ + int i; + + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) { + int ret; + + ret = tdmr_populate_rsvd_areas(tdmr_entry(tdmr_list, i), + tmb_list, tdmr_list, max_reserved_per_tdmr); + if (ret) + return ret; + } + + return 0; +} + /* * Construct a list of TDMRs on the preallocated space in @tdmr_list * to cover all TDX memory regions in @tmb_list based on the TDX module @@ -653,14 +855,13 @@ static int construct_tdmrs(struct list_head *tmb_list, sysinfo->pamt_entry_size); if (ret) return ret; - /* - * TODO: - * - * - Designate reserved areas for each TDMR. - * - * Return -EINVAL until constructing TDMRs is done - */ - return -EINVAL; + + ret = tdmrs_populate_rsvd_areas_all(tdmr_list, tmb_list, + sysinfo->max_reserved_per_tdmr); + if (ret) + tdmrs_free_pamt_all(tdmr_list); + + return ret; } static int init_tdx_module(void) From patchwork Mon Jun 26 14:12:44 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13293005 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 247C7EB64D7 for ; Mon, 26 Jun 2023 14:16:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231467AbjFZOQW (ORCPT ); Mon, 26 Jun 2023 10:16:22 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35240 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231390AbjFZOPy (ORCPT ); Mon, 26 Jun 2023 10:15:54 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 05BD626B7; Mon, 26 Jun 2023 07:15:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687788921; x=1719324921; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=7sUUC3QHhe9v8agSed4LLafdaNDuQsaczLJKmfjGoz0=; b=AORaLCY36TQldmvEjWQeGtBKQ+CzUx1Cj3oXZelipvUWeTt3xbIsxN8E CQsj0oRlxZGLuofVy3/IcrUyKJl2tCd299e+U+JLmJYz23Wput6V8HwM5 hm7O+VX4Eu7Ow7vSbtqYrCe2kKwCeZn9bEdMP2h8mqZc9cZdr0qLvQFa7 rXK5PsrwY7qmRtyRKDitleUVCRiC+6XOqYqF+dVbzBQ7JVDnrmxW3Khmf 9TrGOZui9pW4D7ob61Z6h4UUYY4eV5Djm+QUflfXR0CRGLauniomvSZmx gCIE6/x3Rt4B5Im7fN+6LQRAyF4bypwCx/emQzr5n+Xef2a4UJnEefsfD g==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346033934" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346033934" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:15:09 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292396" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292396" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:15:02 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 14/22] x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID Date: Tue, 27 Jun 2023 02:12:44 +1200 Message-Id: <0978700f954d311a5580b746ec44124d1cb65c28.1687784645.git.kai.huang@intel.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org The TDX module uses a private KeyID as the "global KeyID" for mapping things like the PAMT and other TDX metadata. This KeyID has already been reserved when detecting TDX during the kernel early boot. After the list of "TD Memory Regions" (TDMRs) has been constructed to cover all TDX-usable memory regions, the next step is to pass them to the TDX module together with the global KeyID. Signed-off-by: Kai Huang Reviewed-by: Isaku Yamahata Reviewed-by: Kirill A. Shutemov Reviewed-by: Yuan Yao --- v11 -> v12: - Added Kirill's tag v10 -> v11: - No update v9 -> v10: - Code change due to change static 'tdx_tdmr_list' to local 'tdmr_list'. v8 -> v9: - Improved changlog to explain why initializing TDMRs can take long time (Dave). - Improved comments around 'next-to-initialize' address (Dave). v7 -> v8: (Dave) - Changelog: - explicitly call out this is the last step of TDX module initialization. - Trimed down changelog by removing SEAMCALL name and details. - Removed/trimmed down unnecessary comments. - Other changes due to 'struct tdmr_info_list'. v6 -> v7: - Removed need_resched() check. -- Andi. --- arch/x86/virt/vmx/tdx/tdx.c | 41 ++++++++++++++++++++++++++++++++++++- arch/x86/virt/vmx/tdx/tdx.h | 2 ++ 2 files changed, 42 insertions(+), 1 deletion(-) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 2bcace5cb25c..1992245290de 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -26,6 +26,7 @@ #include #include #include +#include #include #include #include @@ -864,6 +865,39 @@ static int construct_tdmrs(struct list_head *tmb_list, return ret; } +static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid) +{ + u64 *tdmr_pa_array; + size_t array_sz; + int i, ret; + + /* + * TDMRs are passed to the TDX module via an array of physical + * addresses of each TDMR. The array itself also has certain + * alignment requirement. + */ + array_sz = tdmr_list->nr_consumed_tdmrs * sizeof(u64); + array_sz = roundup_pow_of_two(array_sz); + if (array_sz < TDMR_INFO_PA_ARRAY_ALIGNMENT) + array_sz = TDMR_INFO_PA_ARRAY_ALIGNMENT; + + tdmr_pa_array = kzalloc(array_sz, GFP_KERNEL); + if (!tdmr_pa_array) + return -ENOMEM; + + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) + tdmr_pa_array[i] = __pa(tdmr_entry(tdmr_list, i)); + + ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array), + tdmr_list->nr_consumed_tdmrs, + global_keyid, 0, NULL, NULL); + + /* Free the array as it is not required anymore. */ + kfree(tdmr_pa_array); + + return ret; +} + static int init_tdx_module(void) { struct tdsysinfo_struct *sysinfo; @@ -917,16 +951,21 @@ static int init_tdx_module(void) if (ret) goto out_free_tdmrs; + /* Pass the TDMRs and the global KeyID to the TDX module */ + ret = config_tdx_module(&tdmr_list, tdx_global_keyid); + if (ret) + goto out_free_pamts; + /* * TODO: * - * - Configure the TDMRs and the global KeyID to the TDX module. * - Configure the global KeyID on all packages. * - Initialize all TDMRs. * * Return error before all steps are done. */ ret = -EINVAL; +out_free_pamts: if (ret) tdmrs_free_pamt_all(&tdmr_list); else diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h index 9b5a65f37e8b..c386aa3afe2a 100644 --- a/arch/x86/virt/vmx/tdx/tdx.h +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -24,6 +24,7 @@ #define TDH_SYS_INFO 32 #define TDH_SYS_INIT 33 #define TDH_SYS_LP_INIT 35 +#define TDH_SYS_CONFIG 45 struct cmr_info { u64 base; @@ -88,6 +89,7 @@ struct tdmr_reserved_area { } __packed; #define TDMR_INFO_ALIGNMENT 512 +#define TDMR_INFO_PA_ARRAY_ALIGNMENT 512 struct tdmr_info { u64 base; From patchwork Mon Jun 26 14:12:45 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13293013 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8068EEB64DA for ; Mon, 26 Jun 2023 14:16:41 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231418AbjFZOQh (ORCPT ); Mon, 26 Jun 2023 10:16:37 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35636 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230159AbjFZOQH (ORCPT ); Mon, 26 Jun 2023 10:16:07 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AE52D19B7; Mon, 26 Jun 2023 07:15:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687788936; x=1719324936; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=ICV6BLRH+DXa0G/pBwxcns0kdCV9sy111ORM4w+kuUQ=; b=gj0Z7r3k02+WKh8qmLlVqitDb9NcnvrURRISMsTzDgqMQNDnH8YX6SBu F5v15vjoA9n+D9fnUcfMt8d4feVThgJ1g8gqHVVeAQ0lHZ/CmH1fAfVJ4 SnERRxRcA98sO11WNlXTkQ80gwCpnmHnOsPYxujZYFrd9FIl69WCA+GaV rwikTlMeryVf6/v8pEiwqLlxB4bgC+JyEHlRMzVm0GHoaT55AeWlv45IF sneI3X/4PpT0zRNf7kBJM9iYs77MYpLWxx8WEjZcDLHBwi5KByPwSXxOG +Pm00QlfEwhYPVCFcEITbR5XOH4u9bJCCiFh+X91p2dYc+UnNiIVPoZog A==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346033973" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346033973" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:15:16 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292413" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292413" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:15:09 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 15/22] x86/virt/tdx: Configure global KeyID on all packages Date: Tue, 27 Jun 2023 02:12:45 +1200 Message-Id: <0c5a74c4c3aeac57b623c98d456aa059f833ebdd.1687784645.git.kai.huang@intel.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org After the list of TDMRs and the global KeyID are configured to the TDX module, the kernel needs to configure the key of the global KeyID on all packages using TDH.SYS.KEY.CONFIG. This SEAMCALL cannot run parallel on different cpus. Loop all online cpus and use smp_call_on_cpu() to call this SEAMCALL on the first cpu of each package. To keep things simple, this implementation takes no affirmative steps to online cpus to make sure there's at least one cpu for each package. The callers (aka. KVM) can ensure success by ensuring sufficient CPUs are online for this to succeed. Intel hardware doesn't guarantee cache coherency across different KeyIDs. The PAMTs are transitioning from being used by the kernel mapping (KeyId 0) to the TDX module's "global KeyID" mapping. This means that the kernel must flush any dirty KeyID-0 PAMT cachelines before the TDX module uses the global KeyID to access the PAMTs. Otherwise, if those dirty cachelines were written back, they would corrupt the TDX module's metadata. Aside: This corruption would be detected by the memory integrity hardware on the next read of the memory with the global KeyID. The result would likely be fatal to the system but would not impact TDX security. Following the TDX module specification, flush cache before configuring the global KeyID on all packages. Given the PAMT size can be large (~1/256th of system RAM), just use WBINVD on all CPUs to flush. If TDH.SYS.KEY.CONFIG fails, the TDX module may already have used the global KeyID to write the PAMTs. Therefore, use WBINVD to flush cache before returning the PAMTs back to the kernel. Also convert all PAMTs back to normal by using MOVDIR64B as suggested by the TDX module spec, although on the platform without the "partial write machine check" erratum it's OK to leave PAMTs as is. Signed-off-by: Kai Huang Reviewed-by: Isaku Yamahata Reviewed-by: Kirill A. Shutemov Reviewed-by: Yuan Yao --- v11 -> v12: - Added Kirill's tag - Improved changelog (Nikolay) v10 -> v11: - Convert PAMTs back to normal when module initialization fails. - Fixed an error in changelog v9 -> v10: - Changed to use 'smp_call_on_cpu()' directly to do key configuration. v8 -> v9: - Improved changelog (Dave). - Improved comments to explain the function to configure global KeyID "takes no affirmative action to online any cpu". (Dave). - Improved other comments suggested by Dave. v7 -> v8: (Dave) - Changelog changes: - Point out this is the step of "multi-steps" of init_tdx_module(). - Removed MOVDIR64B part. - Other changes due to removing TDH.SYS.SHUTDOWN and TDH.SYS.LP.INIT. - Changed to loop over online cpus and use smp_call_function_single() directly as the patch to shut down TDX module has been removed. - Removed MOVDIR64B part in comment. v6 -> v7: - Improved changelong and comment to explain why MOVDIR64B isn't used when returning PAMTs back to the kernel. --- arch/x86/virt/vmx/tdx/tdx.c | 135 +++++++++++++++++++++++++++++++++++- arch/x86/virt/vmx/tdx/tdx.h | 1 + 2 files changed, 134 insertions(+), 2 deletions(-) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 1992245290de..f5d4dbc11aee 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -31,6 +31,7 @@ #include #include #include +#include #include #include "tdx.h" @@ -577,7 +578,8 @@ static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_base, *pamt_size = pamt_sz; } -static void tdmr_free_pamt(struct tdmr_info *tdmr) +static void tdmr_do_pamt_func(struct tdmr_info *tdmr, + void (*pamt_func)(unsigned long base, unsigned long size)) { unsigned long pamt_base, pamt_size; @@ -590,9 +592,19 @@ static void tdmr_free_pamt(struct tdmr_info *tdmr) if (WARN_ON_ONCE(!pamt_base)) return; + (*pamt_func)(pamt_base, pamt_size); +} + +static void free_pamt(unsigned long pamt_base, unsigned long pamt_size) +{ free_contig_range(pamt_base >> PAGE_SHIFT, pamt_size >> PAGE_SHIFT); } +static void tdmr_free_pamt(struct tdmr_info *tdmr) +{ + tdmr_do_pamt_func(tdmr, free_pamt); +} + static void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list) { int i; @@ -621,6 +633,41 @@ static int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list, return ret; } +/* + * Convert TDX private pages back to normal by using MOVDIR64B to + * clear these pages. Note this function doesn't flush cache of + * these TDX private pages. The caller should make sure of that. + */ +static void reset_tdx_pages(unsigned long base, unsigned long size) +{ + const void *zero_page = (const void *)page_address(ZERO_PAGE(0)); + unsigned long phys, end; + + end = base + size; + for (phys = base; phys < end; phys += 64) + movdir64b(__va(phys), zero_page); + + /* + * MOVDIR64B uses WC protocol. Use memory barrier to + * make sure any later user of these pages sees the + * updated data. + */ + mb(); +} + +static void tdmr_reset_pamt(struct tdmr_info *tdmr) +{ + tdmr_do_pamt_func(tdmr, reset_tdx_pages); +} + +static void tdmrs_reset_pamt_all(struct tdmr_info_list *tdmr_list) +{ + int i; + + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) + tdmr_reset_pamt(tdmr_entry(tdmr_list, i)); +} + static unsigned long tdmrs_count_pamt_kb(struct tdmr_info_list *tdmr_list) { unsigned long pamt_size = 0; @@ -898,6 +945,55 @@ static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid) return ret; } +static int do_global_key_config(void *data) +{ + /* + * TDH.SYS.KEY.CONFIG may fail with entropy error (which is a + * recoverable error). Assume this is exceedingly rare and + * just return error if encountered instead of retrying. + * + * All '0's are just unused parameters. + */ + return seamcall(TDH_SYS_KEY_CONFIG, 0, 0, 0, 0, NULL, NULL); +} + +/* + * Attempt to configure the global KeyID on all physical packages. + * + * This requires running code on at least one CPU in each package. If a + * package has no online CPUs, that code will not run and TDX module + * initialization (TDMR initialization) will fail. + * + * This code takes no affirmative steps to online CPUs. Callers (aka. + * KVM) can ensure success by ensuring sufficient CPUs are online for + * this to succeed. + */ +static int config_global_keyid(void) +{ + cpumask_var_t packages; + int cpu, ret = -EINVAL; + + if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) + return -ENOMEM; + + for_each_online_cpu(cpu) { + if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu), + packages)) + continue; + + /* + * TDH.SYS.KEY.CONFIG cannot run concurrently on + * different cpus, so just do it one by one. + */ + ret = smp_call_on_cpu(cpu, do_global_key_config, NULL, true); + if (ret) + break; + } + + free_cpumask_var(packages); + return ret; +} + static int init_tdx_module(void) { struct tdsysinfo_struct *sysinfo; @@ -956,15 +1052,47 @@ static int init_tdx_module(void) if (ret) goto out_free_pamts; + /* + * Hardware doesn't guarantee cache coherency across different + * KeyIDs. The kernel needs to flush PAMT's dirty cachelines + * (associated with KeyID 0) before the TDX module can use the + * global KeyID to access the PAMT. Given PAMTs are potentially + * large (~1/256th of system RAM), just use WBINVD on all cpus + * to flush the cache. + */ + wbinvd_on_all_cpus(); + + /* Config the key of global KeyID on all packages */ + ret = config_global_keyid(); + if (ret) + goto out_reset_pamts; + /* * TODO: * - * - Configure the global KeyID on all packages. * - Initialize all TDMRs. * * Return error before all steps are done. */ ret = -EINVAL; +out_reset_pamts: + if (ret) { + /* + * Part of PAMTs may already have been initialized by the + * TDX module. Flush cache before returning PAMTs back + * to the kernel. + */ + wbinvd_on_all_cpus(); + /* + * According to the TDX hardware spec, if the platform + * doesn't have the "partial write machine check" + * erratum, any kernel read/write will never cause #MC + * in kernel space, thus it's OK to not convert PAMTs + * back to normal. But do the conversion anyway here + * as suggested by the TDX spec. + */ + tdmrs_reset_pamt_all(&tdmr_list); + } out_free_pamts: if (ret) tdmrs_free_pamt_all(&tdmr_list); @@ -1019,6 +1147,9 @@ static int __tdx_enable(void) * lock to prevent any new cpu from becoming online; 2) done both VMXON * and tdx_cpu_enable() on all online cpus. * + * This function requires there's at least one online cpu for each CPU + * package to succeed. + * * This function can be called in parallel by multiple callers. * * Return 0 if TDX is enabled successfully, otherwise error. diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h index c386aa3afe2a..a0438513bec0 100644 --- a/arch/x86/virt/vmx/tdx/tdx.h +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -21,6 +21,7 @@ /* * TDX module SEAMCALL leaf functions */ +#define TDH_SYS_KEY_CONFIG 31 #define TDH_SYS_INFO 32 #define TDH_SYS_INIT 33 #define TDH_SYS_LP_INIT 35 From patchwork Mon Jun 26 14:12:46 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13293014 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B65CDEB64D7 for ; Mon, 26 Jun 2023 14:16:46 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231365AbjFZOQp (ORCPT ); Mon, 26 Jun 2023 10:16:45 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35214 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231363AbjFZOQT (ORCPT ); Mon, 26 Jun 2023 10:16:19 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AAFA910F4; Mon, 26 Jun 2023 07:15:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687788947; x=1719324947; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=5sc34Yqc1vOr6HaopE/2pjCFoh8PgsgrPLKdaoJpmO4=; b=O2H8PF4Y79YERx5w2xkdoavKvK3ANI2b1INB97GqT4hFV9rrDI/LS4gf hGKlg5QqOCXbY7gL0egbe8zC69F7V1qe6ROGQfjRuVFlsYR5rKHOywn9c iupJveqAnnGCMHoVZ2DBWur/vPPvpy1JOYVppdcleu5KqZnBt19NmtAD8 2olBVFoY6IDIusGyfKZPP5cHu1Mb0KQoQFUM+M9O5eehd+jSZ9jVwuZrs t3YGzAzH2yJ/09hpBJEchvyGa/DUpoHqy1Om3bU81W9ORAig4EAa2iZ8Z XrY6qULiC2z+adCLsjZ3O5zEphXryd+zAB1uNQB4aLOetCYEVfRrW6dzh g==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346034024" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346034024" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:15:22 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292427" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292427" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:15:16 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 16/22] x86/virt/tdx: Initialize all TDMRs Date: Tue, 27 Jun 2023 02:12:46 +1200 Message-Id: X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org After the global KeyID has been configured on all packages, initialize all TDMRs to make all TDX-usable memory regions that are passed to the TDX module become usable. This is the last step of initializing the TDX module. Initializing TDMRs can be time consuming on large memory systems as it involves initializing all metadata entries for all pages that can be used by TDX guests. Initializing different TDMRs can be parallelized. For now to keep it simple, just initialize all TDMRs one by one. It can be enhanced in the future. Signed-off-by: Kai Huang Reviewed-by: Isaku Yamahata Reviewed-by: Kirill A. Shutemov Reviewed-by: Yuan Yao --- v11 -> v12: - Added Kirill's tag v10 -> v11: - No update v9 -> v10: - Code change due to change static 'tdx_tdmr_list' to local 'tdmr_list'. v8 -> v9: - Improved changlog to explain why initializing TDMRs can take long time (Dave). - Improved comments around 'next-to-initialize' address (Dave). v7 -> v8: (Dave) - Changelog: - explicitly call out this is the last step of TDX module initialization. - Trimed down changelog by removing SEAMCALL name and details. - Removed/trimmed down unnecessary comments. - Other changes due to 'struct tdmr_info_list'. v6 -> v7: - Removed need_resched() check. -- Andi. --- arch/x86/virt/vmx/tdx/tdx.c | 60 ++++++++++++++++++++++++++++++++----- arch/x86/virt/vmx/tdx/tdx.h | 1 + 2 files changed, 53 insertions(+), 8 deletions(-) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index f5d4dbc11aee..52b7267ea226 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -994,6 +994,56 @@ static int config_global_keyid(void) return ret; } +static int init_tdmr(struct tdmr_info *tdmr) +{ + u64 next; + + /* + * Initializing a TDMR can be time consuming. To avoid long + * SEAMCALLs, the TDX module may only initialize a part of the + * TDMR in each call. + */ + do { + struct tdx_module_output out; + int ret; + + /* All 0's are unused parameters, they mean nothing. */ + ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0, NULL, + &out); + if (ret) + return ret; + /* + * RDX contains 'next-to-initialize' address if + * TDH.SYS.TDMR.INIT did not fully complete and + * should be retried. + */ + next = out.rdx; + cond_resched(); + /* Keep making SEAMCALLs until the TDMR is done */ + } while (next < tdmr->base + tdmr->size); + + return 0; +} + +static int init_tdmrs(struct tdmr_info_list *tdmr_list) +{ + int i; + + /* + * This operation is costly. It can be parallelized, + * but keep it simple for now. + */ + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) { + int ret; + + ret = init_tdmr(tdmr_entry(tdmr_list, i)); + if (ret) + return ret; + } + + return 0; +} + static int init_tdx_module(void) { struct tdsysinfo_struct *sysinfo; @@ -1067,14 +1117,8 @@ static int init_tdx_module(void) if (ret) goto out_reset_pamts; - /* - * TODO: - * - * - Initialize all TDMRs. - * - * Return error before all steps are done. - */ - ret = -EINVAL; + /* Initialize TDMRs to complete the TDX module initialization */ + ret = init_tdmrs(&tdmr_list); out_reset_pamts: if (ret) { /* diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h index a0438513bec0..f6b4e153890d 100644 --- a/arch/x86/virt/vmx/tdx/tdx.h +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -25,6 +25,7 @@ #define TDH_SYS_INFO 32 #define TDH_SYS_INIT 33 #define TDH_SYS_LP_INIT 35 +#define TDH_SYS_TDMR_INIT 36 #define TDH_SYS_CONFIG 45 struct cmr_info { From patchwork Mon Jun 26 14:12:47 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13293015 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E8F59EB64D7 for ; Mon, 26 Jun 2023 14:16:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231481AbjFZOQz (ORCPT ); Mon, 26 Jun 2023 10:16:55 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35830 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231477AbjFZOQY (ORCPT ); Mon, 26 Jun 2023 10:16:24 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1094410FD; Mon, 26 Jun 2023 07:15:54 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687788954; x=1719324954; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=3C4Jq8CVU11zkfXyA1z2AnjnK1Ws55t2+b6f4x+cwbw=; b=hIIshNM8ukeH+1gkfk8cdtm60CTT+fgedi+UeqB99OWI4MfFbYahUm8H 5y9p9paxDX9nA+5LzbLLpo9KLriiSfQBzem85YICqujWdkTr4taOG0v5S oM4LGgKgP/csVhVhaWmz0by6hDQq7cN1nCkPFuClKiXFA9h6s8ENGuThQ tBUVAruAG+Y6yYcNK1PboNCgT8hzC2/5itUeGLK7jFVzM2EBfcKDcI6zd k4SDzZZLvEmxKF0a6Q8jUnpJ3y1V2Ok5eUi0UkZcTddlBrkhHD4tVR0Uk aF36jpVMADExBiZmiZe9FMkOAdQb5vnq/15EOC/jlywg+rFqCAEcBoPcx Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346034078" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346034078" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:15:29 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292445" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292445" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:15:23 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 17/22] x86/kexec: Flush cache of TDX private memory Date: Tue, 27 Jun 2023 02:12:47 +1200 Message-Id: <92fae16d43128e7196f04db5ed71935f3e5ea32e.1687784645.git.kai.huang@intel.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org There are two problems in terms of using kexec() to boot to a new kernel when the old kernel has enabled TDX: 1) Part of the memory pages are still TDX private pages; 2) There might be dirty cachelines associated with TDX private pages. The first problem doesn't matter on the platforms w/o the "partial write machine check" erratum. KeyID 0 doesn't have integrity check. If the new kernel wants to use any non-zero KeyID, it needs to convert the memory to that KeyID and such conversion would work from any KeyID. However the old kernel needs to guarantee there's no dirty cacheline left behind before booting to the new kernel to avoid silent corruption from later cacheline writeback (Intel hardware doesn't guarantee cache coherency across different KeyIDs). There are two things that the old kernel needs to do to achieve that: 1) Stop accessing TDX private memory mappings: a. Stop making TDX module SEAMCALLs (TDX global KeyID); b. Stop TDX guests from running (per-guest TDX KeyID). 2) Flush any cachelines from previous TDX private KeyID writes. For 2), use wbinvd() to flush cache in stop_this_cpu(), following SME support. And in this way 1) happens for free as there's no TDX activity between wbinvd() and the native_halt(). Flushing cache in stop_this_cpu() only flushes cache on remote cpus. On the rebooting cpu which does kexec(), unlike SME which does the cache flush in relocate_kernel(), flush the cache right after stopping remote cpus in machine_shutdown(). There are two reasons to do so: 1) For TDX there's no need to defer cache flush to relocate_kernel() because all TDX activities have been stopped. 2) On the platforms with the above erratum the kernel must convert all TDX private pages back to normal before booting to the new kernel in kexec(), and flushing cache early allows the kernel to convert memory early rather than having to muck with the relocate_kernel() assembly. Theoretically, cache flush is only needed when the TDX module has been initialized. However initializing the TDX module is done on demand at runtime, and it takes a mutex to read the module status. Just check whether TDX is enabled by the BIOS instead to flush cache. Signed-off-by: Kai Huang Reviewed-by: Isaku Yamahata Reviewed-by: Kirill A. Shutemov --- v11 -> v12: - Changed comment/changelog to say kernel doesn't try to handle fast warm reset but depends on BIOS to enable workaround (Kirill) - Added Kirill's tag v10 -> v11: - Fixed a bug that cache for rebooting cpu isn't flushed for TDX private memory. - Updated changelog accordingly. v9 -> v10: - No change. v8 -> v9: - Various changelog enhancement and fix (Dave). - Improved comment (Dave). v7 -> v8: - Changelog: - Removed "leave TDX module open" part due to shut down patch has been removed. v6 -> v7: - Improved changelog to explain why don't convert TDX private pages back to normal. --- arch/x86/kernel/process.c | 7 ++++++- arch/x86/kernel/reboot.c | 15 +++++++++++++++ 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index dac41a0072ea..0ce66deb9bc8 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -780,8 +780,13 @@ void __noreturn stop_this_cpu(void *dummy) * * Test the CPUID bit directly because the machine might've cleared * X86_FEATURE_SME due to cmdline options. + * + * The TDX module or guests might have left dirty cachelines + * behind. Flush them to avoid corruption from later writeback. + * Note that this flushes on all systems where TDX is possible, + * but does not actually check that TDX was in use. */ - if (cpuid_eax(0x8000001f) & BIT(0)) + if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled()) native_wbinvd(); for (;;) { /* diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c index 3adbe97015c1..ae7480a213a6 100644 --- a/arch/x86/kernel/reboot.c +++ b/arch/x86/kernel/reboot.c @@ -32,6 +32,7 @@ #include #include #include +#include /* * Power off function, if any @@ -695,6 +696,20 @@ void native_machine_shutdown(void) local_irq_disable(); stop_other_cpus(); #endif + /* + * stop_other_cpus() has flushed all dirty cachelines of TDX + * private memory on remote cpus. Unlike SME, which does the + * cache flush on _this_ cpu in the relocate_kernel(), flush + * the cache for _this_ cpu here. This is because on the + * platforms with "partial write machine check" erratum the + * kernel needs to convert all TDX private pages back to normal + * before booting to the new kernel in kexec(), and the cache + * flush must be done before that. If the kernel took SME's way, + * it would have to muck with the relocate_kernel() assembly to + * do memory conversion. + */ + if (platform_tdx_enabled()) + native_wbinvd(); lapic_shutdown(); restore_boot_irq_mode(); From patchwork Mon Jun 26 14:12:48 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13293017 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7DC52EB64D7 for ; Mon, 26 Jun 2023 14:17:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231492AbjFZORM (ORCPT ); Mon, 26 Jun 2023 10:17:12 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35192 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230224AbjFZOQm (ORCPT ); Mon, 26 Jun 2023 10:16:42 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6103910E2; Mon, 26 Jun 2023 07:16:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687788966; x=1719324966; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=MdgIAiAKsm5b7IZW1m4wCtAmfDlp9uu+ITFZtSJwHDU=; b=cqUMFl7uHmUzQPMTGgfSgEcm2VUKnic47G1hn2TbhzE5IN4RXtUBgXuI yGhvMwK7QXrcTj2jfOvktNg9Spls0qONNpT4onTyuiaWFlekCaLAaOPtJ Z71YzYMeCZJVd0ErDodGMwQjpBdrcWhb0ZzNIPBrCXA6V9l5FPm+RLvZH /264fRFSmovfvXsg2PHLqUBCKxXxFbF14P3Pdvpmau4YfciN2+Dl73Qfp p3WXqovmu7q3kyomLudc7DPSGxnSxJUKLsiQqIoMXqZHhM3qgSmU/w98j A3WHrj/16HpPxQyc1Mj7A2QAGnKI5e5rOSambnZ635OCleHk1WZdmo42a Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346034130" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346034130" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:15:36 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292458" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292458" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:15:29 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 18/22] x86/virt/tdx: Keep TDMRs when module initialization is successful Date: Tue, 27 Jun 2023 02:12:48 +1200 Message-Id: <7d06fe5fda0e330895c1c9043b881f3c2a2d4f3f.1687784645.git.kai.huang@intel.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org On the platforms with the "partial write machine check" erratum, the kexec() needs to convert all TDX private pages back to normal before booting to the new kernel. Otherwise, the new kernel may get unexpected machine check. There's no existing infrastructure to track TDX private pages. Change to keep TDMRs when module initialization is successful so that they can be used to find PAMTs. With this change, only put_online_mems() and freeing the buffer of the TDSYSINFO_STRUCT and CMR array still need to be done even when module initialization is successful. Adjust the error handling to explicitly do them when module initialization is successful and unconditionally clean up the rest when initialization fails. Signed-off-by: Kai Huang --- v11 -> v12 (new patch): - Defer keeping TDMRs logic to this patch for better review - Improved error handling logic (Nikolay/Kirill in patch 15) --- arch/x86/virt/vmx/tdx/tdx.c | 84 ++++++++++++++++++------------------- 1 file changed, 42 insertions(+), 42 deletions(-) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 52b7267ea226..85b24b2e9417 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -49,6 +49,8 @@ static DEFINE_MUTEX(tdx_module_lock); /* All TDX-usable memory regions. Protected by mem_hotplug_lock. */ static LIST_HEAD(tdx_memlist); +static struct tdmr_info_list tdx_tdmr_list; + /* * Wrapper of __seamcall() to convert SEAMCALL leaf function error code * to kernel error code. @seamcall_ret and @out contain the SEAMCALL @@ -1047,7 +1049,6 @@ static int init_tdmrs(struct tdmr_info_list *tdmr_list) static int init_tdx_module(void) { struct tdsysinfo_struct *sysinfo; - struct tdmr_info_list tdmr_list; struct cmr_info *cmr_array; int ret; @@ -1088,17 +1089,17 @@ static int init_tdx_module(void) goto out_put_tdxmem; /* Allocate enough space for constructing TDMRs */ - ret = alloc_tdmr_list(&tdmr_list, sysinfo); + ret = alloc_tdmr_list(&tdx_tdmr_list, sysinfo); if (ret) goto out_free_tdxmem; /* Cover all TDX-usable memory regions in TDMRs */ - ret = construct_tdmrs(&tdx_memlist, &tdmr_list, sysinfo); + ret = construct_tdmrs(&tdx_memlist, &tdx_tdmr_list, sysinfo); if (ret) goto out_free_tdmrs; /* Pass the TDMRs and the global KeyID to the TDX module */ - ret = config_tdx_module(&tdmr_list, tdx_global_keyid); + ret = config_tdx_module(&tdx_tdmr_list, tdx_global_keyid); if (ret) goto out_free_pamts; @@ -1118,51 +1119,50 @@ static int init_tdx_module(void) goto out_reset_pamts; /* Initialize TDMRs to complete the TDX module initialization */ - ret = init_tdmrs(&tdmr_list); + ret = init_tdmrs(&tdx_tdmr_list); + if (ret) + goto out_reset_pamts; + + pr_info("%lu KBs allocated for PAMT.\n", + tdmrs_count_pamt_kb(&tdx_tdmr_list)); + + /* + * @tdx_memlist is written here and read at memory hotplug time. + * Lock out memory hotplug code while building it. + */ + put_online_mems(); + /* + * For now both @sysinfo and @cmr_array are only used during + * module initialization, so always free them. + */ + free_page((unsigned long)sysinfo); + + return 0; out_reset_pamts: - if (ret) { - /* - * Part of PAMTs may already have been initialized by the - * TDX module. Flush cache before returning PAMTs back - * to the kernel. - */ - wbinvd_on_all_cpus(); - /* - * According to the TDX hardware spec, if the platform - * doesn't have the "partial write machine check" - * erratum, any kernel read/write will never cause #MC - * in kernel space, thus it's OK to not convert PAMTs - * back to normal. But do the conversion anyway here - * as suggested by the TDX spec. - */ - tdmrs_reset_pamt_all(&tdmr_list); - } + /* + * Part of PAMTs may already have been initialized by the + * TDX module. Flush cache before returning PAMTs back + * to the kernel. + */ + wbinvd_on_all_cpus(); + /* + * According to the TDX hardware spec, if the platform + * doesn't have the "partial write machine check" + * erratum, any kernel read/write will never cause #MC + * in kernel space, thus it's OK to not convert PAMTs + * back to normal. But do the conversion anyway here + * as suggested by the TDX spec. + */ + tdmrs_reset_pamt_all(&tdx_tdmr_list); out_free_pamts: - if (ret) - tdmrs_free_pamt_all(&tdmr_list); - else - pr_info("%lu KBs allocated for PAMT.\n", - tdmrs_count_pamt_kb(&tdmr_list)); + tdmrs_free_pamt_all(&tdx_tdmr_list); out_free_tdmrs: - /* - * Always free the buffer of TDMRs as they are only used during - * module initialization. - */ - free_tdmr_list(&tdmr_list); + free_tdmr_list(&tdx_tdmr_list); out_free_tdxmem: - if (ret) - free_tdx_memlist(&tdx_memlist); + free_tdx_memlist(&tdx_memlist); out_put_tdxmem: - /* - * @tdx_memlist is written here and read at memory hotplug time. - * Lock out memory hotplug code while building it. - */ put_online_mems(); out: - /* - * For now both @sysinfo and @cmr_array are only used during - * module initialization, so always free them. - */ free_page((unsigned long)sysinfo); return ret; } From patchwork Mon Jun 26 14:12:49 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13293016 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id EF09BEB64DA for ; Mon, 26 Jun 2023 14:17:16 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231495AbjFZORP (ORCPT ); Mon, 26 Jun 2023 10:17:15 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35814 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231386AbjFZOQv (ORCPT ); Mon, 26 Jun 2023 10:16:51 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8C161269E; Mon, 26 Jun 2023 07:16:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687788971; x=1719324971; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=khU2wAmAb3bU/YWKOhJwkRKb7EYyfaX7rDa+3NEfufs=; b=Y0/EAcgMIav5fksTsCgKPyY4MwlnY7DlDEqSvbaVDGhCFaVLMHiCvAvL AolUluNHWgmGDQpL43FeSyROo16HeShjyfV8aeBh260Ttd2zISHVl4tyY eFOZkraQKk0jxTrWs40OQOicURQaPx2PYse3SdvlEaZY6Rr0j0GjZuLug rREqJJfzkQzxQLUGL1S1V8GKWvH8MW1CTaReXKUn41eZjVoMm8knOezIt w/hjIyD+pVkl9u/FRR3A2r3qtuW1f0mvBN4NgwBadi59vxi32XLKjl5Ms 1i1d+sc+ya4AkAZibMYxS+qpb/sutHzRA/ywzIZglru7OJi3vERyE8llJ g==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346034172" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346034172" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:15:43 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292473" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292473" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:15:36 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 19/22] x86/kexec(): Reset TDX private memory on platforms with TDX erratum Date: Tue, 27 Jun 2023 02:12:49 +1200 Message-Id: <28aece770321e307d58df77eddee2d3fa851d15a.1687784645.git.kai.huang@intel.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org The first few generations of TDX hardware have an erratum. A partial write to a TDX private memory cacheline will silently "poison" the line. Subsequent reads will consume the poison and generate a machine check. According to the TDX hardware spec, neither of these things should have happened. == Background == Virtually all kernel memory accesses operations happen in full cachelines. In practice, writing a "byte" of memory usually reads a 64 byte cacheline of memory, modifies it, then writes the whole line back. Those operations do not trigger this problem. This problem is triggered by "partial" writes where a write transaction of less than cacheline lands at the memory controller. The CPU does these via non-temporal write instructions (like MOVNTI), or through UC/WC memory mappings. The issue can also be triggered away from the CPU by devices doing partial writes via DMA. == Problem == A fast warm reset doesn't reset TDX private memory. Kexec() can also boot into the new kernel directly. Thus if the old kernel has enabled TDX on the platform with this erratum, the new kernel may get unexpected machine check. Note that w/o this erratum any kernel read/write on TDX private memory should never cause machine check, thus it's OK for the old kernel to leave TDX private pages as is. == Solution == In short, with this erratum, the kernel needs to explicitly convert all TDX private pages back to normal to give the new kernel a clean slate after kexec(). The BIOS is also expected to disable fast warm reset as a workaround to this erratum, thus this implementation doesn't try to reset TDX private memory for the reboot case in the kernel but depend on the BIOS to enable the workaround. For now TDX private memory can only be PAMT pages. It would be ideal to cover all types of TDX private memory here (TDX guest private pages and Secure-EPT pages are yet to be implemented when TDX gets supported in KVM), but there's no existing infrastructure to track TDX private pages. It's not feasible to query the TDX module about page type either because VMX has already been stopped when KVM receives the reboot notifier. Another option is to blindly convert all memory pages. But this may bring non-trivial latency to kexec() on large memory systems (especially when the number of TDX private pages is small). Thus even with this temporary solution, eventually it's better for the kernel to only reset TDX private pages. Also, it's problematic to convert all memory pages because not all pages are mapped as writable in the direct-mapping. The kernel needs to switch to another page table which maps all pages as writable (e.g., the identical-mapping table for kexec(), or a new page table) to do so, but this looks overkill. Therefore, rather than doing something dramatic, only reset PAMT pages for now. Do it in machine_kexec() to avoid additional overhead to the machine reboot/shutdown as the kernel depends on the BIOS to disable fast warm reset as a workaround for the reboot case. Signed-off-by: Kai Huang --- v11 -> v12: - Changed comment/changelog to say kernel doesn't try to handle fast warm reset but depends on BIOS to enable workaround (Kirill) - Added a new tdx_may_has_private_mem to indicate system may have TDX private memory and PAMTs/TDMRs are stable to access. (Dave). - Use atomic_t for tdx_may_has_private_mem for build-in memory barrier (Dave) - Changed calling x86_platform.memory_shutdown() to calling tdx_reset_memory() directly from machine_kexec() to avoid overhead to normal reboot case. v10 -> v11: - New patch --- arch/x86/include/asm/tdx.h | 2 + arch/x86/kernel/machine_kexec_64.c | 9 ++++ arch/x86/virt/vmx/tdx/tdx.c | 79 ++++++++++++++++++++++++++++++ 3 files changed, 90 insertions(+) diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 91416fd600cd..e95c9fbf52e4 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -100,10 +100,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, bool platform_tdx_enabled(void); int tdx_cpu_enable(void); int tdx_enable(void); +void tdx_reset_memory(void); #else /* !CONFIG_INTEL_TDX_HOST */ static inline bool platform_tdx_enabled(void) { return false; } static inline int tdx_cpu_enable(void) { return -ENODEV; } static inline int tdx_enable(void) { return -ENODEV; } +static inline void tdx_reset_memory(void) { } #endif /* CONFIG_INTEL_TDX_HOST */ #endif /* !__ASSEMBLY__ */ diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c index 1a3e2c05a8a5..232253bd7ccd 100644 --- a/arch/x86/kernel/machine_kexec_64.c +++ b/arch/x86/kernel/machine_kexec_64.c @@ -28,6 +28,7 @@ #include #include #include +#include #ifdef CONFIG_ACPI /* @@ -301,6 +302,14 @@ void machine_kexec(struct kimage *image) void *control_page; int save_ftrace_enabled; + /* + * On the platform with "partial write machine check" erratum, + * all TDX private pages need to be converted back to normal + * before booting to the new kernel, otherwise the new kernel + * may get unexpected machine check. + */ + tdx_reset_memory(); + #ifdef CONFIG_KEXEC_JUMP if (image->preserve_context) save_processor_state(); diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 85b24b2e9417..1107f4227568 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -51,6 +51,8 @@ static LIST_HEAD(tdx_memlist); static struct tdmr_info_list tdx_tdmr_list; +static atomic_t tdx_may_has_private_mem; + /* * Wrapper of __seamcall() to convert SEAMCALL leaf function error code * to kernel error code. @seamcall_ret and @out contain the SEAMCALL @@ -1113,6 +1115,17 @@ static int init_tdx_module(void) */ wbinvd_on_all_cpus(); + /* + * Starting from this point the system may have TDX private + * memory. Make it globally visible so tdx_reset_memory() only + * reads TDMRs/PAMTs when they are stable. + * + * Note using atomic_inc_return() to provide the explicit memory + * ordering isn't mandatory here as the WBINVD above already + * does that. Compiler barrier isn't needed here either. + */ + atomic_inc_return(&tdx_may_has_private_mem); + /* Config the key of global KeyID on all packages */ ret = config_global_keyid(); if (ret) @@ -1154,6 +1167,15 @@ static int init_tdx_module(void) * as suggested by the TDX spec. */ tdmrs_reset_pamt_all(&tdx_tdmr_list); + /* + * No more TDX private pages now, and PAMTs/TDMRs are + * going to be freed. Make this globally visible so + * tdx_reset_memory() can read stable TDMRs/PAMTs. + * + * Note atomic_dec_return(), which is an atomic RMW with + * return value, always enforces the memory barrier. + */ + atomic_dec_return(&tdx_may_has_private_mem); out_free_pamts: tdmrs_free_pamt_all(&tdx_tdmr_list); out_free_tdmrs: @@ -1229,6 +1251,63 @@ int tdx_enable(void) } EXPORT_SYMBOL_GPL(tdx_enable); +/* + * Convert TDX private pages back to normal on platforms with + * "partial write machine check" erratum. + * + * Called from machine_kexec() before booting to the new kernel. + */ +void tdx_reset_memory(void) +{ + if (!platform_tdx_enabled()) + return; + + /* + * Kernel read/write to TDX private memory doesn't + * cause machine check on hardware w/o this erratum. + */ + if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) + return; + + /* Called from kexec() when only rebooting cpu is alive */ + WARN_ON_ONCE(num_online_cpus() != 1); + + if (!atomic_read(&tdx_may_has_private_mem)) + return; + + /* + * Ideally it's better to cover all types of TDX private pages, + * but there's no existing infrastructure to tell whether a page + * is TDX private memory or not. Using SEAMCALL to query TDX + * module isn't feasible either because: 1) VMX has been turned + * off by reaching here so SEAMCALL cannot be made; 2) Even + * SEAMCALL can be made the result from TDX module may not be + * accurate (e.g., remote CPU can be stopped while the kernel + * is in the middle of reclaiming one TDX private page and doing + * MOVDIR64B). + * + * One solution could be just converting all memory pages, but + * this may bring non-trivial latency on large memory systems + * (especially when the number of TDX private pages is small). + * So even with this temporary solution, eventually the kernel + * should only convert TDX private pages. + * + * Also, not all pages are mapped as writable in direct mapping, + * thus it's problematic to do so. It can be done by switching + * to the identical mapping table for kexec() or a new page table + * which maps all pages as writable, but the complexity looks + * overkill. + * + * Thus instead of doing something dramatic to convert all pages, + * only convert PAMTs as for now TDX private pages can only be + * PAMT. + * + * All other cpus are already dead. TDMRs/PAMTs are stable when + * @tdx_may_has_private_mem reads true. + */ + tdmrs_reset_pamt_all(&tdx_tdmr_list); +} + static int __init record_keyid_partitioning(u32 *tdx_keyid_start, u32 *nr_tdx_keyids) { From patchwork Mon Jun 26 14:12:50 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13293018 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 85479EB64D7 for ; Mon, 26 Jun 2023 14:17:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231255AbjFZOR2 (ORCPT ); Mon, 26 Jun 2023 10:17:28 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35038 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230004AbjFZORE (ORCPT ); Mon, 26 Jun 2023 10:17:04 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AA6841984; Mon, 26 Jun 2023 07:16:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687788977; x=1719324977; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=aUMKWf1P6Io805CmABy5YQpEPitiSlwsBehX9i/r0CA=; b=R7KsmDVarrNYWv+4+A0iJd0Vz4AC0qOED51++5JTqrbTKtyfYlRIXCYL rbBMqzv0CgGCoWxqGPdZcGGQ7yHp/JR7bnrHtrByABvp7qeTKpSz+eydI q5U5hLyVhEvLf0Io+LsqxArP9oqK3hfRaKByOfLdbYd9f9DzB74Vbm2Nd mfranaNYFFutKUGxEGC9xIsMK+Igj5cwFCAOMv2dWMjw7KgpAhPxtKJ1Z ZUuW0NPdxuoza6nTrDaVtG74eYkerqgOGnK8ZIg5akCdKJA09+yXST9+T MZuBTW5gWsDGdMrGT4oP8aw5Kp17+H8FZjwDjS60G84JCvWm9WSKeiYjG A==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346034210" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346034210" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:15:50 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292482" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292482" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:15:43 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 20/22] x86/virt/tdx: Allow SEAMCALL to handle #UD and #GP Date: Tue, 27 Jun 2023 02:12:50 +1200 Message-Id: X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org On the platform with the "partial write machine check" erratum, a kernel partial write to TDX private memory may cause unexpected machine check. It would be nice if the #MC handler could print additional information to show the #MC was TDX private memory error due to possible kernel bug. To do that, the machine check handler needs to use SEAMCALL to query page type of the error memory from the TDX module, because there's no existing infrastructure to track TDX private pages. SEAMCALL instruction causes #UD if CPU isn't in VMX operation. In #MC handler, it is legal that CPU isn't in VMX operation when making this SEAMCALL. Extend the TDX_MODULE_CALL macro to handle #UD so the SEAMCALL can return error code instead of Oops in the #MC handler. Opportunistically handles #GP too since they share the same code. A bonus is when kernel mistakenly calls SEAMCALL when CPU isn't in VMX operation, or when TDX isn't enabled by the BIOS, or when the BIOS is buggy, the kernel can get a nicer error message rather than a less understandable Oops. Signed-off-by: Kai Huang Reviewed-by: Kirill A. Shutemov --- v11 -> v12 (new patch): - Splitted out from "SEAMCALL infrastructure" patch for better review. - Provide justification in changelog (Dave/David) --- arch/x86/include/asm/tdx.h | 5 +++++ arch/x86/virt/vmx/tdx/tdx.c | 7 +++++++ arch/x86/virt/vmx/tdx/tdxcall.S | 19 +++++++++++++++++-- 3 files changed, 29 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index e95c9fbf52e4..8d3f85bcccc1 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -8,6 +8,8 @@ #include #include +#include + /* * SW-defined error codes. * @@ -18,6 +20,9 @@ #define TDX_SW_ERROR (TDX_ERROR | GENMASK_ULL(47, 40)) #define TDX_SEAMCALL_VMFAILINVALID (TDX_SW_ERROR | _UL(0xFFFF0000)) +#define TDX_SEAMCALL_GP (TDX_SW_ERROR | X86_TRAP_GP) +#define TDX_SEAMCALL_UD (TDX_SW_ERROR | X86_TRAP_UD) + #ifndef __ASSEMBLY__ /* TDX supported page sizes from the TDX module ABI. */ diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 1107f4227568..eba7ff91206d 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -93,6 +93,13 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, case TDX_SEAMCALL_VMFAILINVALID: pr_err_once("module is not loaded.\n"); return -ENODEV; + case TDX_SEAMCALL_GP: + pr_err_once("not enabled by BIOS.\n"); + return -ENODEV; + case TDX_SEAMCALL_UD: + pr_err_once("SEAMCALL failed: CPU %d is not in VMX operation.\n", + cpu); + return -EINVAL; default: pr_err_once("SEAMCALL failed: CPU %d: leaf %llu, error 0x%llx.\n", cpu, fn, sret); diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S index 49a54356ae99..757b0c34be10 100644 --- a/arch/x86/virt/vmx/tdx/tdxcall.S +++ b/arch/x86/virt/vmx/tdx/tdxcall.S @@ -1,6 +1,7 @@ /* SPDX-License-Identifier: GPL-2.0 */ #include #include +#include /* * TDCALL and SEAMCALL are supported in Binutils >= 2.36. @@ -45,6 +46,7 @@ /* Leave input param 2 in RDX */ .if \host +1: seamcall /* * SEAMCALL instruction is essentially a VMExit from VMX root @@ -57,10 +59,23 @@ * This value will never be used as actual SEAMCALL error code as * it is from the Reserved status code class. */ - jnc .Lno_vmfailinvalid + jnc .Lseamcall_out mov $TDX_SEAMCALL_VMFAILINVALID, %rax -.Lno_vmfailinvalid: + jmp .Lseamcall_out +2: + /* + * SEAMCALL caused #GP or #UD. By reaching here %eax contains + * the trap number. Convert the trap number to the TDX error + * code by setting TDX_SW_ERROR to the high 32-bits of %rax. + * + * Note cannot OR TDX_SW_ERROR directly to %rax as OR instruction + * only accepts 32-bit immediate at most. + */ + mov $TDX_SW_ERROR, %r12 + orq %r12, %rax + _ASM_EXTABLE_FAULT(1b, 2b) +.Lseamcall_out: .else tdcall .endif From patchwork Mon Jun 26 14:12:51 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13293019 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3854CEB64DA for ; Mon, 26 Jun 2023 14:17:39 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231367AbjFZORi (ORCPT ); Mon, 26 Jun 2023 10:17:38 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35434 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229650AbjFZORN (ORCPT ); Mon, 26 Jun 2023 10:17:13 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7314E173A; Mon, 26 Jun 2023 07:16:33 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687788993; x=1719324993; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=mhkI8UzNGPmQs/rXIZVDOhskAdCoKLc9FiiZ4Bv6iLc=; b=BMrKsD8APHAo52X/N7ZJR4SHuptjgj5UHTWTphlTg9ijZ0V7WtGX2LzT tTAG4RH3CpSvnvoinLZaj3MwBuF4DJTTHdCbt2O8H4OsU4iU2ENUIo9k5 6BM8ifYlXtNIrcYo8P8rJwF5zmOOFws//zQPuWJ9xqbdNP77qakMUQqAw SEH1pS6mlU+UmjEuWDLJX0TCpP8RRKFTxXngjVUtYU3aBIcEemSjNF5Gh csox3nK3rKRtyEJg6Wdu82AL6dKZ9hpucXR07GSJhPSTa68LpIbF/L3HG Hx9ZJ3lehcGH2vd7tNwPH8Za4xaKNmiHneS7cu93jm2UJhNNpgb2wzG78 w==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346034255" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346034255" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:15:56 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292497" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292497" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:15:50 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 21/22] x86/mce: Improve error log of kernel space TDX #MC due to erratum Date: Tue, 27 Jun 2023 02:12:51 +1200 Message-Id: X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org The first few generations of TDX hardware have an erratum. Triggering it in Linux requires some kind of kernel bug involving relatively exotic memory writes to TDX private memory and will manifest via spurious-looking machine checks when reading the affected memory. == Background == Virtually all kernel memory accesses operations happen in full cachelines. In practice, writing a "byte" of memory usually reads a 64 byte cacheline of memory, modifies it, then writes the whole line back. Those operations do not trigger this problem. This problem is triggered by "partial" writes where a write transaction of less than cacheline lands at the memory controller. The CPU does these via non-temporal write instructions (like MOVNTI), or through UC/WC memory mappings. The issue can also be triggered away from the CPU by devices doing partial writes via DMA. == Problem == A partial write to a TDX private memory cacheline will silently "poison" the line. Subsequent reads will consume the poison and generate a machine check. According to the TDX hardware spec, neither of these things should have happened. To add insult to injury, the Linux machine code will present these as a literal "Hardware error" when they were, in fact, a software-triggered issue. == Solution == In the end, this issue is hard to trigger. Rather than do something rash (and incomplete) like unmap TDX private memory from the direct map, improve the machine check handler. Currently, the #MC handler doesn't distinguish whether the memory is TDX private memory or not but just dump, for instance, below message: [...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134 [...] mce: [Hardware Error]: RIP 10: {__tlb_remove_page_size+0x10/0xa0} ... [...] mce: [Hardware Error]: Run the above through 'mcelog --ascii' [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel [...] Kernel panic - not syncing: Fatal local machine check Which says "Hardware Error" and "Data load in unrecoverable area of kernel". Ideally, it's better for the log to say "software bug around TDX private memory" instead of "Hardware Error". But in reality the real hardware memory error can happen, and sadly such software-triggered #MC cannot be distinguished from the real hardware error. Also, the error message is used by userspace tool 'mcelog' to parse, so changing the output may break userspace. So keep the "Hardware Error". The "Data load in unrecoverable area of kernel" is also helpful, so keep it too. Instead of modifying above error log, improve the error log by printing additional TDX related message to make the log like: ... [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel [...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug. Adding this additional message requires determination of whether the memory page is TDX private memory. There is no existing infrastructure to do that. Add an interface to query the TDX module to fill this gap. == Impact == This issue requires some kind of kernel bug to trigger. TDX private memory should never be mapped UC/WC. A partial write originating from these mappings would require *two* bugs, first mapping the wrong page, then writing the wrong memory. It would also be detectable using traditional memory corruption techniques like DEBUG_PAGEALLOC. MOVNTI (and friends) could cause this issue with something like a simple buffer overrun or use-after-free on the direct map. It should also be detectable with normal debug techniques. The one place where this might get nasty would be if the CPU read data then wrote back the same data. That would trigger this problem but would not, for instance, set off mechanisms like slab redzoning because it doesn't actually corrupt data. With an IOMMU at least, the DMA exposure is similar to the UC/WC issue. TDX private memory would first need to be incorrectly mapped into the I/O space and then a later DMA to that mapping would actually cause the poisoning event. Signed-off-by: Kai Huang Reviewed-by: Kirill A. Shutemov Reviewed-by: Yuan Yao --- v11 -> v12: - Simplified #MC message (Dave/Kirill) - Slightly improved some comments. v10 -> v11: - New patch --- arch/x86/include/asm/tdx.h | 2 + arch/x86/kernel/cpu/mce/core.c | 33 +++++++++++ arch/x86/virt/vmx/tdx/tdx.c | 102 +++++++++++++++++++++++++++++++++ arch/x86/virt/vmx/tdx/tdx.h | 5 ++ 4 files changed, 142 insertions(+) diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 8d3f85bcccc1..a697b359d8c6 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -106,11 +106,13 @@ bool platform_tdx_enabled(void); int tdx_cpu_enable(void); int tdx_enable(void); void tdx_reset_memory(void); +bool tdx_is_private_mem(unsigned long phys); #else /* !CONFIG_INTEL_TDX_HOST */ static inline bool platform_tdx_enabled(void) { return false; } static inline int tdx_cpu_enable(void) { return -ENODEV; } static inline int tdx_enable(void) { return -ENODEV; } static inline void tdx_reset_memory(void) { } +static inline bool tdx_is_private_mem(unsigned long phys) { return false; } #endif /* CONFIG_INTEL_TDX_HOST */ #endif /* !__ASSEMBLY__ */ diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 2eec60f50057..f71b649f4c82 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -52,6 +52,7 @@ #include #include #include +#include #include "internal.h" @@ -228,11 +229,34 @@ static void wait_for_panic(void) panic("Panicing machine check CPU died"); } +static const char *mce_memory_info(struct mce *m) +{ + if (!m || !mce_is_memory_error(m) || !mce_usable_address(m)) + return NULL; + + /* + * Certain initial generations of TDX-capable CPUs have an + * erratum. A kernel non-temporal partial write to TDX private + * memory poisons that memory, and a subsequent read of that + * memory triggers #MC. + * + * However such #MC caused by software cannot be distinguished + * from the real hardware #MC. Just print additional message + * to show such #MC may be result of the CPU erratum. + */ + if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) + return NULL; + + return !tdx_is_private_mem(m->addr) ? NULL : + "TDX private memory error. Possible kernel bug."; +} + static noinstr void mce_panic(const char *msg, struct mce *final, char *exp) { struct llist_node *pending; struct mce_evt_llist *l; int apei_err = 0; + const char *memmsg; /* * Allow instrumentation around external facilities usage. Not that it @@ -283,6 +307,15 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp) } if (exp) pr_emerg(HW_ERR "Machine check: %s\n", exp); + /* + * Confidential computing platforms such as TDX platforms + * may occur MCE due to incorrect access to confidential + * memory. Print additional information for such error. + */ + memmsg = mce_memory_info(final); + if (memmsg) + pr_emerg(HW_ERR "Machine check: %s\n", memmsg); + if (!fake_panic) { if (panic_timeout == 0) panic_timeout = mca_cfg.panic_timeout; diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index eba7ff91206d..5f96c2d866e5 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -1315,6 +1315,108 @@ void tdx_reset_memory(void) tdmrs_reset_pamt_all(&tdx_tdmr_list); } +static bool is_pamt_page(unsigned long phys) +{ + struct tdmr_info_list *tdmr_list = &tdx_tdmr_list; + int i; + + /* + * This function is called from #MC handler, and theoretically + * it could run in parallel with the TDX module initialization + * on other logical cpus. But it's not OK to hold mutex here + * so just blindly check module status to make sure PAMTs/TDMRs + * are stable to access. + * + * This may return inaccurate result in rare cases, e.g., when + * #MC happens on a PAMT page during module initialization, but + * this is fine as #MC handler doesn't need a 100% accurate + * result. + */ + if (tdx_module_status != TDX_MODULE_INITIALIZED) + return false; + + for (i = 0; i < tdmr_list->nr_consumed_tdmrs; i++) { + unsigned long base, size; + + tdmr_get_pamt(tdmr_entry(tdmr_list, i), &base, &size); + + if (phys >= base && phys < (base + size)) + return true; + } + + return false; +} + +/* + * Return whether the memory page at the given physical address is TDX + * private memory or not. Called from #MC handler do_machine_check(). + * + * Note this function may not return an accurate result in rare cases. + * This is fine as the #MC handler doesn't need a 100% accurate result, + * because it cannot distinguish #MC between software bug and real + * hardware error anyway. + */ +bool tdx_is_private_mem(unsigned long phys) +{ + struct tdx_module_output out; + u64 sret; + + if (!platform_tdx_enabled()) + return false; + + /* Get page type from the TDX module */ + sret = __seamcall(TDH_PHYMEM_PAGE_RDMD, phys & PAGE_MASK, + 0, 0, 0, &out); + /* + * Handle the case that CPU isn't in VMX operation. + * + * KVM guarantees no VM is running (thus no TDX guest) + * when there's any online CPU isn't in VMX operation. + * This means there will be no TDX guest private memory + * and Secure-EPT pages. However the TDX module may have + * been initialized and the memory page could be PAMT. + */ + if (sret == TDX_SEAMCALL_UD) + return is_pamt_page(phys); + + /* + * Any other failure means: + * + * 1) TDX module not loaded; or + * 2) Memory page isn't managed by the TDX module. + * + * In either case, the memory page cannot be a TDX + * private page. + */ + if (sret) + return false; + + /* + * SEAMCALL was successful -- read page type (via RCX): + * + * - PT_NDA: Page is not used by the TDX module + * - PT_RSVD: Reserved for Non-TDX use + * - Others: Page is used by the TDX module + * + * Note PAMT pages are marked as PT_RSVD but they are also TDX + * private memory. + * + * Note: Even page type is PT_NDA, the memory page could still + * be associated with TDX private KeyID if the kernel hasn't + * explicitly used MOVDIR64B to clear the page. Assume KVM + * always does that after reclaiming any private page from TDX + * gusets. + */ + switch (out.rcx) { + case PT_NDA: + return false; + case PT_RSVD: + return is_pamt_page(phys); + default: + return true; + } +} + static int __init record_keyid_partitioning(u32 *tdx_keyid_start, u32 *nr_tdx_keyids) { diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h index f6b4e153890d..2fefd688924c 100644 --- a/arch/x86/virt/vmx/tdx/tdx.h +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -21,6 +21,7 @@ /* * TDX module SEAMCALL leaf functions */ +#define TDH_PHYMEM_PAGE_RDMD 24 #define TDH_SYS_KEY_CONFIG 31 #define TDH_SYS_INFO 32 #define TDH_SYS_INIT 33 @@ -28,6 +29,10 @@ #define TDH_SYS_TDMR_INIT 36 #define TDH_SYS_CONFIG 45 +/* TDX page types */ +#define PT_NDA 0x0 +#define PT_RSVD 0x1 + struct cmr_info { u64 base; u64 size; From patchwork Mon Jun 26 14:12:52 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13293020 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E2E9BEB64DA for ; Mon, 26 Jun 2023 14:17:48 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229709AbjFZORr (ORCPT ); Mon, 26 Jun 2023 10:17:47 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35890 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231174AbjFZORZ (ORCPT ); Mon, 26 Jun 2023 10:17:25 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4F25D2728; Mon, 26 Jun 2023 07:16:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1687789001; x=1719325001; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=kQktyAbtdZBS432xYsZBwAB18tndlfwU9Pgj+CVTLyg=; b=oIczBCxnT6/2YBLznOQ4m4f4jrC0VhoaDTSNcyVSgagHYuzABx2URTaD h9kNPl7NpZwnkkvcjV7hNNwSmixlYgY02Gj8f9fxkPzO2r7dWB4ly5w0+ bnPFg29WFN/z5meaPyApybsPZbFOTW0FxuDapkkwolVcV0okFDpOySzQJ 1UrJ6wV3FuuSB6FQV/dVq/8qN64/jDfu0p5DwGCyUrafqPyNGPsc0sHCw N3avSUFY2OQ1bYyDFGVLR3J1EK9et33MPafbERkd9UZ3ajLqzyuyIDu47 oIcx37klrwTTKt8vQFvS7m6mQALUGHtgKsSOD6KRDPZRW3QZn4oMDrBKL A==; X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="346034310" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="346034310" Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:16:04 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10753"; a="890292512" X-IronPort-AV: E=Sophos;i="6.01,159,1684825200"; d="scan'208";a="890292512" Received: from smithau-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.213.179.223]) by orsmga005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jun 2023 07:15:57 -0700 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, x86@kernel.org, dave.hansen@intel.com, kirill.shutemov@linux.intel.com, tony.luck@intel.com, peterz@infradead.org, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, hpa@zytor.com, seanjc@google.com, pbonzini@redhat.com, david@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, ashok.raj@intel.com, reinette.chatre@intel.com, len.brown@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, ying.huang@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, nik.borisov@suse.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v12 22/22] Documentation/x86: Add documentation for TDX host support Date: Tue, 27 Jun 2023 02:12:52 +1200 Message-Id: <180efee808936fcf017af4e79caec2b7111028a4.1687784645.git.kai.huang@intel.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Add documentation for TDX host kernel support. There is already one file Documentation/x86/tdx.rst containing documentation for TDX guest internals. Also reuse it for TDX host kernel support. Introduce a new level menu "TDX Guest Support" and move existing materials under it, and add a new menu for TDX host kernel support. Signed-off-by: Kai Huang --- v11 -> v12: - Removed "no CPUID/MSR to detect TDX module" related (Dave). - Fixed some spelling errors. --- Documentation/arch/x86/tdx.rst | 189 +++++++++++++++++++++++++++++++-- 1 file changed, 178 insertions(+), 11 deletions(-) diff --git a/Documentation/arch/x86/tdx.rst b/Documentation/arch/x86/tdx.rst index dc8d9fd2c3f7..f8017a2a663e 100644 --- a/Documentation/arch/x86/tdx.rst +++ b/Documentation/arch/x86/tdx.rst @@ -10,6 +10,173 @@ encrypting the guest memory. In TDX, a special module running in a special mode sits between the host and the guest and manages the guest/host separation. +TDX Host Kernel Support +======================= + +TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM) and +a new isolated range pointed by the SEAM Ranger Register (SEAMRR). A +CPU-attested software module called 'the TDX module' runs inside the new +isolated range to provide the functionalities to manage and run protected +VMs. + +TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to +provide crypto-protection to the VMs. TDX reserves part of MKTME KeyIDs +as TDX private KeyIDs, which are only accessible within the SEAM mode. +BIOS is responsible for partitioning legacy MKTME KeyIDs and TDX KeyIDs. + +Before the TDX module can be used to create and run protected VMs, it +must be loaded into the isolated range and properly initialized. The TDX +architecture doesn't require the BIOS to load the TDX module, but the +kernel assumes it is loaded by the BIOS. + +TDX boot-time detection +----------------------- + +The kernel detects TDX by detecting TDX private KeyIDs during kernel +boot. Below dmesg shows when TDX is enabled by BIOS:: + + [..] tdx: BIOS enabled: private KeyID range: [16, 64). + +TDX module initialization +--------------------------------------- + +The kernel talks to the TDX module via the new SEAMCALL instruction. The +TDX module implements SEAMCALL leaf functions to allow the kernel to +initialize it. + +If the TDX module isn't loaded, the SEAMCALL instruction fails with a +special error. In this case the kernel fails the module initialization +and reports the module isn't loaded:: + + [..] tdx: Module isn't loaded. + +Initializing the TDX module consumes roughly ~1/256th system RAM size to +use it as 'metadata' for the TDX memory. It also takes additional CPU +time to initialize those metadata along with the TDX module itself. Both +are not trivial. The kernel initializes the TDX module at runtime on +demand. + +Besides initializing the TDX module, a per-cpu initialization SEAMCALL +must be done on one cpu before any other SEAMCALLs can be made on that +cpu. + +The kernel provides two functions, tdx_enable() and tdx_cpu_enable() to +allow the user of TDX to enable the TDX module and enable TDX on local +cpu. + +Making SEAMCALL requires the CPU already being in VMX operation (VMXON +has been done). For now both tdx_enable() and tdx_cpu_enable() don't +handle VMXON internally, but depends on the caller to guarantee that. + +To enable TDX, the caller of TDX should: 1) hold read lock of CPU hotplug +lock; 2) do VMXON and tdx_enable_cpu() on all online cpus successfully; +3) call tdx_enable(). For example:: + + cpus_read_lock(); + on_each_cpu(vmxon_and_tdx_cpu_enable()); + ret = tdx_enable(); + cpus_read_unlock(); + if (ret) + goto no_tdx; + // TDX is ready to use + +And the caller of TDX must guarantee the tdx_cpu_enable() has been +successfully done on any cpu before it wants to run any other SEAMCALL. +A typical usage is do both VMXON and tdx_cpu_enable() in CPU hotplug +online callback, and refuse to online if tdx_cpu_enable() fails. + +User can consult dmesg to see the presence of the TDX module, and whether +it has been initialized. + +If the TDX module is not loaded, dmesg shows below:: + + [..] tdx: TDX module is not loaded. + +If the TDX module is initialized successfully, dmesg shows something +like below:: + + [..] tdx: TDX module: attributes 0x0, vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160 + [..] tdx: 262668 KBs allocated for PAMT. + [..] tdx: TDX module initialized. + +If the TDX module failed to initialize, dmesg also shows it failed to +initialize:: + + [..] tdx: TDX module initialization failed ... + +TDX Interaction to Other Kernel Components +------------------------------------------ + +TDX Memory Policy +~~~~~~~~~~~~~~~~~ + +TDX reports a list of "Convertible Memory Region" (CMR) to tell the +kernel which memory is TDX compatible. The kernel needs to build a list +of memory regions (out of CMRs) as "TDX-usable" memory and pass those +regions to the TDX module. Once this is done, those "TDX-usable" memory +regions are fixed during module's lifetime. + +To keep things simple, currently the kernel simply guarantees all pages +in the page allocator are TDX memory. Specifically, the kernel uses all +system memory in the core-mm at the time of initializing the TDX module +as TDX memory, and in the meantime, refuses to online any non-TDX-memory +in the memory hotplug. + +This can be enhanced in the future, i.e. by allowing adding non-TDX +memory to a separate NUMA node. In this case, the "TDX-capable" nodes +and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace +needs to guarantee memory pages for TDX guests are always allocated from +the "TDX-capable" nodes. + +Physical Memory Hotplug +~~~~~~~~~~~~~~~~~~~~~~~ + +Note TDX assumes convertible memory is always physically present during +machine's runtime. A non-buggy BIOS should never support hot-removal of +any convertible memory. This implementation doesn't handle ACPI memory +removal but depends on the BIOS to behave correctly. + +CPU Hotplug +~~~~~~~~~~~ + +TDX module requires the per-cpu initialization SEAMCALL (TDH.SYS.LP.INIT) +must be done on one cpu before any other SEAMCALLs can be made on that +cpu, including those involved during the module initialization. + +The kernel provides tdx_cpu_enable() to let the user of TDX to do it when +the user wants to use a new cpu for TDX task. + +TDX doesn't support physical (ACPI) CPU hotplug. During machine boot, +TDX verifies all boot-time present logical CPUs are TDX compatible before +enabling TDX. A non-buggy BIOS should never support hot-add/removal of +physical CPU. Currently the kernel doesn't handle physical CPU hotplug, +but depends on the BIOS to behave correctly. + +Note TDX works with CPU logical online/offline, thus the kernel still +allows to offline logical CPU and online it again. + +Kexec() +~~~~~~~ + +There are two problems in terms of using kexec() to boot to a new kernel +when the old kernel has enabled TDX: 1) Part of the memory pages are +still TDX private pages; 2) There might be dirty cachelines associated +with TDX private pages. + +The first problem doesn't matter. KeyID 0 doesn't have integrity check. +Even the new kernel wants use any non-zero KeyID, it needs to convert +the memory to that KeyID and such conversion would work from any KeyID. + +However the old kernel needs to guarantee there's no dirty cacheline +left behind before booting to the new kernel to avoid silent corruption +from later cacheline writeback (Intel hardware doesn't guarantee cache +coherency across different KeyIDs). + +Similar to AMD SME, the kernel just uses wbinvd() to flush cache before +booting to the new kernel. + +TDX Guest Support +================= Since the host cannot directly access guest registers or memory, much normal functionality of a hypervisor must be moved into the guest. This is implemented using a Virtualization Exception (#VE) that is handled by the @@ -20,7 +187,7 @@ TDX includes new hypercall-like mechanisms for communicating from the guest to the hypervisor or the TDX module. New TDX Exceptions -================== +------------------ TDX guests behave differently from bare-metal and traditional VMX guests. In TDX guests, otherwise normal instructions or memory accesses can cause @@ -30,7 +197,7 @@ Instructions marked with an '*' conditionally cause exceptions. The details for these instructions are discussed below. Instruction-based #VE ---------------------- +~~~~~~~~~~~~~~~~~~~~~ - Port I/O (INS, OUTS, IN, OUT) - HLT @@ -41,7 +208,7 @@ Instruction-based #VE - CPUID* Instruction-based #GP ---------------------- +~~~~~~~~~~~~~~~~~~~~~ - All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH, VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON @@ -52,7 +219,7 @@ Instruction-based #GP - RDMSR*,WRMSR* RDMSR/WRMSR Behavior --------------------- +~~~~~~~~~~~~~~~~~~~~ MSR access behavior falls into three categories: @@ -73,7 +240,7 @@ trapping and handling in the TDX module. Other than possibly being slow, these MSRs appear to function just as they would on bare metal. CPUID Behavior --------------- +~~~~~~~~~~~~~~ For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID return values (in guest EAX/EBX/ECX/EDX) are configurable by the @@ -93,7 +260,7 @@ not know how to handle. The guest kernel may ask the hypervisor for the value with a hypercall. #VE on Memory Accesses -====================== +---------------------- There are essentially two classes of TDX memory: private and shared. Private memory receives full TDX protections. Its content is protected @@ -107,7 +274,7 @@ entries. This helps ensure that a guest does not place sensitive information in shared memory, exposing it to the untrusted hypervisor. #VE on Shared Memory --------------------- +~~~~~~~~~~~~~~~~~~~~ Access to shared mappings can cause a #VE. The hypervisor ultimately controls whether a shared memory access causes a #VE, so the guest must be @@ -127,7 +294,7 @@ be careful not to access device MMIO regions unless it is also prepared to handle a #VE. #VE on Private Pages --------------------- +~~~~~~~~~~~~~~~~~~~~ An access to private mappings can also cause a #VE. Since all kernel memory is also private memory, the kernel might theoretically need to @@ -145,7 +312,7 @@ The hypervisor is permitted to unilaterally move accepted pages to a to handle the exception. Linux #VE handler -================= +----------------- Just like page faults or #GP's, #VE exceptions can be either handled or be fatal. Typically, an unhandled userspace #VE results in a SIGSEGV. @@ -167,7 +334,7 @@ While the block is in place, any #VE is elevated to a double fault (#DF) which is not recoverable. MMIO handling -============= +------------- In non-TDX VMs, MMIO is usually implemented by giving a guest access to a mapping which will cause a VMEXIT on access, and then the hypervisor @@ -189,7 +356,7 @@ MMIO access via other means (like structure overlays) may result in an oops. Shared Memory Conversions -========================= +------------------------- All TDX guest memory starts out as private at boot. This memory can not be accessed by the hypervisor. However, some kernel users like device