From patchwork Fri Dec 9 06:52:22 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13069313 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 86DB9C10F1B for ; Fri, 9 Dec 2022 06:53:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229712AbiLIGxU (ORCPT ); Fri, 9 Dec 2022 01:53:20 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39414 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229573AbiLIGxP (ORCPT ); Fri, 9 Dec 2022 01:53:15 -0500 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 093021F63B; Thu, 8 Dec 2022 22:53:06 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1670568786; x=1702104786; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=tpNWTLiIlps42i61R7np9UDYahZOICSBhQnhpwefLwQ=; b=SR5fs889scheak+u+Ts2Zs1nPhUdXNVNs+YR7+OztS3dzl3ahrlIIydp Zpn2RuYR+CaoiLoLEIPCCKS1kTuIlUjigGVOy4tc1Y4eiUlaeDOHfoJwJ i1NzXz2KJDSxrRM2l7rg8DLgMfKWnv2bxJXmZmdfR9gohWlNNVe6TqS/E 9UfLxgLtMsdC4DkYlOP8YStZddqywCpPkXifyLE4maG5wChGIZHig82Z9 BKCgJr4YUiGwdtpjGLJJAHGCuspJbvW1kw1d58MKZFO6mg74wk3PmiBq8 DhcaEwuYPJmuCe8BJAXY+bb57vNmMblBl1Av1PsGHK0A0i375XkMe3NVJ A==; X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="318551253" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="318551253" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:05 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="679836792" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="679836792" Received: from omiramon-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.212.28.82]) by orsmga001-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:00 -0800 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, dave.hansen@intel.com, peterz@infradead.org, tglx@linutronix.de, seanjc@google.com, pbonzini@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, kirill.shutemov@linux.intel.com, ying.huang@intel.com, reinette.chatre@intel.com, len.brown@intel.com, tony.luck@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v8 01/16] x86/tdx: Define TDX supported page sizes as macros Date: Fri, 9 Dec 2022 19:52:22 +1300 Message-Id: <2436106b4de8df75ab304c43128fe5b174bb0bf4.1670566861.git.kai.huang@intel.com> X-Mailer: git-send-email 2.38.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org TDX supports 4K, 2M and 1G page sizes. The corresponding values are defined by the TDX module spec and used as TDX module ABI. Currently, they are used in try_accept_one() when the TDX guest tries to accept a page. However currently try_accept_one() uses hard-coded magic values. Define TDX supported page sizes as macros and get rid of the hard-coded values in try_accept_one(). TDX host support will need to use them too. Reviewed-by: Kirill A. Shutemov Signed-off-by: Kai Huang Reviewed-by: Dave Hansen --- v7 -> v8: - Improved the comment of TDX supported page sizes macros (Dave) v6 -> v7: - Removed the helper to convert kernel page level to TDX page level. - Changed to use macro to define TDX supported page sizes. --- arch/x86/coco/tdx/tdx.c | 6 +++--- arch/x86/include/asm/tdx.h | 5 +++++ 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c index cfd4c95b9f04..7fa7fb54f438 100644 --- a/arch/x86/coco/tdx/tdx.c +++ b/arch/x86/coco/tdx/tdx.c @@ -722,13 +722,13 @@ static bool try_accept_one(phys_addr_t *start, unsigned long len, */ switch (pg_level) { case PG_LEVEL_4K: - page_size = 0; + page_size = TDX_PS_4K; break; case PG_LEVEL_2M: - page_size = 1; + page_size = TDX_PS_2M; break; case PG_LEVEL_1G: - page_size = 2; + page_size = TDX_PS_1G; break; default: return false; diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 28d889c9aa16..25fd6070dc0b 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -20,6 +20,11 @@ #ifndef __ASSEMBLY__ +/* TDX supported page sizes from the TDX module ABI. */ +#define TDX_PS_4K 0 +#define TDX_PS_2M 1 +#define TDX_PS_1G 2 + /* * Used to gather the output registers values of the TDCALL and SEAMCALL * instructions when requesting services from the TDX module. From patchwork Fri Dec 9 06:52:23 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13069314 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 80CFBC4167B for ; Fri, 9 Dec 2022 06:53:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229776AbiLIGxW (ORCPT ); Fri, 9 Dec 2022 01:53:22 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39478 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229634AbiLIGxQ (ORCPT ); Fri, 9 Dec 2022 01:53:16 -0500 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2F46D23143; Thu, 8 Dec 2022 22:53:11 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1670568791; x=1702104791; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=c4Tr/8DzYU5E0n8ipC6EgxopAIOBeB5hOKwrsAY0wbU=; b=FbfQtTt0HYmcfbG8cmznLVLwljA3v9thg4zJLdjl8rWr9UIOXfRC8xuY +OmUrAQ2pIo+TF9wtISUK7D5QILaXiwc/fjCFD7BermCp+/wgROlbj13I mpNYZev9BDBGOThd3cDa6P94nqyu0snUCfiu8SekVtDtqT8RsdoJm45aA xA/tiHOtWI6vDJi++W+iHakxFvex25sZDx3E5AyC4ExrUC/c2RVv/bvrP 4Kavpfn3xCqtvLJLcWQcgwFBe0HRY10d1DXpBhtBwi+z0ZcoA1Pkp3HDD 1jtlqNc+WaPfYlaGrcJbxlFZj2ofNOHN2R7TaTYf7jKaKOjmn9hgdeBRe w==; X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="318551272" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="318551272" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:10 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="679836814" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="679836814" Received: from omiramon-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.212.28.82]) by orsmga001-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:05 -0800 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, dave.hansen@intel.com, peterz@infradead.org, tglx@linutronix.de, seanjc@google.com, pbonzini@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, kirill.shutemov@linux.intel.com, ying.huang@intel.com, reinette.chatre@intel.com, len.brown@intel.com, tony.luck@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v8 02/16] x86/virt/tdx: Detect TDX during kernel boot Date: Fri, 9 Dec 2022 19:52:23 +1300 Message-Id: <75317b5b6f29e77b98455b0401bd903964f50814.1670566861.git.kai.huang@intel.com> X-Mailer: git-send-email 2.38.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Intel Trust Domain Extensions (TDX) protects guest VMs from malicious host and certain physical attacks. A CPU-attested software module called 'the TDX module' runs inside a new isolated memory range as a trusted hypervisor to manage and run protected VMs. Pre-TDX Intel hardware has support for a memory encryption architecture called MKTME. The memory encryption hardware underpinning MKTME is also used for Intel TDX. TDX ends up "stealing" some of the physical address space from the MKTME architecture for crypto-protection to VMs. The BIOS is responsible for partitioning the "KeyID" space between legacy MKTME and TDX. The KeyIDs reserved for TDX are called 'TDX private KeyIDs' or 'TDX KeyIDs' for short. TDX doesn't trust the BIOS. During machine boot, TDX verifies the TDX private KeyIDs are consistently and correctly programmed by the BIOS across all CPU packages before it enables TDX on any CPU core. A valid TDX private KeyID range on BSP indicates TDX has been enabled by the BIOS, otherwise the BIOS is buggy. The TDX module is expected to be loaded by the BIOS when it enables TDX, but the kernel needs to properly initialize it before it can be used to create and run any TDX guests. The TDX module will be initialized by the KVM subsystem when KVM wants to use TDX. Add a new early_initcall(tdx_init) to detect the TDX private KeyIDs. Both TDX module initialization and creating TDX guest require to use TDX private KeyID. Also add a function to report whether TDX is enabled by the BIOS (TDX KeyID range is valid). Similar to AMD SME, kexec() will use it to determine whether cache flush is needed. To start to support TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for TDX host kernel support. Add a new Kconfig option CONFIG_INTEL_TDX_HOST to opt-in TDX host kernel support (to distinguish with TDX guest kernel support). So far only KVM uses TDX. Make the new config option depend on KVM_INTEL. Reviewed-by: Kirill A. Shutemov Signed-off-by: Kai Huang --- v7 -> v8: (address Dave's comments) - Improved changelog: - "KVM user" -> "The TDX module will be initialized by KVM when ..." - Changed "tdx_int" part to "Just say what this patch is doing" - Fixed the last sentence of "kexec()" paragraph - detect_tdx() -> record_keyid_partitioning() - Improved how to calculate tdx_keyid_start. - tdx_keyid_num -> nr_tdx_keyids. - Improved dmesg printing. - Add comment to clear_tdx(). v6 -> v7: - No change. v5 -> v6: - Removed SEAMRR detection to make code simpler. - Removed the 'default N' in the KVM_TDX_HOST Kconfig (Kirill). - Changed to use 'obj-y' in arch/x86/virt/vmx/tdx/Makefile (Kirill). --- arch/x86/Kconfig | 12 +++++ arch/x86/Makefile | 2 + arch/x86/include/asm/tdx.h | 7 +++ arch/x86/virt/Makefile | 2 + arch/x86/virt/vmx/Makefile | 2 + arch/x86/virt/vmx/tdx/Makefile | 2 + arch/x86/virt/vmx/tdx/tdx.c | 90 ++++++++++++++++++++++++++++++++++ arch/x86/virt/vmx/tdx/tdx.h | 15 ++++++ 8 files changed, 132 insertions(+) create mode 100644 arch/x86/virt/Makefile create mode 100644 arch/x86/virt/vmx/Makefile create mode 100644 arch/x86/virt/vmx/tdx/Makefile create mode 100644 arch/x86/virt/vmx/tdx/tdx.c create mode 100644 arch/x86/virt/vmx/tdx/tdx.h diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 67745ceab0db..cced4ef3bfb2 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1953,6 +1953,18 @@ config X86_SGX If unsure, say N. +config INTEL_TDX_HOST + bool "Intel Trust Domain Extensions (TDX) host support" + depends on CPU_SUP_INTEL + depends on X86_64 + depends on KVM_INTEL + help + Intel Trust Domain Extensions (TDX) protects guest VMs from malicious + host and certain physical attacks. This option enables necessary TDX + support in host kernel to run protected VMs. + + If unsure, say N. + config EFI bool "EFI runtime service support" depends on ACPI diff --git a/arch/x86/Makefile b/arch/x86/Makefile index 415a5d138de4..38d3e8addc5f 100644 --- a/arch/x86/Makefile +++ b/arch/x86/Makefile @@ -246,6 +246,8 @@ archheaders: libs-y += arch/x86/lib/ +core-y += arch/x86/virt/ + # drivers-y are linked after core-y drivers-$(CONFIG_MATH_EMULATION) += arch/x86/math-emu/ drivers-$(CONFIG_PCI) += arch/x86/pci/ diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 25fd6070dc0b..4dfe2e794411 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -94,5 +94,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, return -ENODEV; } #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */ + +#ifdef CONFIG_INTEL_TDX_HOST +bool platform_tdx_enabled(void); +#else /* !CONFIG_INTEL_TDX_HOST */ +static inline bool platform_tdx_enabled(void) { return false; } +#endif /* CONFIG_INTEL_TDX_HOST */ + #endif /* !__ASSEMBLY__ */ #endif /* _ASM_X86_TDX_H */ diff --git a/arch/x86/virt/Makefile b/arch/x86/virt/Makefile new file mode 100644 index 000000000000..1e36502cd738 --- /dev/null +++ b/arch/x86/virt/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only +obj-y += vmx/ diff --git a/arch/x86/virt/vmx/Makefile b/arch/x86/virt/vmx/Makefile new file mode 100644 index 000000000000..feebda21d793 --- /dev/null +++ b/arch/x86/virt/vmx/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only +obj-$(CONFIG_INTEL_TDX_HOST) += tdx/ diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile new file mode 100644 index 000000000000..93ca8b73e1f1 --- /dev/null +++ b/arch/x86/virt/vmx/tdx/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only +obj-y += tdx.o diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c new file mode 100644 index 000000000000..292852773ced --- /dev/null +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -0,0 +1,90 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright(c) 2022 Intel Corporation. + * + * Intel Trusted Domain Extensions (TDX) support + */ + +#define pr_fmt(fmt) "tdx: " fmt + +#include +#include +#include +#include +#include +#include +#include +#include "tdx.h" + +static u32 tdx_keyid_start __ro_after_init; +static u32 nr_tdx_keyids __ro_after_init; + +/* + * tdx_keyid_start and nr_tdx_keyids indicate that TDX is uninitialized. + * This is used in TDX initialization error paths to take it from + * initialized -> uninitialized. + */ +static void __init clear_tdx(void) +{ + tdx_keyid_start = nr_tdx_keyids = 0; +} + +static int __init record_keyid_partitioning(void) +{ + u32 nr_mktme_keyids; + int ret; + + /* + * IA32_MKTME_KEYID_PARTIONING: + * Bit [31:0]: Number of MKTME KeyIDs. + * Bit [63:32]: Number of TDX private KeyIDs. + */ + ret = rdmsr_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &nr_mktme_keyids, + &nr_tdx_keyids); + if (ret) + return -ENODEV; + + if (!nr_tdx_keyids) + return -ENODEV; + + /* TDX KeyIDs start after the last MKTME KeyID. */ + tdx_keyid_start = nr_mktme_keyids + 1; + + pr_info("BIOS enabled: private KeyID range [%u, %u)\n", + tdx_keyid_start, tdx_keyid_start + nr_tdx_keyids); + + return 0; +} + +static int __init tdx_init(void) +{ + int err; + + err = record_keyid_partitioning(); + if (err) + return err; + + /* + * Initializing the TDX module requires one TDX private KeyID. + * If there's only one TDX KeyID then after module initialization + * KVM won't be able to run any TDX guest, which makes the whole + * thing worthless. Just disable TDX in this case. + */ + if (nr_tdx_keyids < 2) { + pr_info("initialization failed: too few private KeyIDs available (%d).\n", + nr_tdx_keyids); + goto no_tdx; + } + + return 0; +no_tdx: + clear_tdx(); + return -ENODEV; +} +early_initcall(tdx_init); + +/* Return whether the BIOS has enabled TDX */ +bool platform_tdx_enabled(void) +{ + return !!nr_tdx_keyids; +} diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h new file mode 100644 index 000000000000..d00074abcb20 --- /dev/null +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -0,0 +1,15 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _X86_VIRT_TDX_H +#define _X86_VIRT_TDX_H + +/* + * This file contains both macros and data structures defined by the TDX + * architecture and Linux defined software data structures and functions. + * The two should not be mixed together for better readability. The + * architectural definitions come first. + */ + +/* MSR to report KeyID partitioning between MKTME and TDX */ +#define MSR_IA32_MKTME_KEYID_PARTITIONING 0x00000087 + +#endif From patchwork Fri Dec 9 06:52:24 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13069315 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B5203C4332F for ; Fri, 9 Dec 2022 06:53:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229757AbiLIGx0 (ORCPT ); Fri, 9 Dec 2022 01:53:26 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39476 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229675AbiLIGxR (ORCPT ); Fri, 9 Dec 2022 01:53:17 -0500 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 31D2626C; Thu, 8 Dec 2022 22:53:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1670568796; x=1702104796; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=CvOOEOyK2b3ch9pD2o6ehuG6OyGC9zRQ6+5y2KsnA+o=; b=noQnsf+vRzgnWLSFBOWDa77igmaP8TXnTLfexxV/Wbx9UYX+rR3Mz3em Xew9Bb5WfFhAU3HuldmD/97JmyMtbaqzl52r1kHcwbrZ6u4zz+q57MhZv L67hBTQXfS1AD0TkQ2IkqMK1fbW9nHGDn7VdsT7JZ/jkg5ggr+PIt7mZg fhPNN+TZf3dPepnCKthpWVvq21fD0lcXzldv+lfLl5R+QTTJ0Ip3OFkKk kLdDCFExTJRcL4X09Y5GnC2+dpSyk0+aSNFLZAZCc0HaxZ+h+deWV1LCZ 1/gbXhtTTJnUorCP3MYnIxXlSjbFPQBm5UbL9SG/IbzUvgQhpE1Nh14fR Q==; X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="318551286" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="318551286" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:15 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="679836822" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="679836822" Received: from omiramon-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.212.28.82]) by orsmga001-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:11 -0800 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, dave.hansen@intel.com, peterz@infradead.org, tglx@linutronix.de, seanjc@google.com, pbonzini@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, kirill.shutemov@linux.intel.com, ying.huang@intel.com, reinette.chatre@intel.com, len.brown@intel.com, tony.luck@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v8 03/16] x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC Date: Fri, 9 Dec 2022 19:52:24 +1300 Message-Id: <514a6c330ef5fc214d32bcbcc7a77dc825e857df.1670566861.git.kai.huang@intel.com> X-Mailer: git-send-email 2.38.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org TDX capable platforms are locked to X2APIC mode and cannot fall back to the legacy xAPIC mode when TDX is enabled by the BIOS. TDX host support requires x2APIC. Make INTEL_TDX_HOST depend on X86_X2APIC. Link: https://lore.kernel.org/lkml/ba80b303-31bf-d44a-b05d-5c0f83038798@intel.com/ Signed-off-by: Kai Huang Reviewed-by: Dave Hansen --- v7 -> v8: (Dave) - Only make INTEL_TDX_HOST depend on X86_X2APIC but removed other code - Rewrote the changelog. v6 -> v7: - Changed to use "Link" for the two lore links to get rid of checkpatch warning. --- arch/x86/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index cced4ef3bfb2..dd333b46fafb 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1958,6 +1958,7 @@ config INTEL_TDX_HOST depends on CPU_SUP_INTEL depends on X86_64 depends on KVM_INTEL + depends on X86_X2APIC help Intel Trust Domain Extensions (TDX) protects guest VMs from malicious host and certain physical attacks. This option enables necessary TDX From patchwork Fri Dec 9 06:52:25 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13069316 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 458CFC4167B for ; Fri, 9 Dec 2022 06:53:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229809AbiLIGxf (ORCPT ); Fri, 9 Dec 2022 01:53:35 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39540 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229773AbiLIGxW (ORCPT ); Fri, 9 Dec 2022 01:53:22 -0500 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 620D3264BA; Thu, 8 Dec 2022 22:53:21 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1670568801; x=1702104801; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=LYSwgfSkyHyPWSC3nXmZTAwd1hEPHDYubQaUW0uEPEM=; b=k7JO4GsQfT7Bx6rwI0e22mzLivH8YQr7WMKLXO92gyFnLjoH7B+5dGFo 84ovYe/dpLiwEZM9l6h29nTkhz9hab+etBYrF1cLM+vlqIKWkRrmcmOdg NEMFh1u32WXl0dSyO2kio8LhhrdN2rxJ2L4AfXJVydJCltFsYV04sEk94 nY69aoBBYEWpOzjbvOmW6MlBgfhyeyowk2LpiSLP4qeieDycde2ZSuhvo Dh3lKQE+BO6S/NDh3ZMWJzX/+izl3hY5+W5rJwCH4wETKCtc6xSGkZQsb DJwU3j4+W4nYCRkyNr1bjBo52m8rY7bFJ1E+A+a/ez3s0d42pyNCyzGEK Q==; X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="318551303" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="318551303" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:21 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="679836829" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="679836829" Received: from omiramon-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.212.28.82]) by orsmga001-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:16 -0800 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, dave.hansen@intel.com, peterz@infradead.org, tglx@linutronix.de, seanjc@google.com, pbonzini@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, kirill.shutemov@linux.intel.com, ying.huang@intel.com, reinette.chatre@intel.com, len.brown@intel.com, tony.luck@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v8 04/16] x86/virt/tdx: Add skeleton to initialize TDX on demand Date: Fri, 9 Dec 2022 19:52:25 +1300 Message-Id: <5ea3f199c0d6f9f3e1738fa5c211c52d4d618e84.1670566861.git.kai.huang@intel.com> X-Mailer: git-send-email 2.38.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Before the TDX module can be used to create and run TDX guests, it must be loaded and properly initialized. The TDX module is expected to be loaded by the BIOS, and to be initialized by the kernel. TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). The host kernel communicates with the TDX module via a new SEAMCALL instruction. The TDX module implements a set of SEAMCALL leaf functions to allow the host kernel to initialize it. The TDX module can be initialized only once in its lifetime. Instead of always initializing it at boot time, this implementation chooses an "on demand" approach to initialize TDX until there is a real need (e.g when requested by KVM). This approach has below pros: 1) It avoids consuming the memory that must be allocated by kernel and given to the TDX module as metadata (~1/256th of the TDX-usable memory), and also saves the CPU cycles of initializing the TDX module (and the metadata) when TDX is not used at all. 2) The TDX module design allows it to be updated while the system is running. The update procedure shares quite a few steps with this "on demand" loading mechanism. The hope is that much of "on demand" mechanism can be shared with a future "update" mechanism. A boot-time TDX module implementation would not be able to share much code with the update mechanism. 3) Loading the TDX module requires VMX to be enabled. Currently, only the kernel KVM code mucks with VMX enabling. If the TDX module were to be initialized separately from KVM (like at boot), the boot code would need to be taught how to muck with VMX enabling and KVM would need to be taught how to cope with that. Making KVM itself responsible for TDX initialization lets the rest of the kernel stay blissfully unaware of VMX. Add a placeholder tdx_enable() to initialize the TDX module on demand. Use a state machine protected by mutex to make sure the initialization will only be done once, as it can be called multiple times (i.e. KVM module can be reloaded) and be called concurrently by other kernel components in the future. The TDX module will be initialized in multi-steps defined by the TDX module and most of those steps involve a specific SEAMCALL: 1) Get the TDX module information and TDX-capable memory regions (TDH.SYS.INFO). 2) Build the list of TDX-usable memory regions. 3) Construct a list of "TD Memory Regions" (TDMRs) to cover all TDX-usable memory regions. 4) Pick up one TDX private KeyID as the global KeyID. 5) Configure the TDMRs and the global KeyID to the TDX module (TDH.SYS.CONFIG). 6) Configure the global KeyID on all packages (TDH.SYS.KEY.CONFIG). 7) Initialize all TDMRs (TDH.SYS.TDMR.INIT). Reviewed-by: Chao Gao Signed-off-by: Kai Huang --- v7 -> v8: - Refined changelog (Dave). - Removed "all BIOS-enabled cpus" related code (Peter/Thomas/Dave). - Add a "TODO list" comment in init_tdx_module() to list all steps of initializing the TDX Module to tell the story (Dave). - Made tdx_enable() unverisally return -EINVAL, and removed nonsense comments (Dave). - Simplified __tdx_enable() to only handle success or failure. - TDX_MODULE_SHUTDOWN -> TDX_MODULE_ERROR - Removed TDX_MODULE_NONE (not loaded) as it is not necessary. - Improved comments (Dave). - Pointed out 'tdx_module_status' is software thing (Dave). v6 -> v7: - No change. v5 -> v6: - Added code to set status to TDX_MODULE_NONE if TDX module is not loaded (Chao) - Added Chao's Reviewed-by. - Improved comments around cpus_read_lock(). - v3->v5 (no feedback on v4): - Removed the check that SEAMRR and TDX KeyID have been detected on all present cpus. - Removed tdx_detect(). - Added num_online_cpus() to MADT-enabled CPUs check within the CPU hotplug lock and return early with error message. - Improved dmesg printing for TDX module detection and initialization. --- arch/x86/include/asm/tdx.h | 2 + arch/x86/virt/vmx/tdx/tdx.c | 93 +++++++++++++++++++++++++++++++++++++ 2 files changed, 95 insertions(+) diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 4dfe2e794411..4a3ee64c1ca7 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -97,8 +97,10 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, #ifdef CONFIG_INTEL_TDX_HOST bool platform_tdx_enabled(void); +int tdx_enable(void); #else /* !CONFIG_INTEL_TDX_HOST */ static inline bool platform_tdx_enabled(void) { return false; } +static inline int tdx_enable(void) { return -EINVAL; } #endif /* CONFIG_INTEL_TDX_HOST */ #endif /* !__ASSEMBLY__ */ diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 292852773ced..ace9770e5e08 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -12,13 +12,25 @@ #include #include #include +#include #include #include #include "tdx.h" +/* Kernel defined TDX module status during module initialization. */ +enum tdx_module_status_t { + TDX_MODULE_UNKNOWN, + TDX_MODULE_INITIALIZED, + TDX_MODULE_ERROR +}; + static u32 tdx_keyid_start __ro_after_init; static u32 nr_tdx_keyids __ro_after_init; +static enum tdx_module_status_t tdx_module_status; +/* Prevent concurrent attempts on TDX detection and initialization */ +static DEFINE_MUTEX(tdx_module_lock); + /* * tdx_keyid_start and nr_tdx_keyids indicate that TDX is uninitialized. * This is used in TDX initialization error paths to take it from @@ -88,3 +100,84 @@ bool platform_tdx_enabled(void) { return !!nr_tdx_keyids; } + +static int init_tdx_module(void) +{ + /* + * TODO: + * + * - Get TDX module information and TDX-capable memory regions. + * - Build the list of TDX-usable memory regions. + * - Construct a list of TDMRs to cover all TDX-usable memory + * regions. + * - Pick up one TDX private KeyID as the global KeyID. + * - Configure the TDMRs and the global KeyID to the TDX module. + * - Configure the global KeyID on all packages. + * - Initialize all TDMRs. + * + * Return error before all steps are done. + */ + return -EINVAL; +} + +static int __tdx_enable(void) +{ + int ret; + + ret = init_tdx_module(); + if (ret) { + pr_err_once("initialization failed (%d)\n", ret); + tdx_module_status = TDX_MODULE_ERROR; + /* + * Just return one universal error code. + * For now the caller cannot recover anyway. + */ + return -EINVAL; + } + + pr_info_once("TDX module initialized.\n"); + tdx_module_status = TDX_MODULE_INITIALIZED; + + return 0; +} + +/** + * tdx_enable - Enable TDX by initializing the TDX module + * + * The caller must make sure all online cpus are in VMX operation before + * calling this function. + * + * This function can be called in parallel by multiple callers. + * + * Return 0 if TDX is enabled successfully, otherwise error. + */ +int tdx_enable(void) +{ + int ret; + + if (!platform_tdx_enabled()) { + pr_err_once("initialization failed: TDX is disabled.\n"); + return -EINVAL; + } + + mutex_lock(&tdx_module_lock); + + switch (tdx_module_status) { + case TDX_MODULE_UNKNOWN: + ret = __tdx_enable(); + break; + case TDX_MODULE_INITIALIZED: + /* Already initialized, great, tell the caller. */ + ret = 0; + break; + default: + /* Failed to initialize in the previous attempts */ + ret = -EINVAL; + break; + } + + mutex_unlock(&tdx_module_lock); + + return ret; +} +EXPORT_SYMBOL_GPL(tdx_enable); From patchwork Fri Dec 9 06:52:26 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13069317 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0F510C4332F for ; Fri, 9 Dec 2022 06:54:04 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229952AbiLIGyB (ORCPT ); Fri, 9 Dec 2022 01:54:01 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39896 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229550AbiLIGxz (ORCPT ); Fri, 9 Dec 2022 01:53:55 -0500 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8B1612FA42; Thu, 8 Dec 2022 22:53:26 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1670568806; x=1702104806; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=hJUNPHDZg7gXRSiBMx9nWTsOS5wt7XzKj3iuBZto+M0=; b=GTd45CsFn8SKCOJnn5J87LusSNM9v6cXI83jPLAg1CXx/xFFB1HudajY XQRRuRQm7xRjFVzCPR72DLaRM83sHm8eJ5heRYVpUSSeJ8TyERqg9Ssbs NIWRXudzAciojE7Nc1WmbrekUF3/DADsFVKZgf8bB1id03h2wv7RmFjWr Z/JXfF5KFmD140oaP0Et/KVO1n9ckotCHjl3oEpPRul3FGM3IM/KV2fhS 41uhbYbLo8TAWBod7cr9SyahrWS1xssljBsHz1XuCCxFpFe78uoF/kEOu uEiEmj7TreeAq0agRZQe4VOiZsgMeske/UjPf6MXUGS7ZJt4lXC/dWTCG w==; X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="318551318" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="318551318" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:26 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="679836851" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="679836851" Received: from omiramon-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.212.28.82]) by orsmga001-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:21 -0800 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, dave.hansen@intel.com, peterz@infradead.org, tglx@linutronix.de, seanjc@google.com, pbonzini@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, kirill.shutemov@linux.intel.com, ying.huang@intel.com, reinette.chatre@intel.com, len.brown@intel.com, tony.luck@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v8 05/16] x86/virt/tdx: Implement functions to make SEAMCALL Date: Fri, 9 Dec 2022 19:52:26 +1300 Message-Id: X-Mailer: git-send-email 2.38.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). This mode runs only the TDX module itself or other code to load the TDX module. The host kernel communicates with SEAM software via a new SEAMCALL instruction. This is conceptually similar to a guest->host hypercall, except it is made from the host to SEAM software instead. The TDX module establishes a new SEAMCALL ABI which allows the host to initialize the module and to manage VMs. Add infrastructure to make SEAMCALLs. The SEAMCALL ABI is very similar to the TDCALL ABI and leverages much TDCALL infrastructure. SEAMCALL instruction causes #GP when TDX isn't BIOS enabled, and #UD when CPU is not in VMX operation. The current TDX_MODULE_CALL macro doesn't handle any of them. There's no way to check whether the CPU is in VMX operation or not. Initializing the TDX module is done at runtime on demand, and it depends on the caller to ensure CPU is in VMX operation before making SEAMCALL. To avoid getting Oops when the caller mistakenly tries to initialize the TDX module when CPU is not in VMX operation, extend the TDX_MODULE_CALL macro to handle #UD (and opportunistically #GP since they share the same assembly). Introduce two new TDX error codes for #UD and #GP respectively so the caller can distinguish. Also, Opportunistically put the new TDX error codes and the existing TDX_SEAMCALL_VMFAILINVALID into INTEL_TDX_HOST Kconfig option as they are only used when it is on. Any failure during the module initialization is not recoverable for now. Print out error message when SEAMCALL failed depending on the error code to help the user to understand what went wrong. Signed-off-by: Kai Huang --- v7 -> v8: - Improved changelog (Dave): - Trim down some sentences (Dave). - Removed __seamcall() and seamcall() function name and changed accordingly (Dave). - Improved the sentence explaining why to handle #GP (Dave). - Added code to print out error message in seamcall(), following the idea that tdx_enable() to return universal error and print out error message to make clear what's going wrong (Dave). Also mention this in changelog. v6 -> v7: - No change. v5 -> v6: - Added code to handle #UD and #GP (Dave). - Moved the seamcall() wrapper function to this patch, and used a temporary __always_unused to avoid compile warning (Dave). - v3 -> v5 (no feedback on v4): - Explicitly tell TDX_SEAMCALL_VMFAILINVALID is returned if the SEAMCALL itself fails. - Improve the changelog. --- arch/x86/include/asm/tdx.h | 9 ++++++ arch/x86/virt/vmx/tdx/Makefile | 2 +- arch/x86/virt/vmx/tdx/seamcall.S | 52 ++++++++++++++++++++++++++++++++ arch/x86/virt/vmx/tdx/tdx.c | 49 ++++++++++++++++++++++++++++++ arch/x86/virt/vmx/tdx/tdx.h | 10 ++++++ arch/x86/virt/vmx/tdx/tdxcall.S | 19 ++++++++++-- 6 files changed, 138 insertions(+), 3 deletions(-) create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h index 4a3ee64c1ca7..5c5ecfddb15b 100644 --- a/arch/x86/include/asm/tdx.h +++ b/arch/x86/include/asm/tdx.h @@ -8,6 +8,10 @@ #include #include +#ifdef CONFIG_INTEL_TDX_HOST + +#include + /* * SW-defined error codes. * @@ -18,6 +22,11 @@ #define TDX_SW_ERROR (TDX_ERROR | GENMASK_ULL(47, 40)) #define TDX_SEAMCALL_VMFAILINVALID (TDX_SW_ERROR | _UL(0xFFFF0000)) +#define TDX_SEAMCALL_GP (TDX_SW_ERROR | X86_TRAP_GP) +#define TDX_SEAMCALL_UD (TDX_SW_ERROR | X86_TRAP_UD) + +#endif + #ifndef __ASSEMBLY__ /* TDX supported page sizes from the TDX module ABI. */ diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile index 93ca8b73e1f1..38d534f2c113 100644 --- a/arch/x86/virt/vmx/tdx/Makefile +++ b/arch/x86/virt/vmx/tdx/Makefile @@ -1,2 +1,2 @@ # SPDX-License-Identifier: GPL-2.0-only -obj-y += tdx.o +obj-y += tdx.o seamcall.o diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S new file mode 100644 index 000000000000..f81be6b9c133 --- /dev/null +++ b/arch/x86/virt/vmx/tdx/seamcall.S @@ -0,0 +1,52 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#include +#include + +#include "tdxcall.S" + +/* + * __seamcall() - Host-side interface functions to SEAM software module + * (the P-SEAMLDR or the TDX module). + * + * Transform function call register arguments into the SEAMCALL register + * ABI. Return TDX_SEAMCALL_VMFAILINVALID if the SEAMCALL itself fails, + * or the completion status of the SEAMCALL leaf function. Additional + * output operands are saved in @out (if it is provided by the caller). + * + *------------------------------------------------------------------------- + * SEAMCALL ABI: + *------------------------------------------------------------------------- + * Input Registers: + * + * RAX - SEAMCALL Leaf number. + * RCX,RDX,R8-R9 - SEAMCALL Leaf specific input registers. + * + * Output Registers: + * + * RAX - SEAMCALL completion status code. + * RCX,RDX,R8-R11 - SEAMCALL Leaf specific output registers. + * + *------------------------------------------------------------------------- + * + * __seamcall() function ABI: + * + * @fn (RDI) - SEAMCALL Leaf number, moved to RAX + * @rcx (RSI) - Input parameter 1, moved to RCX + * @rdx (RDX) - Input parameter 2, moved to RDX + * @r8 (RCX) - Input parameter 3, moved to R8 + * @r9 (R8) - Input parameter 4, moved to R9 + * + * @out (R9) - struct tdx_module_output pointer + * stored temporarily in R12 (not + * used by the P-SEAMLDR or the TDX + * module). It can be NULL. + * + * Return (via RAX) the completion status of the SEAMCALL, or + * TDX_SEAMCALL_VMFAILINVALID. + */ +SYM_FUNC_START(__seamcall) + FRAME_BEGIN + TDX_MODULE_CALL host=1 + FRAME_END + RET +SYM_FUNC_END(__seamcall) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index ace9770e5e08..b7cedf0589db 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -101,6 +101,55 @@ bool platform_tdx_enabled(void) return !!nr_tdx_keyids; } +/* + * Wrapper of __seamcall() to convert SEAMCALL leaf function error code + * to kernel error code. @seamcall_ret and @out contain the SEAMCALL + * leaf function return code and the additional output respectively if + * not NULL. + */ +static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, + u64 *seamcall_ret, + struct tdx_module_output *out) +{ + u64 sret; + + sret = __seamcall(fn, rcx, rdx, r8, r9, out); + + /* Save SEAMCALL return code if the caller wants it */ + if (seamcall_ret) + *seamcall_ret = sret; + + /* SEAMCALL was successful */ + if (!sret) + return 0; + + switch (sret) { + case TDX_SEAMCALL_GP: + /* + * tdx_enable() has already checked that BIOS has + * enabled TDX at the very beginning before going + * forward. It's likely a firmware bug if the + * SEAMCALL still caused #GP. + */ + pr_err_once("[firmware bug]: TDX is not enabled by BIOS.\n"); + return -ENODEV; + case TDX_SEAMCALL_VMFAILINVALID: + pr_err_once("TDX module is not loaded.\n"); + return -ENODEV; + case TDX_SEAMCALL_UD: + pr_err_once("CPU is not in VMX operation.\n"); + return -EINVAL; + default: + pr_err_once("SEAMCALL failed: leaf %llu, error 0x%llx.\n", + fn, sret); + if (out) + pr_err_once("additional output: rcx 0x%llx, rdx 0x%llx, r8 0x%llx, r9 0x%llx, r10 0x%llx, r11 0x%llx.\n", + out->rcx, out->rdx, out->r8, + out->r9, out->r10, out->r11); + return -EIO; + } +} + static int init_tdx_module(void) { /* diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h index d00074abcb20..884357a4133c 100644 --- a/arch/x86/virt/vmx/tdx/tdx.h +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -2,6 +2,8 @@ #ifndef _X86_VIRT_TDX_H #define _X86_VIRT_TDX_H +#include + /* * This file contains both macros and data structures defined by the TDX * architecture and Linux defined software data structures and functions. @@ -12,4 +14,12 @@ /* MSR to report KeyID partitioning between MKTME and TDX */ #define MSR_IA32_MKTME_KEYID_PARTITIONING 0x00000087 +/* + * Do not put any hardware-defined TDX structure representations below + * this comment! + */ + +struct tdx_module_output; +u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, + struct tdx_module_output *out); #endif diff --git a/arch/x86/virt/vmx/tdx/tdxcall.S b/arch/x86/virt/vmx/tdx/tdxcall.S index 49a54356ae99..757b0c34be10 100644 --- a/arch/x86/virt/vmx/tdx/tdxcall.S +++ b/arch/x86/virt/vmx/tdx/tdxcall.S @@ -1,6 +1,7 @@ /* SPDX-License-Identifier: GPL-2.0 */ #include #include +#include /* * TDCALL and SEAMCALL are supported in Binutils >= 2.36. @@ -45,6 +46,7 @@ /* Leave input param 2 in RDX */ .if \host +1: seamcall /* * SEAMCALL instruction is essentially a VMExit from VMX root @@ -57,10 +59,23 @@ * This value will never be used as actual SEAMCALL error code as * it is from the Reserved status code class. */ - jnc .Lno_vmfailinvalid + jnc .Lseamcall_out mov $TDX_SEAMCALL_VMFAILINVALID, %rax -.Lno_vmfailinvalid: + jmp .Lseamcall_out +2: + /* + * SEAMCALL caused #GP or #UD. By reaching here %eax contains + * the trap number. Convert the trap number to the TDX error + * code by setting TDX_SW_ERROR to the high 32-bits of %rax. + * + * Note cannot OR TDX_SW_ERROR directly to %rax as OR instruction + * only accepts 32-bit immediate at most. + */ + mov $TDX_SW_ERROR, %r12 + orq %r12, %rax + _ASM_EXTABLE_FAULT(1b, 2b) +.Lseamcall_out: .else tdcall .endif From patchwork Fri Dec 9 06:52:27 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13069318 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 60E7AC10F1B for ; Fri, 9 Dec 2022 06:54:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230017AbiLIGyV (ORCPT ); Fri, 9 Dec 2022 01:54:21 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39904 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229834AbiLIGyC (ORCPT ); Fri, 9 Dec 2022 01:54:02 -0500 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A5D6045A3E; Thu, 8 Dec 2022 22:53:31 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1670568811; x=1702104811; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=2JieH49o0RwYeN8IspHMe+dJXCb9JWOWGaNKYRxbRaQ=; b=PCH1pfjXzuXN4VW/gPS1ptQpD/VlQsYpEsBQ8TQcAA6ZyPK7XCMXFYiZ 7JwmVsz6mSJEor/sOq+QJdP7iWV/si0rZ8AL3xe5tTKZrAK0uRDzB+qP3 flNldJJPNj2phAoBHhb+n/FNvI93X6x/FMm4oFGJNLAuEDsjw5+HjFRDD hT1CM2AnOamypQW+ol7nDn825D1tvLKm+COmaEJc7cgyDeFlmYQw/yF0B j8MewYvLv/krX6CcVRG0hzXqvEneu+IfJd6+iAtYMA4iXA7zX5bTtXD7+ QGzExgsv5WWcNNz6gES1PV2J2rf2L+HGNIvypdR8wDSy+8mZlw5scTvyN Q==; X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="318551339" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="318551339" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:31 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="679836910" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="679836910" Received: from omiramon-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.212.28.82]) by orsmga001-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:26 -0800 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, dave.hansen@intel.com, peterz@infradead.org, tglx@linutronix.de, seanjc@google.com, pbonzini@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, kirill.shutemov@linux.intel.com, ying.huang@intel.com, reinette.chatre@intel.com, len.brown@intel.com, tony.luck@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v8 06/16] x86/virt/tdx: Get information about TDX module and TDX-capable memory Date: Fri, 9 Dec 2022 19:52:27 +1300 Message-Id: <7c21a3de810397901bade0b1021912bbbf2d18bd.1670566861.git.kai.huang@intel.com> X-Mailer: git-send-email 2.38.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Start to transit out the "multi-steps" of initializing the TDX module as listed in the skeleton infrastructure. Do the first step to get the TDX module information and the TDX-capable memory regions. TDX provides increased levels of memory confidentiality and integrity. This requires special hardware support for features like memory encryption and storage of memory integrity checksums. Not all memory satisfies these requirements. As a result, TDX introduced the concept of a "Convertible Memory Region" (CMR). During boot, the firmware builds a list of all of the memory ranges which can provide the TDX security guarantees. CMRs tell the kernel which memory is TDX compatible. The kernel takes CMRs (plus a little more metadata) and constructs "TD Memory Regions" (TDMRs). TDMRs let the kernel grant TDX protections to some or all of the CMR areas. The TDX module information tells the kernel TDX module properties such as metadata entry size, the maximum number of TDMRs, and the maximum number of reserved areas per TDMR that the module allows, etc. The list of CMRs, along with the TDX module information, is available to the kernel by querying the TDX module. For now, both the TDX module information and CMRs are only used during the module initialization, so declare them as local. However, they are 1024 bytes and 512 bytes respectively. Putting them to the stack exceeds the default "stack frame size" that the kernel assumes as safe, and the compiler yields a warning about this. Add a kernel build flag to extend the safe stack size to 4K for tdx.c to silence the warning -- the initialization function is only called once so it's safe to have a 4K stack. Note not all members in the 1024 bytes TDX module information are used (even by the KVM). Reviewed-by: Isaku Yamahata Signed-off-by: Kai Huang --- v7 -> v8: (Dave) - Improved changelog to tell this is the first patch to transit out the "multi-steps" init_tdx_module(). - Removed all CMR check/trim code but to depend on later SEAMCALL. - Variable 'vertical alignment' in print TDX module information. - Added DECLARE_PADDED_STRUCT() for padded structure. - Made tdx_sysinfo and tdx_cmr_array[] to be function local variable (and rename them accordingly), and added -Wframe-larger-than=4096 flag to silence the build warning. v6 -> v7: - Simplified the check of CMRs due to the fact that TDX actually verifies CMRs (that are passed by the BIOS) before enabling TDX. - Changed the function name from check_cmrs() -> trim_empty_cmrs(). - Added CMR page aligned check so that later patch can just get the PFN using ">> PAGE_SHIFT". v5 -> v6: - Added to also print TDX module's attribute (Isaku). - Removed all arguments in tdx_gete_sysinfo() to use static variables of 'tdx_sysinfo' and 'tdx_cmr_array' directly as they are all used directly in other functions in later patches. - Added Isaku's Reviewed-by. - v3 -> v5 (no feedback on v4): - Renamed sanitize_cmrs() to check_cmrs(). - Removed unnecessary sanity check against tdx_sysinfo and tdx_cmr_array actual size returned by TDH.SYS.INFO. - Changed -EFAULT to -EINVAL in couple places. - Added comments around tdx_sysinfo and tdx_cmr_array saying they are used by TDH.SYS.INFO ABI. - Changed to pass 'tdx_sysinfo' and 'tdx_cmr_array' as function arguments in tdx_get_sysinfo(). - Changed to only print BIOS-CMR when check_cmrs() fails. --- arch/x86/virt/vmx/tdx/Makefile | 1 + arch/x86/virt/vmx/tdx/tdx.c | 85 ++++++++++++++++++++++++++++++++-- arch/x86/virt/vmx/tdx/tdx.h | 76 ++++++++++++++++++++++++++++++ 3 files changed, 157 insertions(+), 5 deletions(-) diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile index 38d534f2c113..f8a40d15fdfc 100644 --- a/arch/x86/virt/vmx/tdx/Makefile +++ b/arch/x86/virt/vmx/tdx/Makefile @@ -1,2 +1,3 @@ # SPDX-License-Identifier: GPL-2.0-only +CFLAGS_tdx.o += -Wframe-larger-than=4096 obj-y += tdx.o seamcall.o diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index b7cedf0589db..6fe505c32599 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include "tdx.h" @@ -107,9 +108,8 @@ bool platform_tdx_enabled(void) * leaf function return code and the additional output respectively if * not NULL. */ -static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, - u64 *seamcall_ret, - struct tdx_module_output *out) +static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, + u64 *seamcall_ret, struct tdx_module_output *out) { u64 sret; @@ -150,12 +150,85 @@ static int __always_unused seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9, } } +static inline bool is_cmr_empty(struct cmr_info *cmr) +{ + return !cmr->size; +} + +static void print_cmrs(struct cmr_info *cmr_array, int nr_cmrs) +{ + int i; + + for (i = 0; i < nr_cmrs; i++) { + struct cmr_info *cmr = &cmr_array[i]; + + /* + * The array of CMRs reported via TDH.SYS.INFO can + * contain tail empty CMRs. Don't print them. + */ + if (is_cmr_empty(cmr)) + break; + + pr_info("CMR: [0x%llx, 0x%llx)\n", cmr->base, + cmr->base + cmr->size); + } +} + +/* + * Get the TDX module information (TDSYSINFO_STRUCT) and the array of + * CMRs, and save them to @sysinfo and @cmr_array, which come from the + * kernel stack. @sysinfo must have been padded to have enough room + * to save the TDSYSINFO_STRUCT. + */ +static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo, + struct cmr_info *cmr_array) +{ + struct tdx_module_output out; + u64 sysinfo_pa, cmr_array_pa; + int ret; + + /* + * Cannot use __pa() directly as @sysinfo and @cmr_array + * come from the kernel stack. + */ + sysinfo_pa = slow_virt_to_phys(sysinfo); + cmr_array_pa = slow_virt_to_phys(cmr_array); + ret = seamcall(TDH_SYS_INFO, sysinfo_pa, TDSYSINFO_STRUCT_SIZE, + cmr_array_pa, MAX_CMRS, NULL, &out); + if (ret) + return ret; + + pr_info("TDX module: atributes 0x%x, vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u", + sysinfo->attributes, sysinfo->vendor_id, + sysinfo->major_version, sysinfo->minor_version, + sysinfo->build_date, sysinfo->build_num); + + /* R9 contains the actual entries written to the CMR array. */ + print_cmrs(cmr_array, out.r9); + + return 0; +} + static int init_tdx_module(void) { + /* + * @tdsysinfo and @cmr_array are used in TDH.SYS.INFO SEAMCALL ABI. + * They are 1024 bytes and 512 bytes respectively but it's fine to + * keep them in the stack as this function is only called once. + */ + DECLARE_PADDED_STRUCT(tdsysinfo_struct, tdsysinfo, + TDSYSINFO_STRUCT_SIZE, TDSYSINFO_STRUCT_ALIGNMENT); + struct cmr_info cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT); + struct tdsysinfo_struct *sysinfo = &PADDED_STRUCT(tdsysinfo); + int ret; + + ret = tdx_get_sysinfo(sysinfo, cmr_array); + if (ret) + goto out; + /* * TODO: * - * - Get TDX module information and TDX-capable memory regions. * - Build the list of TDX-usable memory regions. * - Construct a list of TDMRs to cover all TDX-usable memory * regions. @@ -166,7 +239,9 @@ static int init_tdx_module(void) * * Return error before all steps are done. */ - return -EINVAL; + ret = -EINVAL; +out: + return ret; } static int __tdx_enable(void) diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h index 884357a4133c..6d32f62e4182 100644 --- a/arch/x86/virt/vmx/tdx/tdx.h +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -3,6 +3,8 @@ #define _X86_VIRT_TDX_H #include +#include +#include /* * This file contains both macros and data structures defined by the TDX @@ -14,6 +16,80 @@ /* MSR to report KeyID partitioning between MKTME and TDX */ #define MSR_IA32_MKTME_KEYID_PARTITIONING 0x00000087 +/* + * TDX module SEAMCALL leaf functions + */ +#define TDH_SYS_INFO 32 + +struct cmr_info { + u64 base; + u64 size; +} __packed; + +#define MAX_CMRS 32 +#define CMR_INFO_ARRAY_ALIGNMENT 512 + +struct cpuid_config { + u32 leaf; + u32 sub_leaf; + u32 eax; + u32 ebx; + u32 ecx; + u32 edx; +} __packed; + +#define DECLARE_PADDED_STRUCT(type, name, size, alignment) \ + struct type##_padded { \ + union { \ + struct type name; \ + u8 padding[size]; \ + }; \ + } name##_padded __aligned(alignment) + +#define PADDED_STRUCT(name) (name##_padded.name) + +#define TDSYSINFO_STRUCT_SIZE 1024 +#define TDSYSINFO_STRUCT_ALIGNMENT 1024 + +/* + * The size of this structure itself is flexible. The actual structure + * passed to TDH.SYS.INFO must be padded to TDSYSINFO_STRUCT_SIZE and be + * aligned to TDSYSINFO_STRUCT_ALIGNMENT using DECLARE_PADDED_STRUCT(). + */ +struct tdsysinfo_struct { + /* TDX-SEAM Module Info */ + u32 attributes; + u32 vendor_id; + u32 build_date; + u16 build_num; + u16 minor_version; + u16 major_version; + u8 reserved0[14]; + /* Memory Info */ + u16 max_tdmrs; + u16 max_reserved_per_tdmr; + u16 pamt_entry_size; + u8 reserved1[10]; + /* Control Struct Info */ + u16 tdcs_base_size; + u8 reserved2[2]; + u16 tdvps_base_size; + u8 tdvps_xfam_dependent_size; + u8 reserved3[9]; + /* TD Capabilities */ + u64 attributes_fixed0; + u64 attributes_fixed1; + u64 xfam_fixed0; + u64 xfam_fixed1; + u8 reserved4[32]; + u32 num_cpuid_config; + /* + * The actual number of CPUID_CONFIG depends on above + * 'num_cpuid_config'. + */ + DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs); +} __packed; + /* * Do not put any hardware-defined TDX structure representations below * this comment! From patchwork Fri Dec 9 06:52:28 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13069319 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2E42EC4332F for ; Fri, 9 Dec 2022 06:54:51 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229982AbiLIGys (ORCPT ); Fri, 9 Dec 2022 01:54:48 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40862 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229661AbiLIGyO (ORCPT ); Fri, 9 Dec 2022 01:54:14 -0500 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B821A7D06F; Thu, 8 Dec 2022 22:53:36 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1670568816; x=1702104816; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=tvIf0B+prIZRtNQbaZDnuRSnc2zmlZb3aueyjPdO5ho=; b=UChMu3oIWJKYEJ8FtfnytYPf6Egv1/LEpbLz4Rwr33cJj+yjSHKgOjqq QbVUUIFghoyIpL1XUF4j7w+1jBRAFjiVsrDddSTFY/PrWhhMUnw7Y3EV0 p/MARbunectKUZrjy2jDuKkoRuGDpAcMZ2x57BNUiiSsEUMu53mmGf4X+ Rzdvo+V4z09PSCe1xhuH1PTZdSZ0twz0b9XUbTQ0p6QuSctVqrM7bT+Sg 1nfxGqKyqIVbQkcqLiHKrujo49nHtYfYPBNUp44yf7QsNCbe7MbZGWTUV 1TYGXSAGJD25QKBqKn96fXMJCyL3bgo5QMkQ85gpyYvNz0BrrCmEXTgBo Q==; X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="318551352" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="318551352" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:36 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="679836937" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="679836937" Received: from omiramon-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.212.28.82]) by orsmga001-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:31 -0800 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, dave.hansen@intel.com, peterz@infradead.org, tglx@linutronix.de, seanjc@google.com, pbonzini@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, kirill.shutemov@linux.intel.com, ying.huang@intel.com, reinette.chatre@intel.com, len.brown@intel.com, tony.luck@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v8 07/16] x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory Date: Fri, 9 Dec 2022 19:52:28 +1300 Message-Id: <8aab33a7db7a408beb403950e21f693b0b0f1f2b.1670566861.git.kai.huang@intel.com> X-Mailer: git-send-email 2.38.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org As a step of initializing the TDX module, the kernel needs to tell the TDX module which memory regions can be used by the TDX module as TDX guest memory. TDX reports a list of "Convertible Memory Region" (CMR) to tell the kernel which memory is TDX compatible. The kernel needs to build a list of memory regions (out of CMRs) as "TDX-usable" memory and pass them to the TDX module. Once this is done, those "TDX-usable" memory regions are fixed during module's lifetime. The initial support of TDX guests will only allocate TDX guest memory from the global page allocator. To keep things simple, just make sure all pages in the page allocator are TDX memory. To guarantee that, stash off the memblock memory regions at the time of initializing the TDX module as TDX's own usable memory regions, and in the meantime, register a TDX memory notifier to reject to online any new memory in memory hotplug. This approach works as in practice all boot-time present DIMMs are TDX convertible memory. However, if any non-TDX-convertible memory has been hot-added (i.e. CXL memory via kmem driver) before initializing the TDX module, the module initialization will fail. This can also be enhanced in the future, i.e. by allowing adding non-TDX memory to a separate NUMA node. In this case, the "TDX-capable" nodes and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace needs to guarantee memory pages for TDX guests are always allocated from the "TDX-capable" nodes. Signed-off-by: Kai Huang --- v7 -> v8: - Trimed down changelog (Dave). - Changed to use PHYS_PFN() and PFN_PHYS() throughout this series (Ying). - Moved memory hotplug handling from add_arch_memory() to memory_notifier (Dan/David). - Removed 'nid' from 'struct tdx_memblock' to later patch (Dave). - {build|free}_tdx_memory() -> {build|}free_tdx_memlist() (Dave). - Removed pfn_covered_by_cmr() check as no code to trim CMRs now. - Improve the comment around first 1MB (Dave). - Added a comment around reserve_real_mode() to point out TDX code relies on first 1MB being reserved (Ying). - Added comment to explain why the new online memory range cannot cross multiple TDX memory blocks (Dave). - Improved other comments (Dave). --- arch/x86/Kconfig | 1 + arch/x86/kernel/setup.c | 2 + arch/x86/virt/vmx/tdx/tdx.c | 160 +++++++++++++++++++++++++++++++++++- 3 files changed, 162 insertions(+), 1 deletion(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index dd333b46fafb..b36129183035 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST depends on X86_64 depends on KVM_INTEL depends on X86_X2APIC + select ARCH_KEEP_MEMBLOCK help Intel Trust Domain Extensions (TDX) protects guest VMs from malicious host and certain physical attacks. This option enables necessary TDX diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 216fee7144ee..3a841a77fda4 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -1174,6 +1174,8 @@ void __init setup_arch(char **cmdline_p) * * Moreover, on machines with SandyBridge graphics or in setups that use * crashkernel the entire 1M is reserved anyway. + * + * Note the host kernel TDX also requires the first 1MB being reserved. */ reserve_real_mode(); diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 6fe505c32599..f010402f443d 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -13,6 +13,13 @@ #include #include #include +#include +#include +#include +#include +#include +#include +#include #include #include #include @@ -25,6 +32,12 @@ enum tdx_module_status_t { TDX_MODULE_ERROR }; +struct tdx_memblock { + struct list_head list; + unsigned long start_pfn; + unsigned long end_pfn; +}; + static u32 tdx_keyid_start __ro_after_init; static u32 nr_tdx_keyids __ro_after_init; @@ -32,6 +45,9 @@ static enum tdx_module_status_t tdx_module_status; /* Prevent concurrent attempts on TDX detection and initialization */ static DEFINE_MUTEX(tdx_module_lock); +/* All TDX-usable memory regions */ +static LIST_HEAD(tdx_memlist); + /* * tdx_keyid_start and nr_tdx_keyids indicate that TDX is uninitialized. * This is used in TDX initialization error paths to take it from @@ -69,6 +85,50 @@ static int __init record_keyid_partitioning(void) return 0; } +static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn) +{ + struct tdx_memblock *tmb; + + /* Empty list means TDX isn't enabled. */ + if (list_empty(&tdx_memlist)) + return true; + + list_for_each_entry(tmb, &tdx_memlist, list) { + /* + * The new range is TDX memory if it is fully covered by + * any TDX memory block. + * + * Note TDX memory blocks are originated from memblock + * memory regions, which can only be contiguous when two + * regions have different NUMA nodes or flags. Therefore + * the new range cannot cross multiple TDX memory blocks. + */ + if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn) + return true; + } + return false; +} + +static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action, + void *v) +{ + struct memory_notify *mn = v; + + if (action != MEM_GOING_ONLINE) + return NOTIFY_OK; + + /* + * Not all memory is compatible with TDX. Reject + * to online any incompatible memory. + */ + return is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages) ? + NOTIFY_OK : NOTIFY_BAD; +} + +static struct notifier_block tdx_memory_nb = { + .notifier_call = tdx_memory_notifier, +}; + static int __init tdx_init(void) { int err; @@ -89,6 +149,13 @@ static int __init tdx_init(void) goto no_tdx; } + err = register_memory_notifier(&tdx_memory_nb); + if (err) { + pr_info("initialization failed: register_memory_notifier() failed (%d)\n", + err); + goto no_tdx; + } + return 0; no_tdx: clear_tdx(); @@ -209,6 +276,77 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo, return 0; } +/* + * Add a memory region as a TDX memory block. The caller must make sure + * all memory regions are added in address ascending order and don't + * overlap. + */ +static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn, + unsigned long end_pfn) +{ + struct tdx_memblock *tmb; + + tmb = kmalloc(sizeof(*tmb), GFP_KERNEL); + if (!tmb) + return -ENOMEM; + + INIT_LIST_HEAD(&tmb->list); + tmb->start_pfn = start_pfn; + tmb->end_pfn = end_pfn; + + list_add_tail(&tmb->list, tmb_list); + return 0; +} + +static void free_tdx_memlist(struct list_head *tmb_list) +{ + while (!list_empty(tmb_list)) { + struct tdx_memblock *tmb = list_first_entry(tmb_list, + struct tdx_memblock, list); + + list_del(&tmb->list); + kfree(tmb); + } +} + +/* + * Ensure that all memblock memory regions are convertible to TDX + * memory. Once this has been established, stash the memblock + * ranges off in a secondary structure because memblock is modified + * in memory hotplug while TDX memory regions are fixed. + */ +static int build_tdx_memlist(struct list_head *tmb_list) +{ + unsigned long start_pfn, end_pfn; + int i, ret; + + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) { + /* + * The first 1MB is not reported as TDX convertible memory. + * Although the first 1MB is always reserved and won't end up + * to the page allocator, it is still in memblock's memory + * regions. Skip them manually to exclude them as TDX memory. + */ + start_pfn = max(start_pfn, PHYS_PFN(SZ_1M)); + if (start_pfn >= end_pfn) + continue; + + /* + * Add the memory regions as TDX memory. The regions in + * memblock has already guaranteed they are in address + * ascending order and don't overlap. + */ + ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn); + if (ret) + goto err; + } + + return 0; +err: + free_tdx_memlist(tmb_list); + return ret; +} + static int init_tdx_module(void) { /* @@ -226,10 +364,25 @@ static int init_tdx_module(void) if (ret) goto out; + /* + * The initial support of TDX guests only allocates memory from + * the global page allocator. To keep things simple, just make + * sure all pages in the page allocator are TDX memory. + * + * Build the list of "TDX-usable" memory regions which cover all + * pages in the page allocator to guarantee that. Do it while + * holding mem_hotplug_lock read-lock as the memory hotplug code + * path reads the @tdx_memlist to reject any new memory. + */ + get_online_mems(); + + ret = build_tdx_memlist(&tdx_memlist); + if (ret) + goto out; + /* * TODO: * - * - Build the list of TDX-usable memory regions. * - Construct a list of TDMRs to cover all TDX-usable memory * regions. * - Pick up one TDX private KeyID as the global KeyID. @@ -241,6 +394,11 @@ static int init_tdx_module(void) */ ret = -EINVAL; out: + /* + * @tdx_memlist is written here and read at memory hotplug time. + * Lock out memory hotplug code while building it. + */ + put_online_mems(); return ret; } From patchwork Fri Dec 9 06:52:29 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13069320 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7C306C4332F for ; Fri, 9 Dec 2022 06:55:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229988AbiLIGz2 (ORCPT ); Fri, 9 Dec 2022 01:55:28 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40878 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229769AbiLIGyr (ORCPT ); Fri, 9 Dec 2022 01:54:47 -0500 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E2241A65A1; Thu, 8 Dec 2022 22:53:45 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1670568825; x=1702104825; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=D+jCp1tbjtRLXfVHa9PahUAAUr8LxwkuzJ2V9nLXOCM=; b=EYBXwhwU4Yaw/NOUmsWzkLgyu6a5VqthJLougpNBbrs68LxuoU7PCRpk OAuKPTpRauLg7D8VwQTbldG+EZ+luFBUZSYmkKFeg2Eg7stbAcCT51O89 iQdChA4GkXaDvy7UHyk4ty9RiVn6njAZA4Y0CdjACDzw4VGgYjIt9Fbf+ fjxt6pejqVJ17RGb74FduLQPajFm8Y+3Lq8cTLpP6sf+VRX5R1f7scN2D 1xwF4wVQsuyIYqNYTTeTKDJd5QKtSAWgulKULvWz8lgPEqn1s8R2YIcWC tLotj4wiZhMW46dacRHagLBjrjNEehr3jVbeHFHdZ/knL3zppkuqwVl+Z w==; X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="318551372" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="318551372" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:41 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="679836973" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="679836973" Received: from omiramon-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.212.28.82]) by orsmga001-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:36 -0800 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, dave.hansen@intel.com, peterz@infradead.org, tglx@linutronix.de, seanjc@google.com, pbonzini@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, kirill.shutemov@linux.intel.com, ying.huang@intel.com, reinette.chatre@intel.com, len.brown@intel.com, tony.luck@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v8 08/16] x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions Date: Fri, 9 Dec 2022 19:52:29 +1300 Message-Id: X-Mailer: git-send-email 2.38.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org After the kernel selects all TDX-usable memory regions, the kernel needs to pass those regions to the TDX module via data structure "TD Memory Region" (TDMR). Add a placeholder to construct a list of TDMRs (in multiple steps) to cover all TDX-usable memory regions. === Long Version === TDX provides increased levels of memory confidentiality and integrity. This requires special hardware support for features like memory encryption and storage of memory integrity checksums. Not all memory satisfies these requirements. As a result, TDX introduced the concept of a "Convertible Memory Region" (CMR). During boot, the firmware builds a list of all of the memory ranges which can provide the TDX security guarantees. The list of these ranges is available to the kernel by querying the TDX module. The TDX architecture needs additional metadata to record things like which TD guest "owns" a given page of memory. This metadata essentially serves as the 'struct page' for the TDX module. The space for this metadata is not reserved by the hardware up front and must be allocated by the kernel and given to the TDX module. Since this metadata consumes space, the VMM can choose whether or not to allocate it for a given area of convertible memory. If it chooses not to, the memory cannot receive TDX protections and can not be used by TDX guests as private memory. For every memory region that the VMM wants to use as TDX memory, it sets up a "TD Memory Region" (TDMR). Each TDMR represents a physically contiguous convertible range and must also have its own physically contiguous metadata table, referred to as a Physical Address Metadata Table (PAMT), to track status for each page in the TDMR range. Unlike a CMR, each TDMR requires 1G granularity and alignment. To support physical RAM areas that don't meet those strict requirements, each TDMR permits a number of internal "reserved areas" which can be placed over memory holes. If PAMT metadata is placed within a TDMR it must be covered by one of these reserved areas. Let's summarize the concepts: CMR - Firmware-enumerated physical ranges that support TDX. CMRs are 4K aligned. TDMR - Physical address range which is chosen by the kernel to support TDX. 1G granularity and alignment required. Each TDMR has reserved areas where TDX memory holes and overlapping PAMTs can be represented. PAMT - Physically contiguous TDX metadata. One table for each page size per TDMR. Roughly 1/256th of TDMR in size. 256G TDMR = ~1G PAMT. As one step of initializing the TDX module, the kernel configures TDX-usable memory regions by passing a list of TDMRs to the TDX module. Constructing the list of TDMRs consists below steps: 1) Fill out TDMRs to cover all memory regions that the TDX module will use for TD memory. 2) Allocate and set up PAMT for each TDMR. 3) Designate reserved areas for each TDMR. Add a placeholder to construct TDMRs to do the above steps. Always free the space for TDMRs at the end of the module initialization (no matter successful or not) as TDMRs are only used during the initialization. Reviewed-by: Isaku Yamahata Signed-off-by: Kai Huang --- v7 -> v8: - Improved changelog to tell this is one step of "TODO list" in init_tdx_module(). - Other changelog improvement suggested by Dave (with "Create TDMRs" to "Fill out TDMRs" to align with the code). - Added a "TODO list" comment to lay out the steps to construct TDMRs, following the same idea of "TODO list" in tdx_module_init(). - Introduced 'struct tdmr_info_list' (Dave) - Further added additional members (tdmr_sz/max_tdmrs/nr_tdmrs) to simplify getting TDMR by given index, and reduce passing arguments around functions. - Added alloc_tdmr_list()/free_tdmr_list() accordingly, which internally uses tdmr_size_single() (Dave). - tdmr_num -> nr_tdmrs (Dave). v6 -> v7: - Improved commit message to explain 'int' overflow cannot happen in cal_tdmr_size() and alloc_tdmr_array(). -- Andy/Dave. v5 -> v6: - construct_tdmrs_memblock() -> construct_tdmrs() as 'tdx_memblock' is used instead of memblock. - Added Isaku's Reviewed-by. - v3 -> v5 (no feedback on v4): - Moved calculating TDMR size to this patch. - Changed to use alloc_pages_exact() to allocate buffer for all TDMRs once, instead of allocating each TDMR individually. - Removed "crypto protection" in the changelog. - -EFAULT -> -EINVAL in couple of places. --- arch/x86/virt/vmx/tdx/tdx.c | 104 +++++++++++++++++++++++++++++++++++- arch/x86/virt/vmx/tdx/tdx.h | 23 ++++++++ 2 files changed, 125 insertions(+), 2 deletions(-) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index f010402f443d..d36ac72ef299 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -347,6 +348,86 @@ static int build_tdx_memlist(struct list_head *tmb_list) return ret; } +struct tdmr_info_list { + struct tdmr_info *first_tdmr; + int tdmr_sz; + int max_tdmrs; + int nr_tdmrs; /* Actual number of TDMRs */ +}; + +/* Calculate the actual TDMR size */ +static int tdmr_size_single(u16 max_reserved_per_tdmr) +{ + int tdmr_sz; + + /* + * The actual size of TDMR depends on the maximum + * number of reserved areas. + */ + tdmr_sz = sizeof(struct tdmr_info); + tdmr_sz += sizeof(struct tdmr_reserved_area) * max_reserved_per_tdmr; + + return ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT); +} + +static int alloc_tdmr_list(struct tdmr_info_list *tdmr_list, + struct tdsysinfo_struct *sysinfo) +{ + size_t tdmr_sz, tdmr_array_sz; + void *tdmr_array; + + tdmr_sz = tdmr_size_single(sysinfo->max_reserved_per_tdmr); + tdmr_array_sz = tdmr_sz * sysinfo->max_tdmrs; + + /* + * To keep things simple, allocate all TDMRs together. + * The buffer needs to be physically contiguous to make + * sure each TDMR is physically contiguous. + */ + tdmr_array = alloc_pages_exact(tdmr_array_sz, + GFP_KERNEL | __GFP_ZERO); + if (!tdmr_array) + return -ENOMEM; + + tdmr_list->first_tdmr = tdmr_array; + /* + * Keep the size of TDMR to find the target TDMR + * at a given index in the TDMR list. + */ + tdmr_list->tdmr_sz = tdmr_sz; + tdmr_list->max_tdmrs = sysinfo->max_tdmrs; + tdmr_list->nr_tdmrs = 0; + + return 0; +} + +static void free_tdmr_list(struct tdmr_info_list *tdmr_list) +{ + free_pages_exact(tdmr_list->first_tdmr, + tdmr_list->max_tdmrs * tdmr_list->tdmr_sz); +} + +/* + * Construct a list of TDMRs on the preallocated space in @tdmr_list + * to cover all TDX memory regions in @tmb_list based on the TDX module + * information in @sysinfo. + */ +static int construct_tdmrs(struct list_head *tmb_list, + struct tdmr_info_list *tdmr_list, + struct tdsysinfo_struct *sysinfo) +{ + /* + * TODO: + * + * - Fill out TDMRs to cover all TDX memory regions. + * - Allocate and set up PAMTs for each TDMR. + * - Designate reserved areas for each TDMR. + * + * Return -EINVAL until constructing TDMRs is done + */ + return -EINVAL; +} + static int init_tdx_module(void) { /* @@ -358,6 +439,7 @@ static int init_tdx_module(void) TDSYSINFO_STRUCT_SIZE, TDSYSINFO_STRUCT_ALIGNMENT); struct cmr_info cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT); struct tdsysinfo_struct *sysinfo = &PADDED_STRUCT(tdsysinfo); + struct tdmr_info_list tdmr_list; int ret; ret = tdx_get_sysinfo(sysinfo, cmr_array); @@ -380,11 +462,19 @@ static int init_tdx_module(void) if (ret) goto out; + /* Allocate enough space for constructing TDMRs */ + ret = alloc_tdmr_list(&tdmr_list, sysinfo); + if (ret) + goto out_free_tdx_mem; + + /* Cover all TDX-usable memory regions in TDMRs */ + ret = construct_tdmrs(&tdx_memlist, &tdmr_list, sysinfo); + if (ret) + goto out_free_tdmrs; + /* * TODO: * - * - Construct a list of TDMRs to cover all TDX-usable memory - * regions. * - Pick up one TDX private KeyID as the global KeyID. * - Configure the TDMRs and the global KeyID to the TDX module. * - Configure the global KeyID on all packages. @@ -393,6 +483,16 @@ static int init_tdx_module(void) * Return error before all steps are done. */ ret = -EINVAL; +out_free_tdmrs: + /* + * Free the space for the TDMRs no matter the initialization is + * successful or not. They are not needed anymore after the + * module initialization. + */ + free_tdmr_list(&tdmr_list); +out_free_tdx_mem: + if (ret) + free_tdx_memlist(&tdx_memlist); out: /* * @tdx_memlist is written here and read at memory hotplug time. diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h index 6d32f62e4182..d0c762f1a94c 100644 --- a/arch/x86/virt/vmx/tdx/tdx.h +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -90,6 +90,29 @@ struct tdsysinfo_struct { DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs); } __packed; +struct tdmr_reserved_area { + u64 offset; + u64 size; +} __packed; + +#define TDMR_INFO_ALIGNMENT 512 + +struct tdmr_info { + u64 base; + u64 size; + u64 pamt_1g_base; + u64 pamt_1g_size; + u64 pamt_2m_base; + u64 pamt_2m_size; + u64 pamt_4k_base; + u64 pamt_4k_size; + /* + * Actual number of reserved areas depends on + * 'struct tdsysinfo_struct'::max_reserved_per_tdmr. + */ + DECLARE_FLEX_ARRAY(struct tdmr_reserved_area, reserved_areas); +} __packed __aligned(TDMR_INFO_ALIGNMENT); + /* * Do not put any hardware-defined TDX structure representations below * this comment! From patchwork Fri Dec 9 06:52:30 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13069321 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 14F29C4332F for ; Fri, 9 Dec 2022 06:55:35 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230049AbiLIGzd (ORCPT ); Fri, 9 Dec 2022 01:55:33 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39896 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229901AbiLIGyu (ORCPT ); Fri, 9 Dec 2022 01:54:50 -0500 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CCDA998E9B; Thu, 8 Dec 2022 22:53:47 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1670568827; x=1702104827; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=576NLGHg2IB8blsMFnuFfzPk7D8+hfcrsAlrHJGxFxU=; b=GIZEg85zozi5BV9VFL/+OyypjANUlbwIC2+NDSybb4rrVC/RrUh+TMGi G8V88McS1A1UZzUBwjMLfJwQy/ZBbA54cGwjOcelA5/mMSMqDx1cDI9QV fd91W9nO117s8MusSSG0qDlE713QCj+9Xp013MUQOCDRxI/BwwPwAWKD4 4XQ5T0YasUD5WPz774BXYF3hJnmzG06qCPZVcADY25TTn8/ANuWjfEAkb 0ygpe/TEiriepggbdQ6F0k5OXLbcj6WAHomZxCVTR9ipC0hiJuW7MooMt w4HN/3k+f0v/myyYbuJWJ0xFel+sCd3PWpEXl/sWjC9xV0+HFDhBzUCCQ Q==; X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="318551395" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="318551395" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:46 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="679836990" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="679836990" Received: from omiramon-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.212.28.82]) by orsmga001-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:41 -0800 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, dave.hansen@intel.com, peterz@infradead.org, tglx@linutronix.de, seanjc@google.com, pbonzini@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, kirill.shutemov@linux.intel.com, ying.huang@intel.com, reinette.chatre@intel.com, len.brown@intel.com, tony.luck@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v8 09/16] x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions Date: Fri, 9 Dec 2022 19:52:30 +1300 Message-Id: <6f9c0bc1074501fa2431bde73bdea57279bf0085.1670566861.git.kai.huang@intel.com> X-Mailer: git-send-email 2.38.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Start to transit out the "multi-steps" to construct a list of "TD Memory Regions" (TDMRs) to cover all TDX-usable memory regions. The kernel configures TDX-usable memory regions by passing a list of TDMRs "TD Memory Regions" (TDMRs) to the TDX module. Each TDMR contains the information of the base/size of a memory region, the base/size of the associated Physical Address Metadata Table (PAMT) and a list of reserved areas in the region. Do the first step to fill out a number of TDMRs to cover all TDX memory regions. To keep it simple, always try to use one TDMR for each memory region. As the first step only set up the base/size for each TDMR. Each TDMR must be 1G aligned and the size must be in 1G granularity. This implies that one TDMR could cover multiple memory regions. If a memory region spans the 1GB boundary and the former part is already covered by the previous TDMR, just use a new TDMR for the remaining part. TDX only supports a limited number of TDMRs. Disable TDX if all TDMRs are consumed but there is more memory region to cover. Signed-off-by: Kai Huang --- v7 -> v8: (Dave) - Add a sentence to changelog stating this is the first patch to transit "multi-steps" of constructing TDMRs. - Added a comment to explain "why one TDMR for each memory region" is OK for now. - Trimed down/removed unnecessary comments. - Removed tdmr_start() but use tdmr->base directly - create_tdmrs() -> fill_out_tdmrs() - Other changes due to introducing 'struct tdmr_info_list'. v6 -> v7: - No change. v5 -> v6: - Rebase due to using 'tdx_memblock' instead of memblock. - v3 -> v5 (no feedback on v4): - Removed allocating TDMR individually. - Improved changelog by using Dave's words. - Made TDMR_START() and TDMR_END() as static inline function. --- arch/x86/virt/vmx/tdx/tdx.c | 95 ++++++++++++++++++++++++++++++++++++- 1 file changed, 93 insertions(+), 2 deletions(-) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index d36ac72ef299..5b1de0200c6b 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -407,6 +407,90 @@ static void free_tdmr_list(struct tdmr_info_list *tdmr_list) tdmr_list->max_tdmrs * tdmr_list->tdmr_sz); } +/* Get the TDMR from the list at the given index. */ +static struct tdmr_info *tdmr_entry(struct tdmr_info_list *tdmr_list, + int idx) +{ + return (struct tdmr_info *)((unsigned long)tdmr_list->first_tdmr + + tdmr_list->tdmr_sz * idx); +} + +#define TDMR_ALIGNMENT BIT_ULL(30) +#define TDMR_PFN_ALIGNMENT (TDMR_ALIGNMENT >> PAGE_SHIFT) +#define TDMR_ALIGN_DOWN(_addr) ALIGN_DOWN((_addr), TDMR_ALIGNMENT) +#define TDMR_ALIGN_UP(_addr) ALIGN((_addr), TDMR_ALIGNMENT) + +static inline u64 tdmr_end(struct tdmr_info *tdmr) +{ + return tdmr->base + tdmr->size; +} + +/* + * Take the memory referenced in @tmb_list and populate the + * preallocated @tdmr_list, following all the special alignment + * and size rules for TDMR. + */ +static int fill_out_tdmrs(struct list_head *tmb_list, + struct tdmr_info_list *tdmr_list) +{ + struct tdx_memblock *tmb; + int tdmr_idx = 0; + + /* + * Loop over TDX memory regions and fill out TDMRs to cover them. + * To keep it simple, always try to use one TDMR to cover one + * memory region. + * + * In practice TDX1.0 supports 64 TDMRs, which is big enough to + * cover all memory regions in reality if the admin doesn't use + * 'memmap' to create a bunch of discrete memory regions. When + * there's a real problem, enhancement can be done to merge TDMRs + * to reduce the final number of TDMRs. + */ + list_for_each_entry(tmb, tmb_list, list) { + struct tdmr_info *tdmr = tdmr_entry(tdmr_list, tdmr_idx); + u64 start, end; + + start = TDMR_ALIGN_DOWN(PFN_PHYS(tmb->start_pfn)); + end = TDMR_ALIGN_UP(PFN_PHYS(tmb->end_pfn)); + + /* + * A valid size indicates the current TDMR has already + * been filled out to cover the previous memory region(s). + */ + if (tdmr->size) { + /* + * Loop to the next if the current memory region + * has already been fully covered. + */ + if (end <= tdmr_end(tdmr)) + continue; + + /* Otherwise, skip the already covered part. */ + if (start < tdmr_end(tdmr)) + start = tdmr_end(tdmr); + + /* + * Create a new TDMR to cover the current memory + * region, or the remaining part of it. + */ + tdmr_idx++; + if (tdmr_idx >= tdmr_list->max_tdmrs) + return -E2BIG; + + tdmr = tdmr_entry(tdmr_list, tdmr_idx); + } + + tdmr->base = start; + tdmr->size = end - start; + } + + /* @tdmr_idx is always the index of last valid TDMR. */ + tdmr_list->nr_tdmrs = tdmr_idx + 1; + + return 0; +} + /* * Construct a list of TDMRs on the preallocated space in @tdmr_list * to cover all TDX memory regions in @tmb_list based on the TDX module @@ -416,16 +500,23 @@ static int construct_tdmrs(struct list_head *tmb_list, struct tdmr_info_list *tdmr_list, struct tdsysinfo_struct *sysinfo) { + int ret; + + ret = fill_out_tdmrs(tmb_list, tdmr_list); + if (ret) + goto err; + /* * TODO: * - * - Fill out TDMRs to cover all TDX memory regions. * - Allocate and set up PAMTs for each TDMR. * - Designate reserved areas for each TDMR. * * Return -EINVAL until constructing TDMRs is done */ - return -EINVAL; + ret = -EINVAL; +err: + return ret; } static int init_tdx_module(void) From patchwork Fri Dec 9 06:52:31 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13069322 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 66CC4C4332F for ; Fri, 9 Dec 2022 06:56:11 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229912AbiLIG4K (ORCPT ); Fri, 9 Dec 2022 01:56:10 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40866 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229512AbiLIGz1 (ORCPT ); Fri, 9 Dec 2022 01:55:27 -0500 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EAEBC26ADE; Thu, 8 Dec 2022 22:54:02 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1670568843; x=1702104843; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=HPAlT0aW+bbZfotxgLBpex47Tjhj/8q36otLhqMPWFg=; b=FDark/jS4Gxqf+OxyUdOuwmKz/3ldxX8fmFSZ92PMIIOT84THvDnYJNu vfQvJF9vtTQK8h729X04YNmrTXFFtmRWoHxbeLd8LkRFnHpo8YRdAdCHc diTdpk36A+iUNVklnLCo13ui1MyW9Bhu3ioiY/IJJ6MKAI7dcn7j8THuh jIVEi7P6a42GFkctaiE9LdOS2etGWlCvIX8tbVW40eQMYgsJHoTKr4XWh cxZondvgjqBV9dfBF12OF1tkm78rISRZXtWuooHygrkfAuVb1B+fyVbD5 8d8Nv/rpu12O9Pq6a1sLiL87n6KyGeku1xurSlQWhzr7hTlwVx3anWYX4 Q==; X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="318551416" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="318551416" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:52 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="679837012" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="679837012" Received: from omiramon-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.212.28.82]) by orsmga001-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:46 -0800 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, dave.hansen@intel.com, peterz@infradead.org, tglx@linutronix.de, seanjc@google.com, pbonzini@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, kirill.shutemov@linux.intel.com, ying.huang@intel.com, reinette.chatre@intel.com, len.brown@intel.com, tony.luck@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v8 10/16] x86/virt/tdx: Allocate and set up PAMTs for TDMRs Date: Fri, 9 Dec 2022 19:52:31 +1300 Message-Id: <76a17c574da18c10ad7a4f96e010697aaa5f7c04.1670566861.git.kai.huang@intel.com> X-Mailer: git-send-email 2.38.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org The TDX module uses additional metadata to record things like which guest "owns" a given page of memory. This metadata, referred as Physical Address Metadata Table (PAMT), essentially serves as the 'struct page' for the TDX module. PAMTs are not reserved by hardware up front. They must be allocated by the kernel and then given to the TDX module during module initialization. TDX supports 3 page sizes: 4K, 2M, and 1G. Each "TD Memory Region" (TDMR) has 3 PAMTs to track the 3 supported page sizes. Each PAMT must be a physically contiguous area from a Convertible Memory Region (CMR). However, the PAMTs which track pages in one TDMR do not need to reside within that TDMR but can be anywhere in CMRs. If one PAMT overlaps with any TDMR, the overlapping part must be reported as a reserved area in that particular TDMR. Use alloc_contig_pages() since PAMT must be a physically contiguous area and it may be potentially large (~1/256th of the size of the given TDMR). The downside is alloc_contig_pages() may fail at runtime. One (bad) mitigation is to launch a TDX guest early during system boot to get those PAMTs allocated at early time, but the only way to fix is to add a boot option to allocate or reserve PAMTs during kernel boot. It is imperfect but will be improved on later. TDX only supports a limited number of reserved areas per TDMR to cover both PAMTs and memory holes within the given TDMR. If many PAMTs are allocated within a single TDMR, the reserved areas may not be sufficient to cover all of them. Adopt the following policies when allocating PAMTs for a given TDMR: - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize the total number of reserved areas consumed for PAMTs. - Try to first allocate PAMT from the local node of the TDMR for better NUMA locality. Also dump out how many pages are allocated for PAMTs when the TDX module is initialized successfully. This helps answer the eternal "where did all my memory go?" questions. Reviewed-by: Isaku Yamahata Signed-off-by: Kai Huang Reviewed-by: Dave Hansen --- v7 -> v8: (Dave) - Changelog: - Added a sentence to state PAMT allocation will be improved. - Others suggested by Dave. - Moved 'nid' of 'struct tdx_memblock' to this patch. - Improved comments around tdmr_get_nid(). - WARN_ON_ONCE() -> pr_warn() in tdmr_get_nid(). - Other changes due to 'struct tdmr_info_list'. v6 -> v7: - Changes due to using macros instead of 'enum' for TDX supported page sizes. v5 -> v6: - Rebase due to using 'tdx_memblock' instead of memblock. - 'int pamt_entry_nr' -> 'unsigned long nr_pamt_entries' (Dave/Sagis). - Improved comment around tdmr_get_nid() (Dave). - Improved comment in tdmr_set_up_pamt() around breaking the PAMT into PAMTs for 4K/2M/1G (Dave). - tdmrs_get_pamt_pages() -> tdmrs_count_pamt_pages() (Dave). - v3 -> v5 (no feedback on v4): - Used memblock to get the NUMA node for given TDMR. - Removed tdmr_get_pamt_sz() helper but use open-code instead. - Changed to use 'switch .. case..' for each TDX supported page size in tdmr_get_pamt_sz() (the original __tdmr_get_pamt_sz()). - Added printing out memory used for PAMT allocation when TDX module is initialized successfully. - Explained downside of alloc_contig_pages() in changelog. - Addressed other minor comments. --- arch/x86/Kconfig | 1 + arch/x86/virt/vmx/tdx/tdx.c | 215 +++++++++++++++++++++++++++++++++++- 2 files changed, 211 insertions(+), 5 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index b36129183035..b86a333b860f 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1960,6 +1960,7 @@ config INTEL_TDX_HOST depends on KVM_INTEL depends on X86_X2APIC select ARCH_KEEP_MEMBLOCK + depends on CONTIG_ALLOC help Intel Trust Domain Extensions (TDX) protects guest VMs from malicious host and certain physical attacks. This option enables necessary TDX diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 5b1de0200c6b..cf970a783f1f 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -37,6 +37,7 @@ struct tdx_memblock { struct list_head list; unsigned long start_pfn; unsigned long end_pfn; + int nid; }; static u32 tdx_keyid_start __ro_after_init; @@ -283,7 +284,7 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo, * overlap. */ static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn, - unsigned long end_pfn) + unsigned long end_pfn, int nid) { struct tdx_memblock *tmb; @@ -294,6 +295,7 @@ static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn, INIT_LIST_HEAD(&tmb->list); tmb->start_pfn = start_pfn; tmb->end_pfn = end_pfn; + tmb->nid = nid; list_add_tail(&tmb->list, tmb_list); return 0; @@ -319,9 +321,9 @@ static void free_tdx_memlist(struct list_head *tmb_list) static int build_tdx_memlist(struct list_head *tmb_list) { unsigned long start_pfn, end_pfn; - int i, ret; + int i, nid, ret; - for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) { + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) { /* * The first 1MB is not reported as TDX convertible memory. * Although the first 1MB is always reserved and won't end up @@ -337,7 +339,7 @@ static int build_tdx_memlist(struct list_head *tmb_list) * memblock has already guaranteed they are in address * ascending order and don't overlap. */ - ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn); + ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn, nid); if (ret) goto err; } @@ -491,6 +493,200 @@ static int fill_out_tdmrs(struct list_head *tmb_list, return 0; } +/* + * Calculate PAMT size given a TDMR and a page size. The returned + * PAMT size is always aligned up to 4K page boundary. + */ +static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr, int pgsz, + u16 pamt_entry_size) +{ + unsigned long pamt_sz, nr_pamt_entries; + + switch (pgsz) { + case TDX_PS_4K: + nr_pamt_entries = tdmr->size >> PAGE_SHIFT; + break; + case TDX_PS_2M: + nr_pamt_entries = tdmr->size >> PMD_SHIFT; + break; + case TDX_PS_1G: + nr_pamt_entries = tdmr->size >> PUD_SHIFT; + break; + default: + WARN_ON_ONCE(1); + return 0; + } + + pamt_sz = nr_pamt_entries * pamt_entry_size; + /* TDX requires PAMT size must be 4K aligned */ + pamt_sz = ALIGN(pamt_sz, PAGE_SIZE); + + return pamt_sz; +} + +/* + * Locate a NUMA node which should hold the allocation of the @tdmr + * PAMT. This node will have some memory covered by the TDMR. The + * relative amount of memory covered is not considered. + */ +static int tdmr_get_nid(struct tdmr_info *tdmr, struct list_head *tmb_list) +{ + struct tdx_memblock *tmb; + + /* + * A TDMR must cover at least part of one TMB. That TMB will end + * after the TDMR begins. But, that TMB may have started before + * the TDMR. Find the next 'tmb' that _ends_ after this TDMR + * begins. Ignore 'tmb' start addresses. They are irrelevant. + */ + list_for_each_entry(tmb, tmb_list, list) { + if (tmb->end_pfn > PHYS_PFN(tdmr->base)) + return tmb->nid; + } + + /* + * Fall back to allocating the TDMR's metadata from node 0 when + * no TDX memory block can be found. This should never happen + * since TDMRs originate from TDX memory blocks. + */ + pr_warn("TDMR [0x%llx, 0x%llx): unable to find local NUMA node for PAMT allocation, fallback to use node 0.\n", + tdmr->base, tdmr_end(tdmr)); + return 0; +} + +/* + * Allocate PAMTs from the local NUMA node of some memory in @tmb_list + * within @tdmr, and set up PAMTs for @tdmr. + */ +static int tdmr_set_up_pamt(struct tdmr_info *tdmr, + struct list_head *tmb_list, + u16 pamt_entry_size) +{ + unsigned long pamt_base[TDX_PS_1G + 1]; + unsigned long pamt_size[TDX_PS_1G + 1]; + unsigned long tdmr_pamt_base; + unsigned long tdmr_pamt_size; + struct page *pamt; + int pgsz, nid; + + nid = tdmr_get_nid(tdmr, tmb_list); + + /* + * Calculate the PAMT size for each TDX supported page size + * and the total PAMT size. + */ + tdmr_pamt_size = 0; + for (pgsz = TDX_PS_4K; pgsz <= TDX_PS_1G ; pgsz++) { + pamt_size[pgsz] = tdmr_get_pamt_sz(tdmr, pgsz, + pamt_entry_size); + tdmr_pamt_size += pamt_size[pgsz]; + } + + /* + * Allocate one chunk of physically contiguous memory for all + * PAMTs. This helps minimize the PAMT's use of reserved areas + * in overlapped TDMRs. + */ + pamt = alloc_contig_pages(tdmr_pamt_size >> PAGE_SHIFT, GFP_KERNEL, + nid, &node_online_map); + if (!pamt) + return -ENOMEM; + + /* + * Break the contiguous allocation back up into the + * individual PAMTs for each page size. + */ + tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT; + for (pgsz = TDX_PS_4K; pgsz <= TDX_PS_1G; pgsz++) { + pamt_base[pgsz] = tdmr_pamt_base; + tdmr_pamt_base += pamt_size[pgsz]; + } + + tdmr->pamt_4k_base = pamt_base[TDX_PS_4K]; + tdmr->pamt_4k_size = pamt_size[TDX_PS_4K]; + tdmr->pamt_2m_base = pamt_base[TDX_PS_2M]; + tdmr->pamt_2m_size = pamt_size[TDX_PS_2M]; + tdmr->pamt_1g_base = pamt_base[TDX_PS_1G]; + tdmr->pamt_1g_size = pamt_size[TDX_PS_1G]; + + return 0; +} + +static void tdmr_get_pamt(struct tdmr_info *tdmr, unsigned long *pamt_pfn, + unsigned long *pamt_npages) +{ + unsigned long pamt_base, pamt_sz; + + /* + * The PAMT was allocated in one contiguous unit. The 4K PAMT + * should always point to the beginning of that allocation. + */ + pamt_base = tdmr->pamt_4k_base; + pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size; + + *pamt_pfn = PHYS_PFN(pamt_base); + *pamt_npages = pamt_sz >> PAGE_SHIFT; +} + +static void tdmr_free_pamt(struct tdmr_info *tdmr) +{ + unsigned long pamt_pfn, pamt_npages; + + tdmr_get_pamt(tdmr, &pamt_pfn, &pamt_npages); + + /* Do nothing if PAMT hasn't been allocated for this TDMR */ + if (!pamt_npages) + return; + + if (WARN_ON_ONCE(!pamt_pfn)) + return; + + free_contig_range(pamt_pfn, pamt_npages); +} + +static void tdmrs_free_pamt_all(struct tdmr_info_list *tdmr_list) +{ + int i; + + for (i = 0; i < tdmr_list->nr_tdmrs; i++) + tdmr_free_pamt(tdmr_entry(tdmr_list, i)); +} + +/* Allocate and set up PAMTs for all TDMRs */ +static int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list, + struct list_head *tmb_list, + u16 pamt_entry_size) +{ + int i, ret = 0; + + for (i = 0; i < tdmr_list->nr_tdmrs; i++) { + ret = tdmr_set_up_pamt(tdmr_entry(tdmr_list, i), tmb_list, + pamt_entry_size); + if (ret) + goto err; + } + + return 0; +err: + tdmrs_free_pamt_all(tdmr_list); + return ret; +} + +static unsigned long tdmrs_count_pamt_pages(struct tdmr_info_list *tdmr_list) +{ + unsigned long pamt_npages = 0; + int i; + + for (i = 0; i < tdmr_list->nr_tdmrs; i++) { + unsigned long pfn, npages; + + tdmr_get_pamt(tdmr_entry(tdmr_list, i), &pfn, &npages); + pamt_npages += npages; + } + + return pamt_npages; +} + /* * Construct a list of TDMRs on the preallocated space in @tdmr_list * to cover all TDX memory regions in @tmb_list based on the TDX module @@ -506,15 +702,19 @@ static int construct_tdmrs(struct list_head *tmb_list, if (ret) goto err; + ret = tdmrs_set_up_pamt_all(tdmr_list, tmb_list, + sysinfo->pamt_entry_size); + if (ret) + goto err; /* * TODO: * - * - Allocate and set up PAMTs for each TDMR. * - Designate reserved areas for each TDMR. * * Return -EINVAL until constructing TDMRs is done */ ret = -EINVAL; + tdmrs_free_pamt_all(tdmr_list); err: return ret; } @@ -574,6 +774,11 @@ static int init_tdx_module(void) * Return error before all steps are done. */ ret = -EINVAL; + if (ret) + tdmrs_free_pamt_all(&tdmr_list); + else + pr_info("%lu pages allocated for PAMT.\n", + tdmrs_count_pamt_pages(&tdmr_list)); out_free_tdmrs: /* * Free the space for the TDMRs no matter the initialization is From patchwork Fri Dec 9 06:52:32 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13069323 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8EF95C4708D for ; Fri, 9 Dec 2022 06:56:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230055AbiLIG4M (ORCPT ); Fri, 9 Dec 2022 01:56:12 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39554 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229661AbiLIGzs (ORCPT ); Fri, 9 Dec 2022 01:55:48 -0500 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1E11D37F98; Thu, 8 Dec 2022 22:54:06 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1670568846; x=1702104846; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=zj8VYCQPKpJr/rbKEs7DLA4ziOlJLcuFQ401mUvTogY=; b=h/4/TNmHj3+fhc5MC/Lt7Z8HqsqzOVGqm14m6nNicrr7SPXaVhK6gMlA 1uODRZBhEIr6h+hKajk/coadJscGfPn2d8lNHYXNBzi3MQhSx0b7iRfiG W7nQugSpazhffirouAyH78JxIE8lCQwyNqoR9LLS3FKw0qWdYYPkXLJhJ +gSqLyrqY/saAiiyHvOl1YqQXcwhTdchsT+oykBadPsQ5AeyNiZaR9CMS t009iW1bwyvCxl3Zp5zYPs8q7ig9sycaGFyEKSSpG40qH4C0DB9YdbNOU 8eNpZnq7jWemClRjQJeNxfw1Flar8AGMX9Fcn+d4y8mAlIP5crKeap2EP g==; X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="318551438" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="318551438" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:57 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="679837035" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="679837035" Received: from omiramon-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.212.28.82]) by orsmga001-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:52 -0800 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, dave.hansen@intel.com, peterz@infradead.org, tglx@linutronix.de, seanjc@google.com, pbonzini@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, kirill.shutemov@linux.intel.com, ying.huang@intel.com, reinette.chatre@intel.com, len.brown@intel.com, tony.luck@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v8 11/16] x86/virt/tdx: Designate reserved areas for all TDMRs Date: Fri, 9 Dec 2022 19:52:32 +1300 Message-Id: <27dcd2781a450b3f77a2aec833de6a3669bc0fb8.1670566861.git.kai.huang@intel.com> X-Mailer: git-send-email 2.38.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org As the last step of constructing TDMRs, populate reserved areas for all TDMRs. For each TDMR, put all memory holes within this TDMR to the reserved areas. And for all PAMTs which overlap with this TDMR, put all the overlapping parts to reserved areas too. Reviewed-by: Isaku Yamahata Signed-off-by: Kai Huang --- v7 -> v8: (Dave) - "set_up" -> "populate" in function name change (Dave). - Improved comment suggested by Dave. - Other changes due to 'struct tdmr_info_list'. v6 -> v7: - No change. v5 -> v6: - Rebase due to using 'tdx_memblock' instead of memblock. - Split tdmr_set_up_rsvd_areas() into two functions to handle memory hole and PAMT respectively. - Added Isaku's Reviewed-by. --- arch/x86/virt/vmx/tdx/tdx.c | 213 ++++++++++++++++++++++++++++++++++-- 1 file changed, 205 insertions(+), 8 deletions(-) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index cf970a783f1f..620b35e2a61b 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -687,6 +688,202 @@ static unsigned long tdmrs_count_pamt_pages(struct tdmr_info_list *tdmr_list) return pamt_npages; } +static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx, u64 addr, + u64 size, u16 max_reserved_per_tdmr) +{ + struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas; + int idx = *p_idx; + + /* Reserved area must be 4K aligned in offset and size */ + if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK)) + return -EINVAL; + + if (idx >= max_reserved_per_tdmr) + return -E2BIG; + + rsvd_areas[idx].offset = addr - tdmr->base; + rsvd_areas[idx].size = size; + + *p_idx = idx + 1; + + return 0; +} + +/* + * Go through @tmb_list to find holes between memory areas. If any of + * those holes fall within @tdmr, set up a TDMR reserved area to cover + * the hole. + */ +static int tdmr_populate_rsvd_holes(struct list_head *tmb_list, + struct tdmr_info *tdmr, + int *rsvd_idx, + u16 max_reserved_per_tdmr) +{ + struct tdx_memblock *tmb; + u64 prev_end; + int ret; + + /* + * Start looking for reserved blocks at the + * beginning of the TDMR. + */ + prev_end = tdmr->base; + list_for_each_entry(tmb, tmb_list, list) { + u64 start, end; + + start = PFN_PHYS(tmb->start_pfn); + end = PFN_PHYS(tmb->end_pfn); + + /* Break if this region is after the TDMR */ + if (start >= tdmr_end(tdmr)) + break; + + /* Exclude regions before this TDMR */ + if (end < tdmr->base) + continue; + + /* + * Skip over memory areas that + * have already been dealt with. + */ + if (start <= prev_end) { + prev_end = end; + continue; + } + + /* Add the hole before this region */ + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end, + start - prev_end, + max_reserved_per_tdmr); + if (ret) + return ret; + + prev_end = end; + } + + /* Add the hole after the last region if it exists. */ + if (prev_end < tdmr_end(tdmr)) { + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, prev_end, + tdmr_end(tdmr) - prev_end, + max_reserved_per_tdmr); + if (ret) + return ret; + } + + return 0; +} + +/* + * Go through @tdmr_list to find all PAMTs. If any of those PAMTs + * overlaps with @tdmr, set up a TDMR reserved area to cover the + * overlapping part. + */ +static int tdmr_populate_rsvd_pamts(struct tdmr_info_list *tdmr_list, + struct tdmr_info *tdmr, + int *rsvd_idx, + u16 max_reserved_per_tdmr) +{ + int i, ret; + + for (i = 0; i < tdmr_list->nr_tdmrs; i++) { + struct tdmr_info *tmp = tdmr_entry(tdmr_list, i); + unsigned long pamt_start_pfn, pamt_npages; + u64 pamt_start, pamt_end; + + tdmr_get_pamt(tmp, &pamt_start_pfn, &pamt_npages); + /* Each TDMR must already have PAMT allocated */ + WARN_ON_ONCE(!pamt_npages || !pamt_start_pfn); + + pamt_start = PFN_PHYS(pamt_start_pfn); + pamt_end = PFN_PHYS(pamt_start_pfn + pamt_npages); + + /* Skip PAMTs outside of the given TDMR */ + if ((pamt_end <= tdmr->base) || + (pamt_start >= tdmr_end(tdmr))) + continue; + + /* Only mark the part within the TDMR as reserved */ + if (pamt_start < tdmr->base) + pamt_start = tdmr->base; + if (pamt_end > tdmr_end(tdmr)) + pamt_end = tdmr_end(tdmr); + + ret = tdmr_add_rsvd_area(tdmr, rsvd_idx, pamt_start, + pamt_end - pamt_start, + max_reserved_per_tdmr); + if (ret) + return ret; + } + + return 0; +} + +/* Compare function called by sort() for TDMR reserved areas */ +static int rsvd_area_cmp_func(const void *a, const void *b) +{ + struct tdmr_reserved_area *r1 = (struct tdmr_reserved_area *)a; + struct tdmr_reserved_area *r2 = (struct tdmr_reserved_area *)b; + + if (r1->offset + r1->size <= r2->offset) + return -1; + if (r1->offset >= r2->offset + r2->size) + return 1; + + /* Reserved areas cannot overlap. The caller must guarantee. */ + WARN_ON_ONCE(1); + return -1; +} + +/* + * Populate reserved areas for the given @tdmr, including memory holes + * (via @tmb_list) and PAMTs (via @tdmr_list). + */ +static int tdmr_populate_rsvd_areas(struct tdmr_info *tdmr, + struct list_head *tmb_list, + struct tdmr_info_list *tdmr_list, + u16 max_reserved_per_tdmr) +{ + int ret, rsvd_idx = 0; + + ret = tdmr_populate_rsvd_holes(tmb_list, tdmr, &rsvd_idx, + max_reserved_per_tdmr); + if (ret) + return ret; + + ret = tdmr_populate_rsvd_pamts(tdmr_list, tdmr, &rsvd_idx, + max_reserved_per_tdmr); + if (ret) + return ret; + + /* TDX requires reserved areas listed in address ascending order */ + sort(tdmr->reserved_areas, rsvd_idx, sizeof(struct tdmr_reserved_area), + rsvd_area_cmp_func, NULL); + + return 0; +} + +/* + * Populate reserved areas for all TDMRs in @tdmr_list, including memory + * holes (via @tmb_list) and PAMTs. + */ +static int tdmrs_populate_rsvd_areas_all(struct tdmr_info_list *tdmr_list, + struct list_head *tmb_list, + u16 max_reserved_per_tdmr) +{ + int i; + + for (i = 0; i < tdmr_list->nr_tdmrs; i++) { + int ret; + + ret = tdmr_populate_rsvd_areas(tdmr_entry(tdmr_list, i), + tmb_list, tdmr_list, max_reserved_per_tdmr); + if (ret) + return ret; + } + + return 0; +} + /* * Construct a list of TDMRs on the preallocated space in @tdmr_list * to cover all TDX memory regions in @tmb_list based on the TDX module @@ -706,14 +903,14 @@ static int construct_tdmrs(struct list_head *tmb_list, sysinfo->pamt_entry_size); if (ret) goto err; - /* - * TODO: - * - * - Designate reserved areas for each TDMR. - * - * Return -EINVAL until constructing TDMRs is done - */ - ret = -EINVAL; + + ret = tdmrs_populate_rsvd_areas_all(tdmr_list, tmb_list, + sysinfo->max_reserved_per_tdmr); + if (ret) + goto err_free_pamts; + + return 0; +err_free_pamts: tdmrs_free_pamt_all(tdmr_list); err: return ret; From patchwork Fri Dec 9 06:52:33 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13069324 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id CA789C4332F for ; Fri, 9 Dec 2022 06:56:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230078AbiLIG4N (ORCPT ); Fri, 9 Dec 2022 01:56:13 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40926 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229907AbiLIGzs (ORCPT ); Fri, 9 Dec 2022 01:55:48 -0500 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BA517ACB2C; Thu, 8 Dec 2022 22:54:06 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1670568846; x=1702104846; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=HQ0ARWB5QqQjACtiLE4nnaxso1L0OZgpdlJycabR6Bk=; b=VYO7EMeluQfsCE+7cNveQtJ/5JDx5+OeefJTXSz2UGdg4u9M2MroDl11 aGHf7v724uwbiqsBI8vz4lA2KTyMi+DytG6GV9hznLMLUOALQ5ukQ5EEa XukCf4BHL6ZhYzmZYzeSBzTYf8wSb5I5BhkIj13BNks/ZC9TfpyrKCPKM NmiBKndu/b0H3OBWCib3vidp1KrqdKw0bHqoiCrwna7Ha6EysnSE6M2xx I4sDRGN4XNuOPNhULdk1iFKVwL1ROlOiG63MXwjf5ryxv1WthNsQvAIgm oGeBHPGoJGM3eCjmM1BIFUrw5WgdLdRlrsbXqyUttAFBc9fY2TULXQfE6 w==; X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="318551455" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="318551455" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:54:02 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="679837066" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="679837066" Received: from omiramon-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.212.28.82]) by orsmga001-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:53:57 -0800 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, dave.hansen@intel.com, peterz@infradead.org, tglx@linutronix.de, seanjc@google.com, pbonzini@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, kirill.shutemov@linux.intel.com, ying.huang@intel.com, reinette.chatre@intel.com, len.brown@intel.com, tony.luck@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v8 12/16] x86/virt/tdx: Designate the global KeyID and configure the TDX module Date: Fri, 9 Dec 2022 19:52:33 +1300 Message-Id: X-Mailer: git-send-email 2.38.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org After a list of "TD Memory Regions" (TDMRs) has been constructed to cover all TDX-usable memory regions, the next step is to pick up a TDX private KeyID as the "global KeyID" (which protects, i.e. TDX module's metadata), and configure it to the TDX module along with the TDMRs. To keep things simple, just use the first TDX KeyID as the global KeyID. Reviewed-by: Isaku Yamahata Signed-off-by: Kai Huang --- v7 -> v8: - Merged "Reserve TDX module global KeyID" patch to this patch, and removed 'tdx_global_keyid' but use 'tdx_keyid_start' directly. - Changed changelog accordingly. - Changed how to allocate aligned array (Dave). --- arch/x86/virt/vmx/tdx/tdx.c | 41 +++++++++++++++++++++++++++++++++++-- arch/x86/virt/vmx/tdx/tdx.h | 2 ++ 2 files changed, 41 insertions(+), 2 deletions(-) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 620b35e2a61b..ab961443fed5 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -916,6 +916,36 @@ static int construct_tdmrs(struct list_head *tmb_list, return ret; } +static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid) +{ + u64 *tdmr_pa_array, *p; + size_t array_sz; + int i, ret; + + /* + * TDMRs are passed to the TDX module via an array of physical + * addresses of each TDMR. The array itself has alignment + * requirement. + */ + array_sz = tdmr_list->nr_tdmrs * sizeof(u64) + + TDMR_INFO_PA_ARRAY_ALIGNMENT - 1; + p = kzalloc(array_sz, GFP_KERNEL); + if (!p) + return -ENOMEM; + + tdmr_pa_array = PTR_ALIGN(p, TDMR_INFO_PA_ARRAY_ALIGNMENT); + for (i = 0; i < tdmr_list->nr_tdmrs; i++) + tdmr_pa_array[i] = __pa(tdmr_entry(tdmr_list, i)); + + ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array), tdmr_list->nr_tdmrs, + global_keyid, 0, NULL, NULL); + + /* Free the array as it is not required anymore. */ + kfree(p); + + return ret; +} + static int init_tdx_module(void) { /* @@ -960,17 +990,24 @@ static int init_tdx_module(void) if (ret) goto out_free_tdmrs; + /* + * Use the first private KeyID as the global KeyID, and pass + * it along with the TDMRs to the TDX module. + */ + ret = config_tdx_module(&tdmr_list, tdx_keyid_start); + if (ret) + goto out_free_pamts; + /* * TODO: * - * - Pick up one TDX private KeyID as the global KeyID. - * - Configure the TDMRs and the global KeyID to the TDX module. * - Configure the global KeyID on all packages. * - Initialize all TDMRs. * * Return error before all steps are done. */ ret = -EINVAL; +out_free_pamts: if (ret) tdmrs_free_pamt_all(&tdmr_list); else diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h index d0c762f1a94c..4d2edd477480 100644 --- a/arch/x86/virt/vmx/tdx/tdx.h +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -20,6 +20,7 @@ * TDX module SEAMCALL leaf functions */ #define TDH_SYS_INFO 32 +#define TDH_SYS_CONFIG 45 struct cmr_info { u64 base; @@ -96,6 +97,7 @@ struct tdmr_reserved_area { } __packed; #define TDMR_INFO_ALIGNMENT 512 +#define TDMR_INFO_PA_ARRAY_ALIGNMENT 512 struct tdmr_info { u64 base; From patchwork Fri Dec 9 06:52:34 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13069325 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id ACC25C4332F for ; Fri, 9 Dec 2022 06:56:53 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230095AbiLIG4w (ORCPT ); Fri, 9 Dec 2022 01:56:52 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40840 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229674AbiLIG4D (ORCPT ); Fri, 9 Dec 2022 01:56:03 -0500 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 38684AE4D3; Thu, 8 Dec 2022 22:54:27 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1670568867; x=1702104867; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=r1pswkKPmCuwe3o1L5UTwD6LpxkqI/Lu1UdlGzXTNBc=; b=D1JJT1gvcmgonF/O8kBB5LqYwixtDGOeOEA6/s5dkUgdQPk0g/JaMSKR 7Jxfjth4kBvq4yDlOb570R1A9/RHROsL/ZyerNgSDViiXtre/91jS7Tld Vr8X417UAybuRmYArEUQDL/PCLBqpuJN0N8uJ5fuPyG8PqinP/uWZZA+S 5KnikKilupFwZi2o5wsRwJ4VpTC+92vgZesZxLO+LyFCzhz/O+kDkX6K8 B6Y8cOrsSS2kbDUUcNSMTiwQyfNhxzuBReTnG1TtatKuMzVoEFILR95O8 gJlJgzCjarQ85LIDBzfcZWVzdrIggrKCxlBWoAIErmjUjKxG0WfIpP0iE Q==; X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="318551477" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="318551477" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:54:09 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="679837108" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="679837108" Received: from omiramon-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.212.28.82]) by orsmga001-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:54:02 -0800 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, dave.hansen@intel.com, peterz@infradead.org, tglx@linutronix.de, seanjc@google.com, pbonzini@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, kirill.shutemov@linux.intel.com, ying.huang@intel.com, reinette.chatre@intel.com, len.brown@intel.com, tony.luck@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v8 13/16] x86/virt/tdx: Configure global KeyID on all packages Date: Fri, 9 Dec 2022 19:52:34 +1300 Message-Id: <383a2fb52a36d1e772bc547c289c5aeb8ea5d9cb.1670566861.git.kai.huang@intel.com> X-Mailer: git-send-email 2.38.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org After the list of TDMRs and the global KeyID are configured to the TDX module, the kernel needs to configure the key of the global KeyID on all packages using TDH.SYS.KEY.CONFIG. TDH.SYS.KEY.CONFIG needs to be done on one (any) cpu for each package. Also, it cannot run concurrently on different cpus, so just use smp_call_function_single() to do it one by one. Note to keep things simple, neither the function to configure the global KeyID on all packages nor the tdx_enable() checks whether there's at least one online cpu for each package. Also, neither of them explicitly prevents any cpu from going offline. It is caller's responsibility to guarantee this. Intel hardware doesn't guarantee cache coherency across different KeyIDs. The kernel needs to flush PAMT's dirty cachelines (associated with KeyID 0) before the TDX module uses the global KeyID to access the PAMT. Otherwise, those dirty cachelines can silently corrupt the TDX module's metadata. Note this breaks TDX from functionality point of view but TDX's security remains intact. Following the TDX module specification, flush cache before configuring the global KeyID on all packages. Given the PAMT size can be large (~1/256th of system RAM), just use WBINVD on all CPUs to flush. Note if any TDH.SYS.KEY.CONFIG fails, the TDX module may already have used the global KeyID to write any PAMT. Therefore, need to use WBINVD to flush cache before freeing the PAMTs back to the kernel. Reviewed-by: Isaku Yamahata Signed-off-by: Kai Huang --- v7 -> v8: (Dave) - Changelog changes: - Point out this is the step of "multi-steps" of init_tdx_module(). - Removed MOVDIR64B part. - Other changes due to removing TDH.SYS.SHUTDOWN and TDH.SYS.LP.INIT. - Changed to loop over online cpus and use smp_call_function_single() directly as the patch to shut down TDX module has been removed. - Removed MOVDIR64B part in comment. v6 -> v7: - Improved changelong and comment to explain why MOVDIR64B isn't used when returning PAMTs back to the kernel. --- arch/x86/virt/vmx/tdx/tdx.c | 97 +++++++++++++++++++++++++++++++++++-- arch/x86/virt/vmx/tdx/tdx.h | 1 + 2 files changed, 94 insertions(+), 4 deletions(-) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index ab961443fed5..4c779e8412f1 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -946,6 +946,66 @@ static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid) return ret; } +static void do_global_key_config(void *data) +{ + int ret; + + /* + * TDH.SYS.KEY.CONFIG may fail with entropy error (which is a + * recoverable error). Assume this is exceedingly rare and + * just return error if encountered instead of retrying. + */ + ret = seamcall(TDH_SYS_KEY_CONFIG, 0, 0, 0, 0, NULL, NULL); + + *(int *)data = ret; +} + +/* + * Configure the global KeyID on all packages by doing TDH.SYS.KEY.CONFIG + * on one online cpu for each package. If any package doesn't have any + * online + * + * Note: + * + * This function neither checks whether there's at least one online cpu + * for each package, nor explicitly prevents any cpu from going offline. + * If any package doesn't have any online cpu then the SEAMCALL won't be + * done on that package and the later step of TDX module initialization + * will fail. The caller needs to guarantee this. + */ +static int config_global_keyid(void) +{ + cpumask_var_t packages; + int cpu, ret = 0; + + if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) + return -ENOMEM; + + for_each_online_cpu(cpu) { + int err; + + if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu), + packages)) + continue; + + /* + * TDH.SYS.KEY.CONFIG cannot run concurrently on + * different cpus, so just do it one by one. + */ + ret = smp_call_function_single(cpu, do_global_key_config, &err, + true); + if (ret) + break; + if (err) { + ret = err; + break; + } + } + + free_cpumask_var(packages); + return ret; +} + static int init_tdx_module(void) { /* @@ -998,19 +1058,46 @@ static int init_tdx_module(void) if (ret) goto out_free_pamts; + /* + * Hardware doesn't guarantee cache coherency across different + * KeyIDs. The kernel needs to flush PAMT's dirty cachelines + * (associated with KeyID 0) before the TDX module can use the + * global KeyID to access the PAMT. Given PAMTs are potentially + * large (~1/256th of system RAM), just use WBINVD on all cpus + * to flush the cache. + * + * Follow the TDX spec to flush cache before configuring the + * global KeyID on all packages. + */ + wbinvd_on_all_cpus(); + + /* Config the key of global KeyID on all packages */ + ret = config_global_keyid(); + if (ret) + goto out_free_pamts; + /* * TODO: * - * - Configure the global KeyID on all packages. * - Initialize all TDMRs. * * Return error before all steps are done. */ ret = -EINVAL; out_free_pamts: - if (ret) + if (ret) { + /* + * Part of PAMT may already have been initialized by the + * TDX module. Flush cache before returning PAMT back + * to the kernel. + * + * No need to worry about integrity checks here. KeyID + * 0 has integrity checking disabled. + */ + wbinvd_on_all_cpus(); + tdmrs_free_pamt_all(&tdmr_list); - else + } else pr_info("%lu pages allocated for PAMT.\n", tdmrs_count_pamt_pages(&tdmr_list)); out_free_tdmrs: @@ -1057,7 +1144,9 @@ static int __tdx_enable(void) * tdx_enable - Enable TDX by initializing the TDX module * * The caller must make sure all online cpus are in VMX operation before - * calling this function. + * calling this function. Also, the caller must make sure there is at + * least one online cpu for each package, and to prevent any cpu from + * going offline during this function. * * This function can be called in parallel by multiple callers. * diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h index 4d2edd477480..f5c12a2543d4 100644 --- a/arch/x86/virt/vmx/tdx/tdx.h +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -19,6 +19,7 @@ /* * TDX module SEAMCALL leaf functions */ +#define TDH_SYS_KEY_CONFIG 31 #define TDH_SYS_INFO 32 #define TDH_SYS_CONFIG 45 From patchwork Fri Dec 9 06:52:35 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13069326 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 99726C4332F for ; Fri, 9 Dec 2022 06:57:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229512AbiLIG5C (ORCPT ); Fri, 9 Dec 2022 01:57:02 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39538 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229470AbiLIG4H (ORCPT ); Fri, 9 Dec 2022 01:56:07 -0500 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E6C14AF4C6; Thu, 8 Dec 2022 22:54:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1670568870; x=1702104870; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=pEzHyd4Ac6oG5xApJ9gB+fF+e/HTbnBocyb4jHOlMgw=; b=EM9LWwvWnteHyZfx3QTWLi9fciiyoxZBr5+upEk0o38LPpnw2qD73mYI VhUFHJnMD/FcfhLSNZEt0Ere2OPx4JTh6Pg6wMrlJnmGhiq27MhMcBu3L ToFcohpOMD3mbdlO/1dBj2ebAdTICHFhm2e7HE9Ov5hPu3+mXNLmfIzkH w0tMvBZLuVebmzIMWxYQt5MIrZe0VHjaToHri5nYZsIiVSNbmgJKnms/H gNYklZtJXyKhIjzEcqS3ar3EV5VJgWFmzZeNUNz0bIkRgBrA3iyu3eNtD unB6sq+bKv7yn0t+WcvC5BRx3CfKMMuS/OlBT95M+sqDOsl7yj2/T5gmU A==; X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="318551509" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="318551509" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:54:14 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="679837143" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="679837143" Received: from omiramon-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.212.28.82]) by orsmga001-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:54:09 -0800 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, dave.hansen@intel.com, peterz@infradead.org, tglx@linutronix.de, seanjc@google.com, pbonzini@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, kirill.shutemov@linux.intel.com, ying.huang@intel.com, reinette.chatre@intel.com, len.brown@intel.com, tony.luck@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v8 14/16] x86/virt/tdx: Initialize all TDMRs Date: Fri, 9 Dec 2022 19:52:35 +1300 Message-Id: <16d47f9611d53c0f07a4af2b5739ed83e41b6e48.1670566861.git.kai.huang@intel.com> X-Mailer: git-send-email 2.38.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org After the global KeyID has been configured on all packages, initialize all TDMRs to make all TDX-usable memory regions that are passed to the TDX module become usable. This is the last step of initializing the TDX module. Initializing different TDMRs can be parallelized. For now to keep it simple, just initialize all TDMRs one by one. It can be enhanced in the future. Reviewed-by: Isaku Yamahata Signed-off-by: Kai Huang --- v7 -> v8: (Dave) - Changelog: - explicitly call out this is the last step of TDX module initialization. - Trimed down changelog by removing SEAMCALL name and details. - Removed/trimmed down unnecessary comments. - Other changes due to 'struct tdmr_info_list'. v6 -> v7: - Removed need_resched() check. -- Andi. --- arch/x86/virt/vmx/tdx/tdx.c | 61 ++++++++++++++++++++++++++++++++----- arch/x86/virt/vmx/tdx/tdx.h | 1 + 2 files changed, 54 insertions(+), 8 deletions(-) diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c index 4c779e8412f1..8b7314f19df2 100644 --- a/arch/x86/virt/vmx/tdx/tdx.c +++ b/arch/x86/virt/vmx/tdx/tdx.c @@ -1006,6 +1006,55 @@ static int config_global_keyid(void) return ret; } +static int init_tdmr(struct tdmr_info *tdmr) +{ + u64 next; + + /* + * Initializing a TDMR can be time consuming. To avoid long + * SEAMCALLs, the TDX module may only initialize a part of the + * TDMR in each call. + */ + do { + struct tdx_module_output out; + int ret; + + /* All 0's are unused parameters, they mean nothing. */ + ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0, NULL, + &out); + if (ret) + return ret; + /* + * RDX contains 'next-to-initialize' address if + * TDH.SYS.TDMR.INT succeeded. + */ + next = out.rdx; + cond_resched(); + /* Keep making SEAMCALLs until the TDMR is done */ + } while (next < tdmr->base + tdmr->size); + + return 0; +} + +static int init_tdmrs(struct tdmr_info_list *tdmr_list) +{ + int i; + + /* + * This operation is costly. It can be parallelized, + * but keep it simple for now. + */ + for (i = 0; i < tdmr_list->nr_tdmrs; i++) { + int ret; + + ret = init_tdmr(tdmr_entry(tdmr_list, i)); + if (ret) + return ret; + } + + return 0; +} + static int init_tdx_module(void) { /* @@ -1076,14 +1125,10 @@ static int init_tdx_module(void) if (ret) goto out_free_pamts; - /* - * TODO: - * - * - Initialize all TDMRs. - * - * Return error before all steps are done. - */ - ret = -EINVAL; + /* Initialize TDMRs to complete the TDX module initialization */ + ret = init_tdmrs(&tdmr_list); + if (ret) + goto out_free_pamts; out_free_pamts: if (ret) { /* diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h index f5c12a2543d4..163c4876dee4 100644 --- a/arch/x86/virt/vmx/tdx/tdx.h +++ b/arch/x86/virt/vmx/tdx/tdx.h @@ -21,6 +21,7 @@ */ #define TDH_SYS_KEY_CONFIG 31 #define TDH_SYS_INFO 32 +#define TDH_SYS_TDMR_INIT 36 #define TDH_SYS_CONFIG 45 struct cmr_info { From patchwork Fri Dec 9 06:52:36 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13069327 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1B56AC4332F for ; Fri, 9 Dec 2022 06:57:08 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230036AbiLIG5G (ORCPT ); Fri, 9 Dec 2022 01:57:06 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40888 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229901AbiLIG4I (ORCPT ); Fri, 9 Dec 2022 01:56:08 -0500 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 70939AF4C7; Thu, 8 Dec 2022 22:54:31 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1670568871; x=1702104871; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=47FUNXOBhS2NG9KMSmMIewrFFmTIx7+CnlvTMHcGc9w=; b=FOjc4LttxCa5ddL9CFKoKJkclr6jmi+UTDb/FzEyo1Og78qBO1NJqCBV kmkG6fpRZR2rj6J7SJsd+t3pWv3pQLeZvL/PBorjkZOU26a7UTrMvXFBy 9DWcl1gkAQu2bVV5VqK/fvJ0UTSBOFpRCobdqD63Qd9UOAP9yR6keNc2/ iEPLFtQRLrhYNbuzZdpzYeKpAQd/YHj5lxlLBjr0r0H3iXVjWSnaQIC6X l7gsKL0L4JYOJQVK4/EOBOt46Ba/gfiIz/CWOXwVww/Z1Qn/6GYPjuRSz jeFlzBmwlZXmWdj5Ag3lIPZwokYR9UN/HVKWqntKRXZ/J5ZOAl6vKK2Vv Q==; X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="318551540" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="318551540" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:54:19 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="679837169" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="679837169" Received: from omiramon-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.212.28.82]) by orsmga001-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:54:14 -0800 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, dave.hansen@intel.com, peterz@infradead.org, tglx@linutronix.de, seanjc@google.com, pbonzini@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, kirill.shutemov@linux.intel.com, ying.huang@intel.com, reinette.chatre@intel.com, len.brown@intel.com, tony.luck@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v8 15/16] x86/virt/tdx: Flush cache in kexec() when TDX is enabled Date: Fri, 9 Dec 2022 19:52:36 +1300 Message-Id: X-Mailer: git-send-email 2.38.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org There are two problems in terms of using kexec() to boot to a new kernel when the old kernel has enabled TDX: 1) Part of the memory pages are still TDX private pages (i.e. metadata used by the TDX module, and any TDX guest memory if kexec() happens when there's any TDX guest alive). 2) There might be dirty cachelines associated with TDX private pages. Because the hardware doesn't guarantee cache coherency among different KeyIDs, the old kernel needs to flush cache (of those TDX private pages) before booting to the new kernel. Also, reading TDX private page using any shared non-TDX KeyID with integrity-check enabled can trigger #MC. Therefore ideally, the kernel should convert all TDX private pages back to normal before booting to the new kernel. However, this implementation doesn't convert TDX private pages back to normal in kexec() because of below considerations: 1) Neither the kernel nor the TDX module has existing infrastructure to track which pages are TDX private pages. 2) The number of TDX private pages can be large, and converting all of them (cache flush + using MOVDIR64B to clear the page) in kexec() can be time consuming. 3) The new kernel will almost only use KeyID 0 to access memory. KeyID 0 doesn't support integrity-check, so it's OK. 4) The kernel doesn't (and may never) support MKTME. If any 3rd party kernel ever supports MKTME, it can/should do MOVDIR64B to clear the page with the new MKTME KeyID (just like TDX does) before using it. Therefore, this implementation just flushes cache to make sure there are no stale dirty cachelines associated with any TDX private KeyIDs before booting to the new kernel, otherwise they may silently corrupt the new kernel. Following SME support, use wbinvd() to flush cache in stop_this_cpu(). Theoretically, cache flush is only needed when the TDX module has been initialized. However initializing the TDX module is done on demand at runtime, and it takes a mutex to read the module status. Just check whether TDX is enabled by BIOS instead to flush cache. Reviewed-by: Isaku Yamahata Signed-off-by: Kai Huang --- v7 -> v8: - Changelog: - Removed "leave TDX module open" part due to shut down patch has been removed. v6 -> v7: - Improved changelog to explain why don't convert TDX private pages back to normal. --- arch/x86/kernel/process.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index c21b7347a26d..0cc84977dc62 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -765,8 +765,14 @@ void __noreturn stop_this_cpu(void *dummy) * * Test the CPUID bit directly because the machine might've cleared * X86_FEATURE_SME due to cmdline options. + * + * Similar to SME, if the TDX module is ever initialized, the + * cachelines associated with any TDX private KeyID must be flushed + * before transiting to the new kernel. The TDX module is initialized + * on demand, and it takes the mutex to read its status. Just check + * whether TDX is enabled by BIOS instead to flush cache. */ - if (cpuid_eax(0x8000001f) & BIT(0)) + if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled()) native_wbinvd(); for (;;) { /* From patchwork Fri Dec 9 06:52:37 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Huang, Kai" X-Patchwork-Id: 13069328 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 80FE0C4332F for ; Fri, 9 Dec 2022 06:57:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230136AbiLIG5b (ORCPT ); Fri, 9 Dec 2022 01:57:31 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40698 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229966AbiLIG4p (ORCPT ); Fri, 9 Dec 2022 01:56:45 -0500 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2EAA17D04A; Thu, 8 Dec 2022 22:54:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1670568883; x=1702104883; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=A8Dl7ZGsiYbv17bp8M14VPIPaDUgGEKolvJTmYFdtsE=; b=dVVDUHt+TRtq6V3Ej2jzo1XECMUzBHMVwsLrp+uaLKBCM8fH/00rYDCY WAxzbcEsIO8kTYsQC2dFYMPZo6NBSlW9Fa8382iwesjzt0u0OspNolicz Ydb4OK/YqBKT/dLNMVjuzPrUnrfWckX5nu3JDd1EUJF/S+MHt7GiwPPEg 4KVETfsIxt/+XUK9tDdtlA8bUBczfVaRLoQxeWzgijt/r3cTDFWIYD8h4 SC/ML9Gfv1b2TNy1bXgN3+PkwR8Cj8xIulAyH2FCAXvdRW92qiEzIQ30q 6qzkhyPvlWIgHzk8QcNIMEx0q/w9Vomy/jSQUfvdjlf1u4JRBNu920OrR A==; X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="318551557" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="318551557" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:54:24 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10555"; a="679837193" X-IronPort-AV: E=Sophos;i="5.96,230,1665471600"; d="scan'208";a="679837193" Received: from omiramon-mobl1.amr.corp.intel.com (HELO khuang2-desk.gar.corp.intel.com) ([10.212.28.82]) by orsmga001-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 08 Dec 2022 22:54:19 -0800 From: Kai Huang To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org Cc: linux-mm@kvack.org, dave.hansen@intel.com, peterz@infradead.org, tglx@linutronix.de, seanjc@google.com, pbonzini@redhat.com, dan.j.williams@intel.com, rafael.j.wysocki@intel.com, kirill.shutemov@linux.intel.com, ying.huang@intel.com, reinette.chatre@intel.com, len.brown@intel.com, tony.luck@intel.com, ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com, sagis@google.com, imammedo@redhat.com, kai.huang@intel.com Subject: [PATCH v8 16/16] Documentation/x86: Add documentation for TDX host support Date: Fri, 9 Dec 2022 19:52:37 +1300 Message-Id: <11fb07f0ca78c89c4850adbf9b0d146201c98ef2.1670566861.git.kai.huang@intel.com> X-Mailer: git-send-email 2.38.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Add documentation for TDX host kernel support. There is already one file Documentation/x86/tdx.rst containing documentation for TDX guest internals. Also reuse it for TDX host kernel support. Introduce a new level menu "TDX Guest Support" and move existing materials under it, and add a new menu for TDX host kernel support. Signed-off-by: Kai Huang --- Documentation/x86/tdx.rst | 169 +++++++++++++++++++++++++++++++++++--- 1 file changed, 158 insertions(+), 11 deletions(-) diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst index dc8d9fd2c3f7..207b91610b36 100644 --- a/Documentation/x86/tdx.rst +++ b/Documentation/x86/tdx.rst @@ -10,6 +10,153 @@ encrypting the guest memory. In TDX, a special module running in a special mode sits between the host and the guest and manages the guest/host separation. +TDX Host Kernel Support +======================= + +TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM) and +a new isolated range pointed by the SEAM Ranger Register (SEAMRR). A +CPU-attested software module called 'the TDX module' runs inside the new +isolated range to provide the functionalities to manage and run protected +VMs. + +TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to +provide crypto-protection to the VMs. TDX reserves part of MKTME KeyIDs +as TDX private KeyIDs, which are only accessible within the SEAM mode. +BIOS is responsible for partitioning legacy MKTME KeyIDs and TDX KeyIDs. + +Before the TDX module can be used to create and run protected VMs, it +must be loaded into the isolated range and properly initialized. The TDX +architecture doesn't require the BIOS to load the TDX module, but the +kernel assumes it is loaded by the BIOS. + +TDX boot-time detection +----------------------- + +The kernel detects TDX by detecting TDX private KeyIDs during kernel +boot. Below dmesg shows when TDX is enabled by BIOS:: + + [..] tdx: BIOS enabled: private KeyID range: [16, 64). + +TDX module detection and initialization +--------------------------------------- + +There is no CPUID or MSR to detect the TDX module. The kernel detects it +by initializing it. + +The kernel talks to the TDX module via the new SEAMCALL instruction. The +TDX module implements SEAMCALL leaf functions to allow the kernel to +initialize it. + +Initializing the TDX module consumes roughly ~1/256th system RAM size to +use it as 'metadata' for the TDX memory. It also takes additional CPU +time to initialize those metadata along with the TDX module itself. Both +are not trivial. The kernel initializes the TDX module at runtime on +demand. The caller to call tdx_enable() to initialize the TDX module:: + + ret = tdx_enable(); + if (ret) + goto no_tdx; + // TDX is ready to use + +One step of initializing the TDX module requires at least one online cpu +for each package. The caller needs to guarantee this otherwise the +initialization will fail. + +Making SEAMCALL requires the CPU already being in VMX operation (VMXON +has been done). For now tdx_enable() doesn't handle VMXON internally, +but depends on the caller to guarantee that. So far only KVM calls +tdx_enable() and KVM already handles VMXON. + +User can consult dmesg to see the presence of the TDX module, and whether +it has been initialized. + +If the TDX module is not loaded, dmesg shows below:: + + [..] tdx: TDX module is not loaded. + +If the TDX module is initialized successfully, dmesg shows something +like below:: + + [..] tdx: TDX module: attributes 0x0, vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160 + [..] tdx: 65667 pages allocated for PAMT. + [..] tdx: TDX module initialized. + +If the TDX module failed to initialize, dmesg also shows it failed to +initialize:: + + [..] tdx: initialization failed ... + +TDX Interaction to Other Kernel Components +------------------------------------------ + +TDX Memory Policy +~~~~~~~~~~~~~~~~~ + +TDX reports a list of "Convertible Memory Region" (CMR) to tell the +kernel which memory is TDX compatible. The kernel needs to build a list +of memory regions (out of CMRs) as "TDX-usable" memory and pass those +regions to the TDX module. Once this is done, those "TDX-usable" memory +regions are fixed during module's lifetime. + +To keep things simple, currently the kernel simply guarantees all pages +in the page allocator are TDX memory. Specifically, the kernel uses all +system memory in the core-mm at the time of initializing the TDX module +as TDX memory, and in the meantime, refuses to online any non-TDX-memory +in the memory hotplug. + +This can be enhanced in the future, i.e. by allowing adding non-TDX +memory to a separate NUMA node. In this case, the "TDX-capable" nodes +and the "non-TDX-capable" nodes can co-exist, but the kernel/userspace +needs to guarantee memory pages for TDX guests are always allocated from +the "TDX-capable" nodes. + +Physical Memory Hotplug +~~~~~~~~~~~~~~~~~~~~~~~ + +Note TDX assumes convertible memory is always physically present during +machine's runtime. A non-buggy BIOS should never support hot-removal of +any convertible memory. This implementation doesn't handle ACPI memory +removal but depends on the BIOS to behave correctly. + +CPU Hotplug +~~~~~~~~~~~ + +TDX doesn't support physical (ACPI) CPU hotplug. During machine boot, +TDX verifies all boot-time present logical CPUs are TDX compatible before +enabling TDX. A non-buggy BIOS should never support hot-add/removal of +physical CPU. Currently the kernel doesn't handle physical CPU hotplug, +but depends on the BIOS to behave correctly. + +Note TDX works with CPU logical online/offline, thus the kernel still +allows to offline logical CPU and online it again. + +Kexec() +~~~~~~~ + +There are two problems in terms of using kexec() to boot to a new kernel +when the old kernel has enabled TDX: 1) Part of the memory pages are +still TDX private pages (i.e. metadata used by the TDX module, and any +TDX guest memory if kexec() is executed when there's live TDX guests). +2) There might be dirty cachelines associated with TDX private pages. + +Because the hardware doesn't guarantee cache coherency among different +KeyIDs, the old kernel needs to flush cache (of TDX private pages) +before booting to the new kernel. Also, the kernel doesn't convert all +TDX private pages back to normal because of below considerations: + +1) Neither the kernel nor the TDX module has existing infrastructure to + track which pages are TDX private page. +2) The number of TDX private pages can be large, and converting all of + them (cache flush + using MOVDIR64B to clear the page) can be time + consuming. +3) The new kernel will almost only use KeyID 0 to access memory. KeyID + 0 doesn't support integrity-check, so it's OK. +4) The kernel doesn't (and may never) support MKTME. If any 3rd party + kernel ever supports MKTME, it can/should do MOVDIR64B to clear the + page with the new MKTME KeyID (just like TDX does) before using it. + +TDX Guest Support +================= Since the host cannot directly access guest registers or memory, much normal functionality of a hypervisor must be moved into the guest. This is implemented using a Virtualization Exception (#VE) that is handled by the @@ -20,7 +167,7 @@ TDX includes new hypercall-like mechanisms for communicating from the guest to the hypervisor or the TDX module. New TDX Exceptions -================== +------------------ TDX guests behave differently from bare-metal and traditional VMX guests. In TDX guests, otherwise normal instructions or memory accesses can cause @@ -30,7 +177,7 @@ Instructions marked with an '*' conditionally cause exceptions. The details for these instructions are discussed below. Instruction-based #VE ---------------------- +~~~~~~~~~~~~~~~~~~~~~ - Port I/O (INS, OUTS, IN, OUT) - HLT @@ -41,7 +188,7 @@ Instruction-based #VE - CPUID* Instruction-based #GP ---------------------- +~~~~~~~~~~~~~~~~~~~~~ - All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH, VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON @@ -52,7 +199,7 @@ Instruction-based #GP - RDMSR*,WRMSR* RDMSR/WRMSR Behavior --------------------- +~~~~~~~~~~~~~~~~~~~~ MSR access behavior falls into three categories: @@ -73,7 +220,7 @@ trapping and handling in the TDX module. Other than possibly being slow, these MSRs appear to function just as they would on bare metal. CPUID Behavior --------------- +~~~~~~~~~~~~~~ For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID return values (in guest EAX/EBX/ECX/EDX) are configurable by the @@ -93,7 +240,7 @@ not know how to handle. The guest kernel may ask the hypervisor for the value with a hypercall. #VE on Memory Accesses -====================== +---------------------- There are essentially two classes of TDX memory: private and shared. Private memory receives full TDX protections. Its content is protected @@ -107,7 +254,7 @@ entries. This helps ensure that a guest does not place sensitive information in shared memory, exposing it to the untrusted hypervisor. #VE on Shared Memory --------------------- +~~~~~~~~~~~~~~~~~~~~ Access to shared mappings can cause a #VE. The hypervisor ultimately controls whether a shared memory access causes a #VE, so the guest must be @@ -127,7 +274,7 @@ be careful not to access device MMIO regions unless it is also prepared to handle a #VE. #VE on Private Pages --------------------- +~~~~~~~~~~~~~~~~~~~~ An access to private mappings can also cause a #VE. Since all kernel memory is also private memory, the kernel might theoretically need to @@ -145,7 +292,7 @@ The hypervisor is permitted to unilaterally move accepted pages to a to handle the exception. Linux #VE handler -================= +----------------- Just like page faults or #GP's, #VE exceptions can be either handled or be fatal. Typically, an unhandled userspace #VE results in a SIGSEGV. @@ -167,7 +314,7 @@ While the block is in place, any #VE is elevated to a double fault (#DF) which is not recoverable. MMIO handling -============= +------------- In non-TDX VMs, MMIO is usually implemented by giving a guest access to a mapping which will cause a VMEXIT on access, and then the hypervisor @@ -189,7 +336,7 @@ MMIO access via other means (like structure overlays) may result in an oops. Shared Memory Conversions -========================= +------------------------- All TDX guest memory starts out as private at boot. This memory can not be accessed by the hypervisor. However, some kernel users like device