From patchwork Wed Apr  6 04:49:13 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@intel.com>
X-Patchwork-Id: 12803717
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 342ECC433FE
	for <kvm@archiver.kernel.org>; Wed,  6 Apr 2022 16:05:00 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S236840AbiDFQG4 (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 6 Apr 2022 12:06:56 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56924 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S237094AbiDFQG3 (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 6 Apr 2022 12:06:29 -0400
Received: from mga18.intel.com (mga18.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5020444DCF3;
        Tue,  5 Apr 2022 21:49:49 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1649220590; x=1680756590;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=wso/JihE0Lkn+Tpp3hqJJYVPBq5LekT3PvHWwFWkoqs=;
  b=PcfQ/1CObE6Y2xXrzWZIzeUQQTJgizaffED9jHEiRqjQXGwMxhagSpFJ
   d8ofaPGKo8703k2cKvzINKbJw4fJUaIs9/9+r8P0ZiUjDoo8SUfu1FkTn
   P1SvXGBFsx4TM4NXLg4b2+WtHPaYMOohmeJBqB02kUXj9ilze1Y601qPJ
   neloQCoht6+EA59YAUgxD5uR8gJ9exoGp62TnwKcDr8B719ZboWFzupdF
   pDBUxHxlnZxI0XR7g+9uYPAweOVeJkXHGeL0jYwh5uxBcj+WP9LwvchiG
   PQ5p2RCPqL4/mFmdpe/rjFBZgrdfnem99zAuLrNWL86dP8dM9Pms9RPCS
   Q==;
X-IronPort-AV: E=McAfee;i="6200,9189,10308"; a="243089765"
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="243089765"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:49:49 -0700
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="524302089"
Received: from dchang1-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.254.29.17])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:49:45 -0700
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
        len.brown@intel.com, tony.luck@intel.com,
        rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
        dan.j.williams@intel.com, peterz@infradead.org, ak@linux.intel.com,
        kirill.shutemov@linux.intel.com,
        sathyanarayanan.kuppuswamy@linux.intel.com,
        isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v3 01/21] x86/virt/tdx: Detect SEAM
Date: Wed,  6 Apr 2022 16:49:13 +1200
Message-Id: 
 <ab118fb9bd39b200feb843660a9b10421943aa70.1649219184.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <cover.1649219184.git.kai.huang@intel.com>
References: <cover.1649219184.git.kai.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

Intel Trusted Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks.  To support TDX, a new CPU mode called
Secure Arbitration Mode (SEAM) is added to Intel processors.

SEAM is an extension to the VMX architecture to define a new VMX root
operation (SEAM VMX root) and a new VMX non-root operation (SEAM VMX
non-root).  SEAM VMX root operation is designed to host a CPU-attested
software module called the 'TDX module' which implements functions to
manage crypto-protected VMs called Trust Domains (TD).  It is also
designed to additionally host a CPU-attested software module called the
'Intel Persistent SEAMLDR (Intel P-SEAMLDR)' to load and update the TDX
module.

Software modules in SEAM VMX root run in a memory region defined by the
SEAM range register (SEAMRR).  So the first thing of detecting Intel TDX
is to detect the validity of SEAMRR.

The presence of SEAMRR is reported via a new SEAMRR bit (15) of the
IA32_MTRRCAP MSR.  The SEAMRR range registers consist of a pair of MSRs:

        IA32_SEAMRR_PHYS_BASE and IA32_SEAMRR_PHYS_MASK

BIOS is expected to configure SEAMRR with the same value across all
cores.  In case of BIOS misconfiguration, detect and compare SEAMRR
on all cpus.

TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
crypto-protect TD guests.  Part of MKTME KeyIDs are reserved as "TDX
private KeyID" or "TDX KeyIDs" for short.  Similar to detecting SEAMRR,
detecting TDX private KeyIDs also needs to be done on all cpus to detect
any BIOS misconfiguration.

To start to support TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for
TDX host kernel support.  Add a function to detect all TDX preliminaries
(SEAMRR, TDX private KeyIDs) for a given cpu when it is brought up.  As
the first step, detect the validity of SEAMRR.

Also add a new Kconfig option CONFIG_INTEL_TDX_HOST to opt-in TDX host
kernel support (to distinguish with TDX guest kernel support).

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/Kconfig               |  12 ++++
 arch/x86/Makefile              |   2 +
 arch/x86/include/asm/tdx.h     |   9 +++
 arch/x86/kernel/cpu/intel.c    |   3 +
 arch/x86/virt/Makefile         |   2 +
 arch/x86/virt/vmx/Makefile     |   2 +
 arch/x86/virt/vmx/tdx/Makefile |   2 +
 arch/x86/virt/vmx/tdx/tdx.c    | 102 +++++++++++++++++++++++++++++++++
 8 files changed, 134 insertions(+)
 create mode 100644 arch/x86/virt/Makefile
 create mode 100644 arch/x86/virt/vmx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/Makefile
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7021ec725dd3..9113bf09f358 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1967,6 +1967,18 @@ config X86_SGX
 
 	  If unsure, say N.
 
+config INTEL_TDX_HOST
+	bool "Intel Trust Domain Extensions (TDX) host support"
+	default n
+	depends on CPU_SUP_INTEL
+	depends on X86_64
+	help
+	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
+	  host and certain physical attacks.  This option enables necessary TDX
+	  support in host kernel to run protected VMs.
+
+	  If unsure, say N.
+
 config EFI
 	bool "EFI runtime service support"
 	depends on ACPI
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 63d50f65b828..2ca3a2a36dc5 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -234,6 +234,8 @@ head-y += arch/x86/kernel/platform-quirks.o
 
 libs-y  += arch/x86/lib/
 
+core-y += arch/x86/virt/
+
 # drivers-y are linked after core-y
 drivers-$(CONFIG_MATH_EMULATION) += arch/x86/math-emu/
 drivers-$(CONFIG_PCI)            += arch/x86/pci/
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 020c81a7c729..1f29813b1646 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -20,6 +20,8 @@
 
 #ifndef __ASSEMBLY__
 
+#include <asm/processor.h>
+
 /*
  * Used to gather the output registers values of the TDCALL and SEAMCALL
  * instructions when requesting services from the TDX module.
@@ -87,5 +89,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 	return -ENODEV;
 }
 #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
+
+#ifdef CONFIG_INTEL_TDX_HOST
+void tdx_detect_cpu(struct cpuinfo_x86 *c);
+#else
+static inline void tdx_detect_cpu(struct cpuinfo_x86 *c) { }
+#endif /* CONFIG_INTEL_TDX_HOST */
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_TDX_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 8321c43554a1..b142a640fb8e 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -26,6 +26,7 @@
 #include <asm/resctrl.h>
 #include <asm/numa.h>
 #include <asm/thermal.h>
+#include <asm/tdx.h>
 
 #ifdef CONFIG_X86_64
 #include <linux/topology.h>
@@ -715,6 +716,8 @@ static void init_intel(struct cpuinfo_x86 *c)
 	if (cpu_has(c, X86_FEATURE_TME))
 		detect_tme(c);
 
+	tdx_detect_cpu(c);
+
 	init_intel_misc_features(c);
 
 	if (tsx_ctrl_state == TSX_CTRL_ENABLE)
diff --git a/arch/x86/virt/Makefile b/arch/x86/virt/Makefile
new file mode 100644
index 000000000000..1e36502cd738
--- /dev/null
+++ b/arch/x86/virt/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y	+= vmx/
diff --git a/arch/x86/virt/vmx/Makefile b/arch/x86/virt/vmx/Makefile
new file mode 100644
index 000000000000..1e1fcd7d3bd1
--- /dev/null
+++ b/arch/x86/virt/vmx/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y	+= tdx/
diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
new file mode 100644
index 000000000000..1bd688684716
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx.o
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
new file mode 100644
index 000000000000..03f35c75f439
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright(c) 2022 Intel Corporation.
+ *
+ * Intel Trusted Domain Extensions (TDX) support
+ */
+
+#define pr_fmt(fmt)	"tdx: " fmt
+
+#include <linux/types.h>
+#include <linux/cpumask.h>
+#include <asm/msr-index.h>
+#include <asm/msr.h>
+#include <asm/cpufeature.h>
+#include <asm/cpufeatures.h>
+#include <asm/tdx.h>
+
+/* Support Intel Secure Arbitration Mode Range Registers (SEAMRR) */
+#define MTRR_CAP_SEAMRR			BIT(15)
+
+/* Core-scope Intel SEAMRR base and mask registers. */
+#define MSR_IA32_SEAMRR_PHYS_BASE	0x00001400
+#define MSR_IA32_SEAMRR_PHYS_MASK	0x00001401
+
+#define SEAMRR_PHYS_BASE_CONFIGURED	BIT_ULL(3)
+#define SEAMRR_PHYS_MASK_ENABLED	BIT_ULL(11)
+#define SEAMRR_PHYS_MASK_LOCKED		BIT_ULL(10)
+
+#define SEAMRR_ENABLED_BITS	\
+	(SEAMRR_PHYS_MASK_ENABLED | SEAMRR_PHYS_MASK_LOCKED)
+
+/* BIOS must configure SEAMRR registers for all cores consistently */
+static u64 seamrr_base, seamrr_mask;
+
+static bool __seamrr_enabled(void)
+{
+	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
+}
+
+static void detect_seam_bsp(struct cpuinfo_x86 *c)
+{
+	u64 mtrrcap, base, mask;
+
+	/* SEAMRR is reported via MTRRcap */
+	if (!boot_cpu_has(X86_FEATURE_MTRR))
+		return;
+
+	rdmsrl(MSR_MTRRcap, mtrrcap);
+	if (!(mtrrcap & MTRR_CAP_SEAMRR))
+		return;
+
+	rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
+	if (!(base & SEAMRR_PHYS_BASE_CONFIGURED)) {
+		pr_info("SEAMRR base is not configured by BIOS\n");
+		return;
+	}
+
+	rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
+	if ((mask & SEAMRR_ENABLED_BITS) != SEAMRR_ENABLED_BITS) {
+		pr_info("SEAMRR is not enabled by BIOS\n");
+		return;
+	}
+
+	seamrr_base = base;
+	seamrr_mask = mask;
+}
+
+static void detect_seam_ap(struct cpuinfo_x86 *c)
+{
+	u64 base, mask;
+
+	/*
+	 * Don't bother to detect this AP if SEAMRR is not
+	 * enabled after earlier detections.
+	 */
+	if (!__seamrr_enabled())
+		return;
+
+	rdmsrl(MSR_IA32_SEAMRR_PHYS_BASE, base);
+	rdmsrl(MSR_IA32_SEAMRR_PHYS_MASK, mask);
+
+	if (base == seamrr_base && mask == seamrr_mask)
+		return;
+
+	pr_err("Inconsistent SEAMRR configuration by BIOS\n");
+	/* Mark SEAMRR as disabled. */
+	seamrr_base = 0;
+	seamrr_mask = 0;
+}
+
+static void detect_seam(struct cpuinfo_x86 *c)
+{
+	if (c == &boot_cpu_data)
+		detect_seam_bsp(c);
+	else
+		detect_seam_ap(c);
+}
+
+void tdx_detect_cpu(struct cpuinfo_x86 *c)
+{
+	detect_seam(c);
+}

From patchwork Wed Apr  6 04:49:14 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@intel.com>
X-Patchwork-Id: 12803718
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 012C6C433F5
	for <kvm@archiver.kernel.org>; Wed,  6 Apr 2022 16:05:18 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S237163AbiDFQHO (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 6 Apr 2022 12:07:14 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55840 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S237146AbiDFQG3 (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 6 Apr 2022 12:06:29 -0400
Received: from mga18.intel.com (mga18.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5E93244DCFE;
        Tue,  5 Apr 2022 21:49:53 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1649220594; x=1680756594;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=Q3oI3RrMki7HmPelaPOBTibAYke76Zj7jDIb9uCW4Yw=;
  b=gS2G5Tv4+nSfbp5z4HW1CN4WQ2GrXS66evZ5F/yQLoDUiuKNVnGO/dYA
   9yjJr4Yb6rqmMc1AnJXtWYoFg8tsNDGxxh8+I1wu7fndb9QI5W/3GfE7C
   BXA8Tjr8NRXHjdzeQhe1e7MY3G+vg0zp6LdjHi7zK2ap8hz6GoMbEaTdw
   EG6VaqKj7v/55oV5Zx9tucE95UmtwvfVHyqdO6EhQ36voZbsZXisUex2R
   XmF9WY5SX3W/9jqwqLgm4sCbnKzEM8qk9oDEOXAQg9ykR1V3Ydbcbhcei
   N5aoI0Z91ZhOU41ve3KRl4NbLXKCd72L0sH5E8cajinV0cEyDgtihxsuF
   A==;
X-IronPort-AV: E=McAfee;i="6200,9189,10308"; a="243089771"
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="243089771"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:49:53 -0700
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="524302102"
Received: from dchang1-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.254.29.17])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:49:49 -0700
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
        len.brown@intel.com, tony.luck@intel.com,
        rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
        dan.j.williams@intel.com, peterz@infradead.org, ak@linux.intel.com,
        kirill.shutemov@linux.intel.com,
        sathyanarayanan.kuppuswamy@linux.intel.com,
        isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v3 02/21] x86/virt/tdx: Detect TDX private KeyIDs
Date: Wed,  6 Apr 2022 16:49:14 +1200
Message-Id: 
 <2fb62e93734163d2f367fd44e3335cd8a2bf2995.1649219184.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <cover.1649219184.git.kai.huang@intel.com>
References: <cover.1649219184.git.kai.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

Pre-TDX Intel hardware has support for a memory encryption architecture
called MKTME.  The memory encryption hardware underpinning MKTME is also
used for Intel TDX.  TDX ends up "stealing" some of the physical address
space from the MKTME architecture for crypto-protection to VMs.

A new MSR (IA32_MKTME_KEYID_PARTITIONING) helps to enumerate how MKTME-
enumerated "KeyID" space is distributed between TDX and legacy MKTME.
KeyIDs reserved for TDX are called 'TDX private KeyIDs' or 'TDX KeyIDs'
for short.

The new MSR is per package and BIOS is responsible for partitioning
MKTME KeyIDs and TDX KeyIDs consistently among all packages.

Detect TDX private KeyIDs as a preparation to initialize TDX.  Similar
to detecting SEAMRR, detect on all cpus to detect any potential BIOS
misconfiguration among packages.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 72 +++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 03f35c75f439..ba2210001ea8 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -29,9 +29,28 @@
 #define SEAMRR_ENABLED_BITS	\
 	(SEAMRR_PHYS_MASK_ENABLED | SEAMRR_PHYS_MASK_LOCKED)
 
+/*
+ * Intel Trusted Domain CPU Architecture Extension spec:
+ *
+ * IA32_MKTME_KEYID_PARTIONING:
+ *
+ *   Bit [31:0]: number of MKTME KeyIDs.
+ *   Bit [63:32]: number of TDX private KeyIDs.
+ *
+ * TDX private KeyIDs start after the last MKTME KeyID.
+ */
+#define MSR_IA32_MKTME_KEYID_PARTITIONING	0x00000087
+
+#define TDX_KEYID_START(_keyid_part)	\
+		((u32)(((_keyid_part) & 0xffffffffull) + 1))
+#define TDX_KEYID_NUM(_keyid_part)	((u32)((_keyid_part) >> 32))
+
 /* BIOS must configure SEAMRR registers for all cores consistently */
 static u64 seamrr_base, seamrr_mask;
 
+static u32 tdx_keyid_start;
+static u32 tdx_keyid_num;
+
 static bool __seamrr_enabled(void)
 {
 	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
@@ -96,7 +115,60 @@ static void detect_seam(struct cpuinfo_x86 *c)
 		detect_seam_ap(c);
 }
 
+static void detect_tdx_keyids_bsp(struct cpuinfo_x86 *c)
+{
+	u64 keyid_part;
+
+	/* TDX is built on MKTME, which is based on TME */
+	if (!boot_cpu_has(X86_FEATURE_TME))
+		return;
+
+	if (rdmsrl_safe(MSR_IA32_MKTME_KEYID_PARTITIONING, &keyid_part))
+		return;
+
+	/* If MSR value is 0, TDX is not enabled by BIOS. */
+	if (!keyid_part)
+		return;
+
+	tdx_keyid_num = TDX_KEYID_NUM(keyid_part);
+	if (!tdx_keyid_num)
+		return;
+
+	tdx_keyid_start = TDX_KEYID_START(keyid_part);
+}
+
+static void detect_tdx_keyids_ap(struct cpuinfo_x86 *c)
+{
+	u64 keyid_part;
+
+	/*
+	 * Don't bother to detect this AP if TDX KeyIDs are
+	 * not detected or cleared after earlier detections.
+	 */
+	if (!tdx_keyid_num)
+		return;
+
+	rdmsrl(MSR_IA32_MKTME_KEYID_PARTITIONING, keyid_part);
+
+	if ((tdx_keyid_start == TDX_KEYID_START(keyid_part)) &&
+			(tdx_keyid_num == TDX_KEYID_NUM(keyid_part)))
+		return;
+
+	pr_err("Inconsistent TDX KeyID configuration among packages by BIOS\n");
+	tdx_keyid_start = 0;
+	tdx_keyid_num = 0;
+}
+
+static void detect_tdx_keyids(struct cpuinfo_x86 *c)
+{
+	if (c == &boot_cpu_data)
+		detect_tdx_keyids_bsp(c);
+	else
+		detect_tdx_keyids_ap(c);
+}
+
 void tdx_detect_cpu(struct cpuinfo_x86 *c)
 {
 	detect_seam(c);
+	detect_tdx_keyids(c);
 }

From patchwork Wed Apr  6 04:49:15 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@intel.com>
X-Patchwork-Id: 12803719
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id C745AC433F5
	for <kvm@archiver.kernel.org>; Wed,  6 Apr 2022 16:05:23 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S237210AbiDFQHT (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 6 Apr 2022 12:07:19 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52010 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S237147AbiDFQG3 (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 6 Apr 2022 12:06:29 -0400
Received: from mga18.intel.com (mga18.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 10A0944E582;
        Tue,  5 Apr 2022 21:49:57 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1649220598; x=1680756598;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=WHxeJqE3D4iwmpDphKbul2MIPxVCEPcU/ycrWHIMsbA=;
  b=V3+e5A8HTXcp5sUz3yKTJLzDkKXxTqaXfHBo8TsX0Ney1m8BKRbotx/V
   mCzBi6X52BF6KGHERVHy2AjBG34h5WiGwmhWbw4kFjuEKAoL7fuTBjhOM
   taFdlW1A7JlYNYmD6R4smEj7TO9FGXNpKABpVYnYh8ymJQhMBV6MNN8nA
   8YVq9Ys0QVc2rWFw/fFWLd76S/19rK7EsVjJnjeQR2Emb1WC9rYAHlkpI
   ZIMSwv0x7Cq6oo3g+75v8SXRNoQw+oX2kaQKLra62UlJsQj4QdLdX1BSm
   on5z8JYlCMB4yScjtfBaceN0nWvHRvbkpNpAnCuwb/nl5hhGaOgOlNrJ1
   w==;
X-IronPort-AV: E=McAfee;i="6200,9189,10308"; a="243089776"
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="243089776"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:49:56 -0700
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="524302107"
Received: from dchang1-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.254.29.17])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:49:53 -0700
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
        len.brown@intel.com, tony.luck@intel.com,
        rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
        dan.j.williams@intel.com, peterz@infradead.org, ak@linux.intel.com,
        kirill.shutemov@linux.intel.com,
        sathyanarayanan.kuppuswamy@linux.intel.com,
        isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v3 03/21] x86/virt/tdx: Implement the SEAMCALL base function
Date: Wed,  6 Apr 2022 16:49:15 +1200
Message-Id: 
 <1c3f555934c73301a9cbf10232500f3d15efe3cc.1649219184.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <cover.1649219184.git.kai.huang@intel.com>
References: <cover.1649219184.git.kai.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

Secure Arbitration Mode (SEAM) is an extension of VMX architecture.  It
defines a new VMX root operation (SEAM VMX root) and a new VMX non-root
operation (SEAM VMX non-root) which are isolated from legacy VMX root
and VMX non-root mode.

A CPU-attested software module (called the 'TDX module') runs in SEAM
VMX root to manage the crypto-protected VMs running in SEAM VMX non-root.
SEAM VMX root is also used to host another CPU-attested software module
(called the 'P-SEAMLDR') to load and update the TDX module.

Host kernel transits to either the P-SEAMLDR or the TDX module via the
new SEAMCALL instruction.  SEAMCALL leaf functions are host-side
interface functions defined by the P-SEAMLDR and the TDX module around
the new SEAMCALL instruction.  They are similar to a hypercall, except
they are made by host kernel to the SEAM software.

SEAMCALL leaf functions use an ABI different from the x86-64 system-v
ABI.  Instead, they share the same ABI with the TDCALL leaf functions.
%rax is used to carry both the SEAMCALL leaf function number (input) and
the completion status code (output).  Additional GPRs (%rcx, %rdx,
%r8->%r11) may be further used as both input and output operands in
individual leaf functions.

Implement a C function __seamcall() to do SEAMCALL leaf functions using
the assembly macro used by __tdx_module_call() (the implementation of
TDCALL leaf functions).  The only exception not covered here is TDENTER
leaf function which takes all GPRs and XMM0-XMM15 as both input and
output.  The caller of TDENTER should implement its own logic to call
TDENTER directly instead of using this function.

SEAMCALL instruction is essentially a VMExit from VMX root to SEAM VMX
root, and it can fail with VMfailInvalid, for instance, when the SEAM
software module is not loaded.  The C function __seamcall() returns
TDX_SEAMCALL_VMFAILINVALID, which doesn't conflict with any actual error
code of SEAMCALLs, to uniquely represent this case.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/Makefile   |  2 +-
 arch/x86/virt/vmx/tdx/seamcall.S | 52 ++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h      | 11 +++++++
 3 files changed, 64 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/virt/vmx/tdx/seamcall.S
 create mode 100644 arch/x86/virt/vmx/tdx/tdx.h

diff --git a/arch/x86/virt/vmx/tdx/Makefile b/arch/x86/virt/vmx/tdx/Makefile
index 1bd688684716..fd577619620e 100644
--- a/arch/x86/virt/vmx/tdx/Makefile
+++ b/arch/x86/virt/vmx/tdx/Makefile
@@ -1,2 +1,2 @@
 # SPDX-License-Identifier: GPL-2.0-only
-obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx.o
+obj-$(CONFIG_INTEL_TDX_HOST)	+= tdx.o seamcall.o
diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
new file mode 100644
index 000000000000..327961b2dd5a
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/seamcall.S
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/linkage.h>
+#include <asm/frame.h>
+
+#include "tdxcall.S"
+
+/*
+ * __seamcall()  - Host-side interface functions to SEAM software module
+ *		   (the P-SEAMLDR or the TDX module)
+ *
+ * Transform function call register arguments into the SEAMCALL register
+ * ABI.  Return TDX_SEAMCALL_VMFAILINVALID, or the completion status of
+ * the SEAMCALL.  Additional output operands are saved in @out (if it is
+ * provided by caller).
+ *
+ *-------------------------------------------------------------------------
+ * SEAMCALL ABI:
+ *-------------------------------------------------------------------------
+ * Input Registers:
+ *
+ * RAX                 - SEAMCALL Leaf number.
+ * RCX,RDX,R8-R9       - SEAMCALL Leaf specific input registers.
+ *
+ * Output Registers:
+ *
+ * RAX                 - SEAMCALL completion status code.
+ * RCX,RDX,R8-R11      - SEAMCALL Leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * __seamcall() function ABI:
+ *
+ * @fn  (RDI)          - SEAMCALL Leaf number, moved to RAX
+ * @rcx (RSI)          - Input parameter 1, moved to RCX
+ * @rdx (RDX)          - Input parameter 2, moved to RDX
+ * @r8  (RCX)          - Input parameter 3, moved to R8
+ * @r9  (R8)           - Input parameter 4, moved to R9
+ *
+ * @out (R9)           - struct tdx_module_output pointer
+ *			 stored temporarily in R12 (not
+ *			 used by the P-SEAMLDR or the TDX
+ *			 module). It can be NULL.
+ *
+ * Return (via RAX) the completion status of the SEAMCALL, or
+ * TDX_SEAMCALL_VMFAILINVALID.
+ */
+SYM_FUNC_START(__seamcall)
+	FRAME_BEGIN
+	TDX_MODULE_CALL host=1
+	FRAME_END
+	ret
+SYM_FUNC_END(__seamcall)
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
new file mode 100644
index 000000000000..9d5b6f554c20
--- /dev/null
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _X86_VIRT_TDX_H
+#define _X86_VIRT_TDX_H
+
+#include <linux/types.h>
+
+struct tdx_module_output;
+u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+	       struct tdx_module_output *out);
+
+#endif

From patchwork Wed Apr  6 04:49:16 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@intel.com>
X-Patchwork-Id: 12803716
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 3F97CC433F5
	for <kvm@archiver.kernel.org>; Wed,  6 Apr 2022 16:04:45 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S237045AbiDFQGm (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 6 Apr 2022 12:06:42 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51830 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S237044AbiDFQGG (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 6 Apr 2022 12:06:06 -0400
Received: from mga18.intel.com (mga18.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 29BB344E58F;
        Tue,  5 Apr 2022 21:50:01 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1649220602; x=1680756602;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=rEIhYcTJLuTC3eAzOL6jHXASDcVWPWVlc0lPGEOqWdc=;
  b=iyToUJNNH2E0tvhlT+XJqZhvv0w4m6Pa7NwmzHM+L1x7eX6gI4KRipIC
   VhtA0rD4cxW0F5I7QcpTokFgmmbGvVjryhrrQGPh/XFHQ/pxBm4FB6nBs
   WFSDdqanOO1pospxbw2/cAm+WF5Q4nUgcsq4A1PoLc7qK5rb5esfsawvm
   fCdrFj/87laNvIULGtXB9vWObzRL1SXl18UEi9NohbrYK7MazFe78uimb
   lZrPJjFGFi71FxxModQXFH99RjZacv7ohNCWNoOCaJlRsGdOaNCQjtI73
   DbY0MmVAp3uadZnuKHnr0GP9J/6rr0CGWazN/QJraZD5mjLxTXztsu7xH
   Q==;
X-IronPort-AV: E=McAfee;i="6200,9189,10308"; a="243089786"
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="243089786"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:01 -0700
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="524302116"
Received: from dchang1-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.254.29.17])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:49:57 -0700
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
        len.brown@intel.com, tony.luck@intel.com,
        rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
        dan.j.williams@intel.com, peterz@infradead.org, ak@linux.intel.com,
        kirill.shutemov@linux.intel.com,
        sathyanarayanan.kuppuswamy@linux.intel.com,
        isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v3 04/21] x86/virt/tdx: Add skeleton for detecting and
 initializing TDX on demand
Date: Wed,  6 Apr 2022 16:49:16 +1200
Message-Id: 
 <32dcf4c7acc95244a391458d79cd6907125c5c29.1649219184.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <cover.1649219184.git.kai.huang@intel.com>
References: <cover.1649219184.git.kai.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

The TDX module is essentially a CPU-attested software module running
in the new Secure Arbitration Mode (SEAM) to protect VMs from malicious
host and certain physical attacks.  The TDX module implements the
functions to build, tear down and start execution of the protected VMs
called Trusted Domains (TD).  Before the TDX module can be used to
create and run TD guests, it must be loaded into the SEAM Range Register
(SEAMRR) and properly initialized.  The TDX module is expected to be
loaded by BIOS before booting to the kernel, and the kernel is expected
to detect and initialize it, using the SEAMCALLs defined by TDX
architecture.

The TDX module can be initialized only once in its lifetime.  Instead
of always initializing it at boot time, this implementation chooses an
on-demand approach to initialize TDX until there is a real need (e.g
when requested by KVM).  This avoids consuming the memory that must be
allocated by kernel and given to the TDX module as metadata (~1/256th of
the TDX-usable memory), and also saves the time of initializing the TDX
module (and the metadata) when TDX is not used at all.  Initializing the
TDX module at runtime on-demand also is more flexible to support TDX
module runtime updating in the future (after updating the TDX module, it
needs to be initialized again).

Introduce two placeholders tdx_detect() and tdx_init() to detect and
initialize the TDX module on demand, with a state machine introduced to
orchestrate the entire process (in case of multiple callers).

To start with, tdx_detect() checks SEAMRR and TDX private KeyIDs.  The
TDX module is reported as not loaded if either SEAMRR is not enabled, or
there are no enough TDX private KeyIDs to create any TD guest.  The TDX
module itself requires one global TDX private KeyID to crypto protect
its metadata.

And tdx_init() is currently empty.  The TDX module will be initialized
in multi-steps defined by the TDX architecture:

  1) Global initialization;
  2) Logical-CPU scope initialization;
  3) Enumerate the TDX module capabilities and platform configuration;
  4) Configure the TDX module about usable memory ranges and global
     KeyID information;
  5) Package-scope configuration for the global KeyID;
  6) Initialize usable memory ranges based on 4).

The TDX module can also be shut down at any time during its lifetime.
In case of any error during the initialization process, shut down the
module.  It's pointless to leave the module in any intermediate state
during the initialization.

SEAMCALL requires SEAMRR being enabled and CPU being already in VMX
operation (VMXON has been done), otherwise it generates #UD.  So far
only KVM handles VMXON/VMXOFF.  Choose to not handle VMXON/VMXOFF in
tdx_detect() and tdx_init() but depend on the caller to guarantee that,
since so far KVM is the only user of TDX.  In the long term, more kernel
components are likely to use VMXON/VMXOFF to support TDX (i.e. TDX
module runtime update), so a reference-based approach to do VMXON/VMXOFF
is likely needed.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/include/asm/tdx.h  |   4 +
 arch/x86/virt/vmx/tdx/tdx.c | 222 ++++++++++++++++++++++++++++++++++++
 2 files changed, 226 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 1f29813b1646..c8af2ba6bb8a 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -92,8 +92,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 
 #ifdef CONFIG_INTEL_TDX_HOST
 void tdx_detect_cpu(struct cpuinfo_x86 *c);
+int tdx_detect(void);
+int tdx_init(void);
 #else
 static inline void tdx_detect_cpu(struct cpuinfo_x86 *c) { }
+static inline int tdx_detect(void) { return -ENODEV; }
+static inline int tdx_init(void) { return -ENODEV; }
 #endif /* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index ba2210001ea8..53093d4ad458 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -9,6 +9,8 @@
 
 #include <linux/types.h>
 #include <linux/cpumask.h>
+#include <linux/mutex.h>
+#include <linux/cpu.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/cpufeature.h>
@@ -45,12 +47,33 @@
 		((u32)(((_keyid_part) & 0xffffffffull) + 1))
 #define TDX_KEYID_NUM(_keyid_part)	((u32)((_keyid_part) >> 32))
 
+/*
+ * TDX module status during initialization
+ */
+enum tdx_module_status_t {
+	/* TDX module status is unknown */
+	TDX_MODULE_UNKNOWN,
+	/* TDX module is not loaded */
+	TDX_MODULE_NONE,
+	/* TDX module is loaded, but not initialized */
+	TDX_MODULE_LOADED,
+	/* TDX module is fully initialized */
+	TDX_MODULE_INITIALIZED,
+	/* TDX module is shutdown due to error during initialization */
+	TDX_MODULE_SHUTDOWN,
+};
+
 /* BIOS must configure SEAMRR registers for all cores consistently */
 static u64 seamrr_base, seamrr_mask;
 
 static u32 tdx_keyid_start;
 static u32 tdx_keyid_num;
 
+static enum tdx_module_status_t tdx_module_status;
+
+/* Prevent concurrent attempts on TDX detection and initialization */
+static DEFINE_MUTEX(tdx_module_lock);
+
 static bool __seamrr_enabled(void)
 {
 	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
@@ -172,3 +195,202 @@ void tdx_detect_cpu(struct cpuinfo_x86 *c)
 	detect_seam(c);
 	detect_tdx_keyids(c);
 }
+
+static bool seamrr_enabled(void)
+{
+	/*
+	 * To detect any BIOS misconfiguration among cores, all logical
+	 * cpus must have been brought up at least once.  This is true
+	 * unless 'maxcpus' kernel command line is used to limit the
+	 * number of cpus to be brought up during boot time.  However
+	 * 'maxcpus' is basically an invalid operation mode due to the
+	 * MCE broadcast problem, and it should not be used on a TDX
+	 * capable machine.  Just do paranoid check here and do not
+	 * report SEAMRR as enabled in this case.
+	 */
+	if (!cpumask_equal(&cpus_booted_once_mask,
+					cpu_present_mask))
+		return false;
+
+	return __seamrr_enabled();
+}
+
+static bool tdx_keyid_sufficient(void)
+{
+	if (!cpumask_equal(&cpus_booted_once_mask,
+					cpu_present_mask))
+		return false;
+
+	/*
+	 * TDX requires at least two KeyIDs: one global KeyID to
+	 * protect the metadata of the TDX module and one or more
+	 * KeyIDs to run TD guests.
+	 */
+	return tdx_keyid_num >= 2;
+}
+
+static int __tdx_detect(void)
+{
+	/* The TDX module is not loaded if SEAMRR is disabled */
+	if (!seamrr_enabled()) {
+		pr_info("SEAMRR not enabled.\n");
+		goto no_tdx_module;
+	}
+
+	/*
+	 * Also do not report the TDX module as loaded if there's
+	 * no enough TDX private KeyIDs to run any TD guests.
+	 */
+	if (!tdx_keyid_sufficient()) {
+		pr_info("Number of TDX private KeyIDs too small: %u.\n",
+				tdx_keyid_num);
+		goto no_tdx_module;
+	}
+
+	/* Return -ENODEV until the TDX module is detected */
+no_tdx_module:
+	tdx_module_status = TDX_MODULE_NONE;
+	return -ENODEV;
+}
+
+static int init_tdx_module(void)
+{
+	/*
+	 * Return -EFAULT until all steps of TDX module
+	 * initialization are done.
+	 */
+	return -EFAULT;
+}
+
+static void shutdown_tdx_module(void)
+{
+	/* TODO: Shut down the TDX module */
+	tdx_module_status = TDX_MODULE_SHUTDOWN;
+}
+
+static int __tdx_init(void)
+{
+	int ret;
+
+	/*
+	 * Logical-cpu scope initialization requires calling one SEAMCALL
+	 * on all logical cpus enabled by BIOS.  Shutting down the TDX
+	 * module also has such requirement.  Further more, configuring
+	 * the key of the global KeyID requires calling one SEAMCALL for
+	 * each package.  For simplicity, disable CPU hotplug in the whole
+	 * initialization process.
+	 *
+	 * It's perhaps better to check whether all BIOS-enabled cpus are
+	 * online before starting initializing, and return early if not.
+	 * But none of 'possible', 'present' and 'online' CPU masks
+	 * represents BIOS-enabled cpus.  For example, 'possible' mask is
+	 * impacted by 'nr_cpus' or 'possible_cpus' kernel command line.
+	 * Just let the SEAMCALL to fail if not all BIOS-enabled cpus are
+	 * online.
+	 */
+	cpus_read_lock();
+
+	ret = init_tdx_module();
+
+	/*
+	 * Shut down the TDX module in case of any error during the
+	 * initialization process.  It's meaningless to leave the TDX
+	 * module in any middle state of the initialization process.
+	 */
+	if (ret)
+		shutdown_tdx_module();
+
+	cpus_read_unlock();
+
+	return ret;
+}
+
+/**
+ * tdx_detect - Detect whether the TDX module has been loaded
+ *
+ * Detect whether the TDX module has been loaded and ready for
+ * initialization.  Only call this function when all cpus are
+ * already in VMX operation.
+ *
+ * This function can be called in parallel by multiple callers.
+ *
+ * Return:
+ *
+ * * -0:	The TDX module has been loaded and ready for
+ *		initialization.
+ * * -ENODEV:	The TDX module is not loaded.
+ * * -EPERM:	CPU is not in VMX operation.
+ * * -EFAULT:	Other internal fatal errors.
+ */
+int tdx_detect(void)
+{
+	int ret;
+
+	mutex_lock(&tdx_module_lock);
+
+	switch (tdx_module_status) {
+	case TDX_MODULE_UNKNOWN:
+		ret = __tdx_detect();
+		break;
+	case TDX_MODULE_NONE:
+		ret = -ENODEV;
+		break;
+	case TDX_MODULE_LOADED:
+	case TDX_MODULE_INITIALIZED:
+		ret = 0;
+		break;
+	case TDX_MODULE_SHUTDOWN:
+		ret = -EFAULT;
+		break;
+	default:
+		WARN_ON(1);
+		ret = -EFAULT;
+	}
+
+	mutex_unlock(&tdx_module_lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tdx_detect);
+
+/**
+ * tdx_init - Initialize the TDX module
+ *
+ * Initialize the TDX module to make it ready to run TD guests.  This
+ * function should be called after tdx_detect() returns successful.
+ * Only call this function when all cpus are online and are in VMX
+ * operation.  CPU hotplug is temporarily disabled internally.
+ *
+ * This function can be called in parallel by multiple callers.
+ *
+ * Return:
+ *
+ * * -0:	The TDX module has been successfully initialized.
+ * * -ENODEV:	The TDX module is not loaded.
+ * * -EPERM:	The CPU which does SEAMCALL is not in VMX operation.
+ * * -EFAULT:	Other internal fatal errors.
+ */
+int tdx_init(void)
+{
+	int ret;
+
+	mutex_lock(&tdx_module_lock);
+
+	switch (tdx_module_status) {
+	case TDX_MODULE_NONE:
+		ret = -ENODEV;
+		break;
+	case TDX_MODULE_LOADED:
+		ret = __tdx_init();
+		break;
+	case TDX_MODULE_INITIALIZED:
+		ret = 0;
+		break;
+	default:
+		ret = -EFAULT;
+		break;
+	}
+	mutex_unlock(&tdx_module_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tdx_init);

From patchwork Wed Apr  6 04:49:17 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@intel.com>
X-Patchwork-Id: 12803714
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 40FCDC433EF
	for <kvm@archiver.kernel.org>; Wed,  6 Apr 2022 16:04:32 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S237103AbiDFQGa (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 6 Apr 2022 12:06:30 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51840 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S236999AbiDFQGI (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 6 Apr 2022 12:06:08 -0400
Received: from mga18.intel.com (mga18.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 27DDF44E594;
        Tue,  5 Apr 2022 21:50:05 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1649220606; x=1680756606;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=PjePiKLWyAZTDiIs2MHRUMC+Vn/D8Qwc+vrle+rDM7o=;
  b=Ih/kESGu+lj70HkdPPE/CwasWTQSa9o4NGMrAqTn5yKxKxkmcBpBV46l
   VGpyPTOB0+TZUKD46M2Vu9M05rMcrXmsXZhyx7rFnFTM3ITzBRR06/Oyv
   4yXvrXU6fm9xYoaJ6jlbdgd8u35CVQ+iQaREriVwjkBRs72/I+0Vk/iW7
   UxOOyJL55sDuSq4AyhPO0hfGeYKgH+ARaE8G1HeO8XPMdkMLvluJFR2Ba
   Jq7Z4nPCMST20JQQU0omgpQ6jvbuSgWJq0VKayibrXTJJ14zx2Xdk7uUa
   tmQRCXQi3a6yrI9ZZnpgbgReM/V0Tw58d35CP0gRh4mM/n7/l93XzX8ZV
   w==;
X-IronPort-AV: E=McAfee;i="6200,9189,10308"; a="243089796"
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="243089796"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:05 -0700
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="524302169"
Received: from dchang1-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.254.29.17])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:01 -0700
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
        len.brown@intel.com, tony.luck@intel.com,
        rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
        dan.j.williams@intel.com, peterz@infradead.org, ak@linux.intel.com,
        kirill.shutemov@linux.intel.com,
        sathyanarayanan.kuppuswamy@linux.intel.com,
        isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v3 05/21] x86/virt/tdx: Detect P-SEAMLDR and TDX module
Date: Wed,  6 Apr 2022 16:49:17 +1200
Message-Id: 
 <b9f4d4afd244d685182ce9ab5ffdd0bf245be6e2.1649219184.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <cover.1649219184.git.kai.huang@intel.com>
References: <cover.1649219184.git.kai.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

The P-SEAMLDR (persistent SEAM loader) is the first software module that
runs in SEAM VMX root, responsible for loading and updating the TDX
module.  Both the P-SEAMLDR and the TDX module are expected to be loaded
before host kernel boots.

There is no CPUID or MSR to detect whether the P-SEAMLDR or the TDX
module has been loaded.  SEAMCALL instruction fails with VMfailInvalid
when the target SEAM software module is not loaded, so SEAMCALL can be
used to detect whether the P-SEAMLDR and the TDX module are loaded.

Detect the P-SEAMLDR and the TDX module by calling SEAMLDR.INFO SEAMCALL
to get the P-SEAMLDR information.  If the SEAMCALL succeeds, the
P-SEAMLDR information further tells whether the TDX module is loaded or
not.

Also add a wrapper of __seamcall() to make SEAMCALL to the P-SEAMLDR and
the TDX module with additional defensive check on SEAMRR and CR4.VMXE,
since both detecting and initializing TDX module require the caller of
TDX to handle VMXON.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 175 +++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |  31 +++++++
 2 files changed, 205 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 53093d4ad458..674867bccc14 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -15,7 +15,9 @@
 #include <asm/msr.h>
 #include <asm/cpufeature.h>
 #include <asm/cpufeatures.h>
+#include <asm/virtext.h>
 #include <asm/tdx.h>
+#include "tdx.h"
 
 /* Support Intel Secure Arbitration Mode Range Registers (SEAMRR) */
 #define MTRR_CAP_SEAMRR			BIT(15)
@@ -74,6 +76,8 @@ static enum tdx_module_status_t tdx_module_status;
 /* Prevent concurrent attempts on TDX detection and initialization */
 static DEFINE_MUTEX(tdx_module_lock);
 
+static struct p_seamldr_info p_seamldr_info;
+
 static bool __seamrr_enabled(void)
 {
 	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
@@ -229,6 +233,160 @@ static bool tdx_keyid_sufficient(void)
 	return tdx_keyid_num >= 2;
 }
 
+/*
+ * All error codes of both the P-SEAMLDR and the TDX module SEAMCALLs
+ * have bit 63 set if SEAMCALL fails.
+ */
+#define SEAMCALL_LEAF_ERROR(_ret)	((_ret) & BIT_ULL(63))
+
+/**
+ * seamcall - make SEAMCALL to the P-SEAMLDR or the TDX module with
+ *	      additional check on SEAMRR and CR4.VMXE
+ *
+ * @fn:			SEAMCALL leaf number.
+ * @rcx:		Input operand RCX.
+ * @rdx:		Input operand RDX.
+ * @r8:			Input operand R8.
+ * @r9:			Input operand R9.
+ * @seamcall_ret:	SEAMCALL completion status (can be NULL).
+ * @out:		Additional output operands (can be NULL).
+ *
+ * Wrapper of __seamcall() to make SEAMCALL to the P-SEAMLDR or the TDX
+ * module with additional defensive check on SEAMRR and CR4.VMXE.  Caller
+ * to make sure SEAMRR is enabled and CPU is already in VMX operation
+ * before calling this function.
+ *
+ * Unlike __seamcall(), it returns kernel error code instead of SEAMCALL
+ * completion status, which is returned via @seamcall_ret if desired.
+ *
+ * Return:
+ *
+ * * -ENODEV:	SEAMCALL failed with VMfailInvalid, or SEAMRR is not enabled.
+ * * -EPERM:	CR4.VMXE is not enabled
+ * * -EFAULT:	SEAMCALL failed
+ * * -0:	SEAMCALL succeeded
+ */
+static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+		    u64 *seamcall_ret, struct tdx_module_output *out)
+{
+	u64 ret;
+
+	if (WARN_ON_ONCE(!seamrr_enabled()))
+		return -ENODEV;
+
+	/*
+	 * SEAMCALL instruction requires CPU being already in VMX
+	 * operation (VMXON has been done), otherwise it causes #UD.
+	 * Sanity check whether CR4.VMXE has been enabled.
+	 *
+	 * Note VMX being enabled in CR4 doesn't mean CPU is already
+	 * in VMX operation, but unfortunately there's no way to do
+	 * such check.  However in practice enabling CR4.VMXE and
+	 * doing VMXON are done together (for now) so in practice it
+	 * checks whether VMXON has been done.
+	 *
+	 * Preemption is disabled during the CR4.VMXE check and the
+	 * actual SEAMCALL so VMX doesn't get disabled by other threads
+	 * due to scheduling.
+	 */
+	preempt_disable();
+	if (WARN_ON_ONCE(!cpu_vmx_enabled())) {
+		preempt_enable_no_resched();
+		return -EPERM;
+	}
+
+	ret = __seamcall(fn, rcx, rdx, r8, r9, out);
+
+	preempt_enable_no_resched();
+
+	/*
+	 * Convert SEAMCALL error code to kernel error code:
+	 *  - -ENODEV:	VMfailInvalid
+	 *  - -EFAULT:	SEAMCALL failed
+	 *  - 0:	SEAMCALL was successful
+	 */
+	if (ret == TDX_SEAMCALL_VMFAILINVALID)
+		return -ENODEV;
+
+	/* Save the completion status if caller wants to use it */
+	if (seamcall_ret)
+		*seamcall_ret = ret;
+
+	/*
+	 * TDX module SEAMCALLs may also return non-zero completion
+	 * status codes but w/o bit 63 set.  Those codes are treated
+	 * as additional information/warning while the SEAMCALL is
+	 * treated as completed successfully.  Return 0 in this case.
+	 * Caller can use @seamcall_ret to get the additional code
+	 * when it is desired.
+	 */
+	if (SEAMCALL_LEAF_ERROR(ret)) {
+		pr_err("SEAMCALL leaf %llu failed: 0x%llx\n", fn, ret);
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
+static inline bool p_seamldr_ready(void)
+{
+	return !!p_seamldr_info.p_seamldr_ready;
+}
+
+static inline bool tdx_module_ready(void)
+{
+	/*
+	 * SEAMLDR_INFO.SEAM_READY indicates whether TDX module
+	 * is (loaded and) ready for SEAMCALL.
+	 */
+	return p_seamldr_ready() && !!p_seamldr_info.seam_ready;
+}
+
+/*
+ * Detect whether the P-SEAMLDR has been loaded by calling SEAMLDR.INFO
+ * SEAMCALL to get the P-SEAMLDR information, which further tells whether
+ * the TDX module has been loaded and ready for SEAMCALL.  Caller to make
+ * sure only calling this function when CPU is already in VMX operation.
+ */
+static int detect_p_seamldr(void)
+{
+	int ret;
+
+	/*
+	 * SEAMCALL fails with VMfailInvalid when SEAM software is not
+	 * loaded, in which case seamcall() returns -ENODEV.  Use this
+	 * to detect the P-SEAMLDR.
+	 *
+	 * Note the P-SEAMLDR SEAMCALL also fails with VMfailInvalid when
+	 * the P-SEAMLDR is already busy with another SEAMCALL.  But this
+	 * won't happen here as this function is only called once.
+	 */
+	ret = seamcall(P_SEAMCALL_SEAMLDR_INFO, __pa(&p_seamldr_info),
+			0, 0, 0, NULL, NULL);
+	if (ret) {
+		if (ret == -ENODEV)
+			pr_info("P-SEAMLDR is not loaded.\n");
+		else
+			pr_info("Failed to detect P-SEAMLDR.\n");
+
+		return ret;
+	}
+
+	/*
+	 * If SEAMLDR.INFO was successful, it must be ready for SEAMCALL.
+	 * Otherwise it's either kernel or firmware bug.
+	 */
+	if (WARN_ON_ONCE(!p_seamldr_ready()))
+		return -ENODEV;
+
+	pr_info("P-SEAMLDR: version 0x%x, vendor_id: 0x%x, build_date: %u, build_num %u, major %u, minor %u\n",
+		p_seamldr_info.version, p_seamldr_info.vendor_id,
+		p_seamldr_info.build_date, p_seamldr_info.build_num,
+		p_seamldr_info.major, p_seamldr_info.minor);
+
+	return 0;
+}
+
 static int __tdx_detect(void)
 {
 	/* The TDX module is not loaded if SEAMRR is disabled */
@@ -247,7 +405,22 @@ static int __tdx_detect(void)
 		goto no_tdx_module;
 	}
 
-	/* Return -ENODEV until the TDX module is detected */
+	/*
+	 * For simplicity any error during detect_p_seamldr() marks
+	 * TDX module as not loaded.
+	 */
+	if (detect_p_seamldr())
+		goto no_tdx_module;
+
+	if (!tdx_module_ready()) {
+		pr_info("TDX module is not loaded.\n");
+		goto no_tdx_module;
+	}
+
+	pr_info("TDX module detected.\n");
+	tdx_module_status = TDX_MODULE_LOADED;
+	return 0;
+
 no_tdx_module:
 	tdx_module_status = TDX_MODULE_NONE;
 	return -ENODEV;
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 9d5b6f554c20..6990c93198b3 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -3,6 +3,37 @@
 #define _X86_VIRT_TDX_H
 
 #include <linux/types.h>
+#include <linux/compiler.h>
+
+/*
+ * TDX architectural data structures
+ */
+
+#define P_SEAMLDR_INFO_ALIGNMENT	256
+
+struct p_seamldr_info {
+	u32	version;
+	u32	attributes;
+	u32	vendor_id;
+	u32	build_date;
+	u16	build_num;
+	u16	minor;
+	u16	major;
+	u8	reserved0[2];
+	u32	acm_x2apicid;
+	u8	reserved1[4];
+	u8	seaminfo[128];
+	u8	seam_ready;
+	u8	seam_debug;
+	u8	p_seamldr_ready;
+	u8	reserved2[88];
+} __packed __aligned(P_SEAMLDR_INFO_ALIGNMENT);
+
+/*
+ * P-SEAMLDR SEAMCALL leaf function
+ */
+#define P_SEAMLDR_SEAMCALL_BASE		BIT_ULL(63)
+#define P_SEAMCALL_SEAMLDR_INFO		(P_SEAMLDR_SEAMCALL_BASE | 0x0)
 
 struct tdx_module_output;
 u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,

From patchwork Wed Apr  6 04:49:18 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@intel.com>
X-Patchwork-Id: 12803595
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id D749FC433F5
	for <kvm@archiver.kernel.org>; Wed,  6 Apr 2022 14:59:04 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S235504AbiDFPBC (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 6 Apr 2022 11:01:02 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38662 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S235458AbiDFPAi (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 6 Apr 2022 11:00:38 -0400
Received: from mga18.intel.com (mga18.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0DD4E44E5AC;
        Tue,  5 Apr 2022 21:50:08 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1649220610; x=1680756610;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=SDYGwWRipsleBRGVaBxblRNs6WOMicKlRx1n0NSrBG8=;
  b=HA8U+Go9IwQDUe8Dvfs2UjfEJEVYUsMZjsIM3ZVE3gWRMXTnkf8avrob
   BCI73xn7vEZtvJ4bG+vzdOiugi9qbRrNUTV4GWofae1U5FvwSf4nF6dkS
   Zo/fGj/0jRMPzCUQLFctFJh8xE3XDsjQh5j+bjshhnNTx5B6RPRJyHTIj
   eAut1wvpOfEBCpgjo3G5hn2hFuHqaa7sCf/fWbnbVkMM/5pVpJYQF17Sc
   RLZ/kBBUv1jNA9E6Szzf93MD7Po4DtS3UXQ0M0fi/+KfZ64533t37VsPy
   nOvGMnk5qOtdcc+J5py1Zd1WfjNvstP+5v70xjrWx0L/YksAklXRp1e3J
   Q==;
X-IronPort-AV: E=McAfee;i="6200,9189,10308"; a="243089801"
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="243089801"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:08 -0700
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="524302212"
Received: from dchang1-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.254.29.17])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:05 -0700
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
        len.brown@intel.com, tony.luck@intel.com,
        rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
        dan.j.williams@intel.com, peterz@infradead.org, ak@linux.intel.com,
        kirill.shutemov@linux.intel.com,
        sathyanarayanan.kuppuswamy@linux.intel.com,
        isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v3 06/21] x86/virt/tdx: Shut down TDX module in case of error
Date: Wed,  6 Apr 2022 16:49:18 +1200
Message-Id: 
 <3f19ac995d184e52107e7117a82376cb7ecb35e7.1649219184.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <cover.1649219184.git.kai.huang@intel.com>
References: <cover.1649219184.git.kai.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

TDX supports shutting down the TDX module at any time during its
lifetime.  After TDX module is shut down, no further SEAMCALL can be
made on any logical cpu.

Shut down the TDX module in case of any error happened during the
initialization process.  It's pointless to leave the TDX module in some
middle state.

Shutting down the TDX module requires calling TDH.SYS.LP.SHUTDOWN on all
BIOS-enabled cpus, and the SEMACALL can run concurrently on different
cpus.  Implement a mechanism to run SEAMCALL concurrently on all online
cpus.  Logical-cpu scope initialization will use it too.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 40 ++++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |  5 +++++
 2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 674867bccc14..faf8355965a5 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -11,6 +11,8 @@
 #include <linux/cpumask.h>
 #include <linux/mutex.h>
 #include <linux/cpu.h>
+#include <linux/smp.h>
+#include <linux/atomic.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/cpufeature.h>
@@ -328,6 +330,39 @@ static int seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	return 0;
 }
 
+/* Data structure to make SEAMCALL on multiple CPUs concurrently */
+struct seamcall_ctx {
+	u64 fn;
+	u64 rcx;
+	u64 rdx;
+	u64 r8;
+	u64 r9;
+	atomic_t err;
+	u64 seamcall_ret;
+	struct tdx_module_output out;
+};
+
+static void seamcall_smp_call_function(void *data)
+{
+	struct seamcall_ctx *sc = data;
+	int ret;
+
+	ret = seamcall(sc->fn, sc->rcx, sc->rdx, sc->r8, sc->r9,
+			&sc->seamcall_ret, &sc->out);
+	if (ret)
+		atomic_set(&sc->err, ret);
+}
+
+/*
+ * Call the SEAMCALL on all online cpus concurrently.
+ * Return error if SEAMCALL fails on any cpu.
+ */
+static int seamcall_on_each_cpu(struct seamcall_ctx *sc)
+{
+	on_each_cpu(seamcall_smp_call_function, sc, true);
+	return atomic_read(&sc->err);
+}
+
 static inline bool p_seamldr_ready(void)
 {
 	return !!p_seamldr_info.p_seamldr_ready;
@@ -437,7 +472,10 @@ static int init_tdx_module(void)
 
 static void shutdown_tdx_module(void)
 {
-	/* TODO: Shut down the TDX module */
+	struct seamcall_ctx sc = { .fn = TDH_SYS_LP_SHUTDOWN };
+
+	seamcall_on_each_cpu(&sc);
+
 	tdx_module_status = TDX_MODULE_SHUTDOWN;
 }
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 6990c93198b3..dcc1f6dfe378 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -35,6 +35,11 @@ struct p_seamldr_info {
 #define P_SEAMLDR_SEAMCALL_BASE		BIT_ULL(63)
 #define P_SEAMCALL_SEAMLDR_INFO		(P_SEAMLDR_SEAMCALL_BASE | 0x0)
 
+/*
+ * TDX module SEAMCALL leaf functions
+ */
+#define TDH_SYS_LP_SHUTDOWN	44
+
 struct tdx_module_output;
 u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	       struct tdx_module_output *out);

From patchwork Wed Apr  6 04:49:19 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@intel.com>
X-Patchwork-Id: 12803596
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 1EA95C433EF
	for <kvm@archiver.kernel.org>; Wed,  6 Apr 2022 14:59:14 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S235326AbiDFPBM (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 6 Apr 2022 11:01:12 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43118 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S235468AbiDFPAi (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 6 Apr 2022 11:00:38 -0400
Received: from mga18.intel.com (mga18.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 056B244E5B5;
        Tue,  5 Apr 2022 21:50:12 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1649220614; x=1680756614;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=O4RpktnBj9oVWn2CFbXTXcz/tlOO8+KU7JX2IGLZkdk=;
  b=OVxoHrrB/UA4r2ZyfLnBp3dh8Wf+03/GvILywONRdaQc46jp7zjq10qj
   vCBEUJ2t4whjN5vmVv/1Pho+M9T96IK9T3lEAQH1tliKR+VLKtqbAatqs
   A3okzqvoLLo8aom4LPi7uE9+F3bAzFiU4hHM9uj/B6EoUCGZEMMexJYC5
   4qDZsqLJherYOWalNQbYw/NekdAZ57EXpSitc7Swo/1v1wZh5SSgRKSsL
   Oyzt70lpGPM78DiGITZPqbztXS916eoP0iVRh0Xna5gWdINmbOvjlBVsN
   lXDawFkfW+rP6u0akiBDNI7dJSZ0j0+3HY93pxaHILrFkz3+DYDCyM99n
   g==;
X-IronPort-AV: E=McAfee;i="6200,9189,10308"; a="243089809"
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="243089809"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:12 -0700
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="524302246"
Received: from dchang1-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.254.29.17])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:09 -0700
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
        len.brown@intel.com, tony.luck@intel.com,
        rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
        dan.j.williams@intel.com, peterz@infradead.org, ak@linux.intel.com,
        kirill.shutemov@linux.intel.com,
        sathyanarayanan.kuppuswamy@linux.intel.com,
        isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v3 07/21] x86/virt/tdx: Do TDX module global initialization
Date: Wed,  6 Apr 2022 16:49:19 +1200
Message-Id: 
 <66e6aa1dc1bade544b81120d7976cb0601f0528b.1649219184.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <cover.1649219184.git.kai.huang@intel.com>
References: <cover.1649219184.git.kai.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

Do the TDX module global initialization which requires calling
TDH.SYS.INIT once on any logical cpu.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 11 ++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index faf8355965a5..5c2f3a30be2f 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -463,11 +463,20 @@ static int __tdx_detect(void)
 
 static int init_tdx_module(void)
 {
+	int ret;
+
+	/* TDX module global initialization */
+	ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL);
+	if (ret)
+		goto out;
+
 	/*
 	 * Return -EFAULT until all steps of TDX module
 	 * initialization are done.
 	 */
-	return -EFAULT;
+	ret = -EFAULT;
+out:
+	return ret;
 }
 
 static void shutdown_tdx_module(void)
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index dcc1f6dfe378..f0983b1936d8 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -38,6 +38,7 @@ struct p_seamldr_info {
 /*
  * TDX module SEAMCALL leaf functions
  */
+#define TDH_SYS_INIT		33
 #define TDH_SYS_LP_SHUTDOWN	44
 
 struct tdx_module_output;

From patchwork Wed Apr  6 04:49:20 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@intel.com>
X-Patchwork-Id: 12803542
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id D193AC433F5
	for <kvm@archiver.kernel.org>; Wed,  6 Apr 2022 14:25:08 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S234875AbiDFO1G (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 6 Apr 2022 10:27:06 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53870 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S234682AbiDFO05 (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 6 Apr 2022 10:26:57 -0400
Received: from mga18.intel.com (mga18.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E04BD44E5BC;
        Tue,  5 Apr 2022 21:50:16 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1649220617; x=1680756617;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=JEHzGZUx8hYUK4MYh5P32wAhLE7c3yWf9l91nEbGSFw=;
  b=VfzPQ2LaG8In1grZPBDmwZMfcosuYwV7ocloc3lEh1gNYlgbOLRP91m9
   jH6ArEYOfTQw6MRHSJT791QwaPW5Ebgy8IPE7/UgOLVmzTZSDhcozy/YV
   nhyOHo1EDr7MEf06olGtNSaIbvnSV3vsQ9C0z5rKWW/cKsifIzkz6QWys
   oNGt2QYMn/v8+bzHcPZ5tgVezUUW7AxmjT/TfjEECa+ZfBuZU5hZxUlCd
   qeD1wIeLe6EAeyx+hhEMvdpgABLI4AJtU8aTWf5v+lJyfGw4drpvm7yGW
   hoJFzdbzOSR39jUJI7x5Cin0ep+m4PSEs7OkJTbvs7HX6jIIqdeddqyHV
   Q==;
X-IronPort-AV: E=McAfee;i="6200,9189,10308"; a="243089819"
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="243089819"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:16 -0700
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="524302280"
Received: from dchang1-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.254.29.17])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:12 -0700
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
        len.brown@intel.com, tony.luck@intel.com,
        rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
        dan.j.williams@intel.com, peterz@infradead.org, ak@linux.intel.com,
        kirill.shutemov@linux.intel.com,
        sathyanarayanan.kuppuswamy@linux.intel.com,
        isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v3 08/21] x86/virt/tdx: Do logical-cpu scope TDX module
 initialization
Date: Wed,  6 Apr 2022 16:49:20 +1200
Message-Id: 
 <35081dba60ef61c313c2d7334815247248b8d1da.1649219184.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <cover.1649219184.git.kai.huang@intel.com>
References: <cover.1649219184.git.kai.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

Logical-cpu scope initialization requires calling TDH.SYS.LP.INIT on all
BIOS-enabled cpus, otherwise the TDH.SYS.CONFIG SEAMCALL will fail.
TDH.SYS.LP.INIT can be called concurrently on all cpus.

Following global initialization, do the logical-cpu scope initialization
by calling TDH.SYS.LP.INIT on all online cpus.  Whether all BIOS-enabled
cpus are online is not checked here for simplicity.  The caller of
tdx_init() should guarantee all BIOS-enabled cpus are online.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 12 ++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 5c2f3a30be2f..ef2718423f0f 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -461,6 +461,13 @@ static int __tdx_detect(void)
 	return -ENODEV;
 }
 
+static int tdx_module_init_cpus(void)
+{
+	struct seamcall_ctx sc = { .fn = TDH_SYS_LP_INIT };
+
+	return seamcall_on_each_cpu(&sc);
+}
+
 static int init_tdx_module(void)
 {
 	int ret;
@@ -470,6 +477,11 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out;
 
+	/* Logical-cpu scope initialization */
+	ret = tdx_module_init_cpus();
+	if (ret)
+		goto out;
+
 	/*
 	 * Return -EFAULT until all steps of TDX module
 	 * initialization are done.
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index f0983b1936d8..b8cfdd6e12f3 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -39,6 +39,7 @@ struct p_seamldr_info {
  * TDX module SEAMCALL leaf functions
  */
 #define TDH_SYS_INIT		33
+#define TDH_SYS_LP_INIT		35
 #define TDH_SYS_LP_SHUTDOWN	44
 
 struct tdx_module_output;

From patchwork Wed Apr  6 04:49:21 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@intel.com>
X-Patchwork-Id: 12803323
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 6DFBEC433F5
	for <kvm@archiver.kernel.org>; Wed,  6 Apr 2022 12:21:40 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S232995AbiDFMW3 (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 6 Apr 2022 08:22:29 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56222 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S232912AbiDFMWQ (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 6 Apr 2022 08:22:16 -0400
Received: from mga18.intel.com (mga18.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AC02D451D46;
        Tue,  5 Apr 2022 21:50:20 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1649220620; x=1680756620;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=DUBLkToKIcUrOVeBYiDzWv7vjJBibg0NhwBA4E7Uri4=;
  b=A9B3yv0OMgS97aAjM6rzjlBpcnCib2AdGoraUv1ew84Oc+IqXNnpvW/H
   3ntpEaOfXyrhqtHtBUvEtito3cbMQ6nKLWq+mMng5tRH1cqTTVSpKfGo0
   2h9lp1KbCCnV9q9kiZjrDtgUFqSkQfIxC3SCj2LJrKEUehnvA5aigU0LH
   eW2IMkzqy3dY/+WMCDzX+2WU0aXcjB+B7fbeIO/dkmgyrzc3A247sCKtl
   0Jto+xEOzF/+kiJgxKa9MQQOsy4XHSdswpBLhzSRnRpiKlY2/QThBnLSN
   l8PiVslNS7Bpah5F8E+JDKyUBtcAAh1eEms6+YlPr66l97nt4srw2p4jH
   Q==;
X-IronPort-AV: E=McAfee;i="6200,9189,10308"; a="243089825"
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="243089825"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:20 -0700
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="524302311"
Received: from dchang1-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.254.29.17])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:16 -0700
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
        len.brown@intel.com, tony.luck@intel.com,
        rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
        dan.j.williams@intel.com, peterz@infradead.org, ak@linux.intel.com,
        kirill.shutemov@linux.intel.com,
        sathyanarayanan.kuppuswamy@linux.intel.com,
        isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v3 09/21] x86/virt/tdx: Get information about TDX module and
 convertible memory
Date: Wed,  6 Apr 2022 16:49:21 +1200
Message-Id: 
 <145620795852bf24ba2124a3f8234fd4aaac19d4.1649219184.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <cover.1649219184.git.kai.huang@intel.com>
References: <cover.1649219184.git.kai.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums.  Not all memory
satisfies these requirements.

As a result, TDX introduced the concept of a "Convertible Memory Region"
(CMR).  During boot, the firmware builds a list of all of the memory
ranges which can provide the TDX security guarantees.  The list of these
ranges, along with TDX module information, is available to the kernel by
querying the TDX module via TDH.SYS.INFO SEAMCALL.

Host kernel can choose whether or not to use all convertible memory
regions as TDX memory.  Before TDX module is ready to create any TD
guests, all TDX memory regions that host kernel intends to use must be
configured to the TDX module, using specific data structures defined by
TDX architecture.  Constructing those structures requires information of
both TDX module and the Convertible Memory Regions.  Call TDH.SYS.INFO
to get this information as preparation to construct those structures.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 131 ++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  61 +++++++++++++++++
 2 files changed, 192 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index ef2718423f0f..482e6d858181 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -80,6 +80,11 @@ static DEFINE_MUTEX(tdx_module_lock);
 
 static struct p_seamldr_info p_seamldr_info;
 
+/* Base address of CMR array needs to be 512 bytes aligned. */
+static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
+static int tdx_cmr_num;
+static struct tdsysinfo_struct tdx_sysinfo;
+
 static bool __seamrr_enabled(void)
 {
 	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
@@ -468,6 +473,127 @@ static int tdx_module_init_cpus(void)
 	return seamcall_on_each_cpu(&sc);
 }
 
+static inline bool cmr_valid(struct cmr_info *cmr)
+{
+	return !!cmr->size;
+}
+
+static void print_cmrs(struct cmr_info *cmr_array, int cmr_num,
+		       const char *name)
+{
+	int i;
+
+	for (i = 0; i < cmr_num; i++) {
+		struct cmr_info *cmr = &cmr_array[i];
+
+		pr_info("%s : [0x%llx, 0x%llx)\n", name,
+				cmr->base, cmr->base + cmr->size);
+	}
+}
+
+static int sanitize_cmrs(struct cmr_info *cmr_array, int cmr_num)
+{
+	int i, j;
+
+	/*
+	 * Intel TDX module spec, 20.7.3 CMR_INFO:
+	 *
+	 *   TDH.SYS.INFO leaf function returns a MAX_CMRS (32) entry
+	 *   array of CMR_INFO entries. The CMRs are sorted from the
+	 *   lowest base address to the highest base address, and they
+	 *   are non-overlapping.
+	 *
+	 * This implies that BIOS may generate invalid empty entries
+	 * if total CMRs are less than 32.  Skip them manually.
+	 */
+	for (i = 0; i < cmr_num; i++) {
+		struct cmr_info *cmr = &cmr_array[i];
+		struct cmr_info *prev_cmr = NULL;
+
+		/* Skip further invalid CMRs */
+		if (!cmr_valid(cmr))
+			break;
+
+		if (i > 0)
+			prev_cmr = &cmr_array[i - 1];
+
+		/*
+		 * It is a TDX firmware bug if CMRs are not
+		 * in address ascending order.
+		 */
+		if (prev_cmr && ((prev_cmr->base + prev_cmr->size) >
+					cmr->base)) {
+			pr_err("Firmware bug: CMRs not in address ascending order.\n");
+			return -EFAULT;
+		}
+	}
+
+	/*
+	 * Also a sane BIOS should never generate invalid CMR(s) between
+	 * two valid CMRs.  Sanity check this and simply return error in
+	 * this case.
+	 *
+	 * By reaching here @i is the index of the first invalid CMR (or
+	 * cmr_num).  Starting with next entry of @i since it has already
+	 * been checked.
+	 */
+	for (j = i + 1; j < cmr_num; j++)
+		if (cmr_valid(&cmr_array[j])) {
+			pr_err("Firmware bug: invalid CMR(s) among valid CMRs.\n");
+			return -EFAULT;
+		}
+
+	/*
+	 * Trim all tail invalid empty CMRs.  BIOS should generate at
+	 * least one valid CMR, otherwise it's a TDX firmware bug.
+	 */
+	tdx_cmr_num = i;
+	if (!tdx_cmr_num) {
+		pr_err("Firmware bug: No valid CMR.\n");
+		return -EFAULT;
+	}
+
+	/* Print kernel sanitized CMRs */
+	print_cmrs(tdx_cmr_array, tdx_cmr_num, "Kernel-sanitized-CMR");
+
+	return 0;
+}
+
+static int tdx_get_sysinfo(void)
+{
+	struct tdx_module_output out;
+	u64 tdsysinfo_sz, cmr_num;
+	int ret;
+
+	BUILD_BUG_ON(sizeof(struct tdsysinfo_struct) != TDSYSINFO_STRUCT_SIZE);
+
+	ret = seamcall(TDH_SYS_INFO, __pa(&tdx_sysinfo), TDSYSINFO_STRUCT_SIZE,
+			__pa(tdx_cmr_array), MAX_CMRS, NULL, &out);
+	if (ret)
+		return ret;
+
+	/*
+	 * If TDH.SYS.CONFIG succeeds, RDX contains the actual bytes
+	 * written to @tdx_sysinfo and R9 contains the actual entries
+	 * written to @tdx_cmr_array.  Sanity check them.
+	 */
+	tdsysinfo_sz = out.rdx;
+	cmr_num = out.r9;
+	if (WARN_ON_ONCE((tdsysinfo_sz > sizeof(tdx_sysinfo)) || !tdsysinfo_sz ||
+				(cmr_num > MAX_CMRS) || !cmr_num))
+		return -EFAULT;
+
+	pr_info("TDX module: vendor_id 0x%x, major_version %u, minor_version %u, build_date %u, build_num %u",
+		tdx_sysinfo.vendor_id, tdx_sysinfo.major_version,
+		tdx_sysinfo.minor_version, tdx_sysinfo.build_date,
+		tdx_sysinfo.build_num);
+
+	/* Print BIOS provided CMRs */
+	print_cmrs(tdx_cmr_array, cmr_num, "BIOS-CMR");
+
+	return sanitize_cmrs(tdx_cmr_array, cmr_num);
+}
+
 static int init_tdx_module(void)
 {
 	int ret;
@@ -482,6 +608,11 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out;
 
+	/* Get TDX module information and CMRs */
+	ret = tdx_get_sysinfo();
+	if (ret)
+		goto out;
+
 	/*
 	 * Return -EFAULT until all steps of TDX module
 	 * initialization are done.
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index b8cfdd6e12f3..2f21c45df6ac 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -29,6 +29,66 @@ struct p_seamldr_info {
 	u8	reserved2[88];
 } __packed __aligned(P_SEAMLDR_INFO_ALIGNMENT);
 
+struct cmr_info {
+	u64	base;
+	u64	size;
+} __packed;
+
+#define MAX_CMRS			32
+#define CMR_INFO_ARRAY_ALIGNMENT	512
+
+struct cpuid_config {
+	u32	leaf;
+	u32	sub_leaf;
+	u32	eax;
+	u32	ebx;
+	u32	ecx;
+	u32	edx;
+} __packed;
+
+#define TDSYSINFO_STRUCT_SIZE		1024
+#define TDSYSINFO_STRUCT_ALIGNMENT	1024
+
+struct tdsysinfo_struct {
+	/* TDX-SEAM Module Info */
+	u32	attributes;
+	u32	vendor_id;
+	u32	build_date;
+	u16	build_num;
+	u16	minor_version;
+	u16	major_version;
+	u8	reserved0[14];
+	/* Memory Info */
+	u16	max_tdmrs;
+	u16	max_reserved_per_tdmr;
+	u16	pamt_entry_size;
+	u8	reserved1[10];
+	/* Control Struct Info */
+	u16	tdcs_base_size;
+	u8	reserved2[2];
+	u16	tdvps_base_size;
+	u8	tdvps_xfam_dependent_size;
+	u8	reserved3[9];
+	/* TD Capabilities */
+	u64	attributes_fixed0;
+	u64	attributes_fixed1;
+	u64	xfam_fixed0;
+	u64	xfam_fixed1;
+	u8	reserved4[32];
+	u32	num_cpuid_config;
+	/*
+	 * The actual number of CPUID_CONFIG depends on above
+	 * 'num_cpuid_config'.  The size of 'struct tdsysinfo_struct'
+	 * is 1024B defined by TDX architecture.  Use a union with
+	 * specific padding to make 'sizeof(struct tdsysinfo_struct)'
+	 * equal to 1024.
+	 */
+	union {
+		struct cpuid_config	cpuid_configs[0];
+		u8			reserved5[892];
+	};
+} __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT);
+
 /*
  * P-SEAMLDR SEAMCALL leaf function
  */
@@ -38,6 +98,7 @@ struct p_seamldr_info {
 /*
  * TDX module SEAMCALL leaf functions
  */
+#define TDH_SYS_INFO		32
 #define TDH_SYS_INIT		33
 #define TDH_SYS_LP_INIT		35
 #define TDH_SYS_LP_SHUTDOWN	44

From patchwork Wed Apr  6 04:49:22 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@intel.com>
X-Patchwork-Id: 12803324
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 24FDBC433EF
	for <kvm@archiver.kernel.org>; Wed,  6 Apr 2022 12:21:45 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S231158AbiDFMXk (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 6 Apr 2022 08:23:40 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57046 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S232915AbiDFMWQ (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 6 Apr 2022 08:22:16 -0400
Received: from mga18.intel.com (mga18.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A60E0451D4F;
        Tue,  5 Apr 2022 21:50:24 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1649220625; x=1680756625;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=t7sxXTOzuUv8n2P6tB8DEK22QX54Lj6/5SNi7pG8i1M=;
  b=epB/Ezl4Z9WPuaQ9CqnBIFPwQyNBBvPeKxTOMzkqd2ZNrDpLIRFIMIwG
   ghWm5TV9lw84ECAN8s0zJPpa4Z+Hd1WinnabA7eImR63KHRK4C5rFYEbj
   HnRxnLppum8p2mY3Z+948EYNwduPXUVMrhJ+bd/bbfXqAiGzmcKsjvfPB
   wfmpNY4MwA/vYCVtZGlMF3mzK+510QXt/HFU2fc0/g/0iY+prv1TMqPRM
   EcsSE7nWMh/M3EIxNTT1Ld9T+H6oWCLPqc7QAZ5XapTjjU2fqQJJpAXAl
   pg1mOLhsCSni7VlNz+LtePb28nhU/0mmyPR6R8hj2FrmJwMR0zJ7Xno03
   w==;
X-IronPort-AV: E=McAfee;i="6200,9189,10308"; a="243089837"
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="243089837"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:24 -0700
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="524302334"
Received: from dchang1-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.254.29.17])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:20 -0700
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
        len.brown@intel.com, tony.luck@intel.com,
        rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
        dan.j.williams@intel.com, peterz@infradead.org, ak@linux.intel.com,
        kirill.shutemov@linux.intel.com,
        sathyanarayanan.kuppuswamy@linux.intel.com,
        isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v3 10/21] x86/virt/tdx: Add placeholder to coveret all system
 RAM as TDX memory
Date: Wed,  6 Apr 2022 16:49:22 +1200
Message-Id: 
 <6230ef28be8c360ab326c8f592acf1964ac065c1.1649219184.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <cover.1649219184.git.kai.huang@intel.com>
References: <cover.1649219184.git.kai.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums.  Not all memory
satisfies these requirements.

As a result, TDX introduced the concept of a "Convertible Memory Region"
(CMR).  During boot, the firmware builds a list of all of the memory
ranges which can provide the TDX security guarantees.  The list of these
ranges, along with TDX module information, is available to the kernel by
querying the TDX module.

In order to provide crypto protection to TD guests, the TDX architecture
also needs additional metadata to record things like which TD guest
"owns" a given page of memory.  This metadata essentially serves as the
'struct page' for the TDX module.  The space for this metadata is not
reserved by the hardware upfront and must be allocated by the kernel
and given to the TDX module.

Since this metadata consumes space, the VMM can choose whether or not to
allocate it for a given area of convertible memory.  If it chooses not
to, the memory cannot receive TDX protections and can not be used by TDX
guests as private memory.

For every memory region that the VMM wants to use as TDX memory, it sets
up a "TD Memory Region" (TDMR).  Each TDMR represents a physically
contiguous convertible range and must also have its own physically
contiguous metadata table, referred to as a Physical Address Metadata
Table (PAMT), to track status for each page in the TDMR range.

Unlike a CMR, each TDMR requires 1G granularity and alignment.  To
support physical RAM areas that don't meet those strict requirements,
each TDMR permits a number of internal "reserved areas" which can be
placed over memory holes.  If PAMT metadata is placed within a TDMR it
must be covered by one of these reserved areas.

Let's summarize the concepts:

 CMR - Firmware-enumerated physical ranges that support TDX.  CMRs are
       4K aligned.
TDMR - Physical address range which is chosen by the kernel to support
       TDX.  1G granularity and alignment required.  Each TDMR has
       reserved areas where TDX memory holes and overlapping PAMTs can
       be put into.
PAMT - Physically contiguous TDX metadata.  One table for each page size
       per TDMR.  Roughly 1/256th of TDMR in size.  256G TDMR = ~1G
       PAMT.

As one step of initializing the TDX module, the memory regions that TDX
module can use must be configured to the TDX module via an array of
TDMRs.

Constructing TDMRs to build the TDX memory consists below steps:

1) Create TDMRs to cover all memory regions that TDX module can use;
2) Allocate and set up PAMT for each TDMR;
3) Set up reserved areas for each TDMR.

Add a placeholder right after getting TDX module and CMRs information to
construct TDMRs to do the above steps, as the preparation to configure
the TDX module.  Always free TDMRs at the end of the initialization (no
matter successful or not), as TDMRs are only used during the
initialization.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 47 +++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h | 23 ++++++++++++++++++
 2 files changed, 70 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 482e6d858181..ec27350d53c1 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -13,6 +13,7 @@
 #include <linux/cpu.h>
 #include <linux/smp.h>
 #include <linux/atomic.h>
+#include <linux/slab.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/cpufeature.h>
@@ -594,8 +595,29 @@ static int tdx_get_sysinfo(void)
 	return sanitize_cmrs(tdx_cmr_array, cmr_num);
 }
 
+static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
+{
+	int i;
+
+	for (i = 0; i < tdmr_num; i++) {
+		struct tdmr_info *tdmr = tdmr_array[i];
+
+		/* kfree() works with NULL */
+		kfree(tdmr);
+		tdmr_array[i] = NULL;
+	}
+}
+
+static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
+{
+	/* Return -EFAULT until constructing TDMRs is done */
+	return -EFAULT;
+}
+
 static int init_tdx_module(void)
 {
+	struct tdmr_info **tdmr_array;
+	int tdmr_num;
 	int ret;
 
 	/* TDX module global initialization */
@@ -613,11 +635,36 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out;
 
+	/*
+	 * Prepare enough space to hold pointers of TDMRs (TDMR_INFO).
+	 * TDX requires TDMR_INFO being 512 aligned.  Each TDMR is
+	 * allocated individually within construct_tdmrs() to meet
+	 * this requirement.
+	 */
+	tdmr_array = kcalloc(tdx_sysinfo.max_tdmrs, sizeof(struct tdmr_info *),
+			GFP_KERNEL);
+	if (!tdmr_array) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/* Construct TDMRs to build TDX memory */
+	ret = construct_tdmrs(tdmr_array, &tdmr_num);
+	if (ret)
+		goto out_free_tdmrs;
+
 	/*
 	 * Return -EFAULT until all steps of TDX module
 	 * initialization are done.
 	 */
 	ret = -EFAULT;
+out_free_tdmrs:
+	/*
+	 * TDMRs are only used during initializing TDX module.  Always
+	 * free them no matter the initialization was successful or not.
+	 */
+	free_tdmrs(tdmr_array, tdmr_num);
+	kfree(tdmr_array);
 out:
 	return ret;
 }
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 2f21c45df6ac..05bf9fe6bd00 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -89,6 +89,29 @@ struct tdsysinfo_struct {
 	};
 } __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT);
 
+struct tdmr_reserved_area {
+	u64 offset;
+	u64 size;
+} __packed;
+
+#define TDMR_INFO_ALIGNMENT	512
+
+struct tdmr_info {
+	u64 base;
+	u64 size;
+	u64 pamt_1g_base;
+	u64 pamt_1g_size;
+	u64 pamt_2m_base;
+	u64 pamt_2m_size;
+	u64 pamt_4k_base;
+	u64 pamt_4k_size;
+	/*
+	 * Actual number of reserved areas depends on
+	 * 'struct tdsysinfo_struct'::max_reserved_per_tdmr.
+	 */
+	struct tdmr_reserved_area reserved_areas[0];
+} __packed __aligned(TDMR_INFO_ALIGNMENT);
+
 /*
  * P-SEAMLDR SEAMCALL leaf function
  */

From patchwork Wed Apr  6 04:49:23 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@intel.com>
X-Patchwork-Id: 12803325
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 63D3CC433EF
	for <kvm@archiver.kernel.org>; Wed,  6 Apr 2022 12:21:52 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S232439AbiDFMXp (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 6 Apr 2022 08:23:45 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50620 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S232569AbiDFMWS (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 6 Apr 2022 08:22:18 -0400
Received: from mga18.intel.com (mga18.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id ABB0C451D61;
        Tue,  5 Apr 2022 21:50:28 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1649220629; x=1680756629;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=ivZ4KN6KJd2K8ZZ9Ph2D23/EMUyoKpcBJSTRvSggXwY=;
  b=geIhRRK5dXym+nVSHPxC1dkBkaenxcKgaX0EgnfIRCfXTohB4efje81Y
   u9dRsTWTPORYWeHX4+SJKt5pip4IaYTgKTyNqZwIdJKmkhEJGuNuPHaRS
   qMQ86CvS6j3pKi5Pns5rolY5CTEjmP7hGxUbWAsAh7kv2BGT7oCu8iPhq
   +KS3Ns86s20rFDuCOWzj4h2BWZU2TXIYcFvhD2m7OOIxzd3IabHhmmmqD
   j5QFeuJvf4M0ms97DV0CKqDBgiXs4UQT3qAF0CVkXYgPd0lFrNV2hMfPd
   VD3uqnprTkIJIfbIvkku2rQJigfGlGa2G862xZb/DljG04vzLJSho0K37
   w==;
X-IronPort-AV: E=McAfee;i="6200,9189,10308"; a="243089846"
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="243089846"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:28 -0700
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="524302355"
Received: from dchang1-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.254.29.17])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:24 -0700
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
        len.brown@intel.com, tony.luck@intel.com,
        rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
        dan.j.williams@intel.com, peterz@infradead.org, ak@linux.intel.com,
        kirill.shutemov@linux.intel.com,
        sathyanarayanan.kuppuswamy@linux.intel.com,
        isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v3 11/21] x86/virt/tdx: Choose to use all system RAM as TDX
 memory
Date: Wed,  6 Apr 2022 16:49:23 +1200
Message-Id: 
 <dee8fb1cc2ab79cf80d4718405069715b0d51235.1649219184.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <cover.1649219184.git.kai.huang@intel.com>
References: <cover.1649219184.git.kai.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

As one step of initializing the TDX module, the memory regions that the
TDX module can use must be configured to it via an array of 'TD Memory
Regions' (TDMR).  The kernel is responsible for choosing which memory
regions to be used as TDX memory and building the array of TDMRs to
cover those memory regions.

The first generation of TDX-capable platforms basically guarantees all
system RAM regions during machine boot are Convertible Memory Regions
(excluding the memory below 1MB) and can be used by TDX.  The memory
pages allocated to TD guests can be any pages managed by the page
allocator.  To avoid having to modify the page allocator to distinguish
TDX and non-TDX memory allocation, adopt a simple policy to use all
system RAM regions as TDX memory.  The low 1MB pages are excluded from
TDX memory since they are not in CMRs in some platforms (those pages are
reserved at boot time and won't be managed by page allocator anyway).

This policy could be revised later if future TDX generations break
the guarantee or when the size of the metadata (~1/256th of the size of
the TDX usable memory) becomes a concern.  At that time a CMR-aware
page allocator may be necessary.

Also, on the first generation of TDX-capable machine, the system RAM
ranges discovered during boot time are all memory regions that kernel
can use during its runtime.  This is because the first generation of TDX
architecturally doesn't support ACPI memory hotplug (CMRs are generated
during machine boot and are static during machine's runtime).  Also, the
first generation of TDX-capable platform doesn't support TDX and ACPI
memory hotplug at the same time on a single machine.  Another case of
memory hotplug is user may use NVDIMM as system RAM via kmem driver.
But the first generation of TDX-capable machine doesn't support TDX and
NVDIMM simultaneously, therefore in practice it cannot happen.  One
special case is user may use 'memmap' kernel command line to reserve
part of system RAM as x86 legacy PMEMs, and user can theoretically add
them as system RAM via kmem driver.  This can be resolved by always
treating legacy PMEMs as TDX memory.

Implement a helper to loop over all RAM entries in e820 table to find
all system RAM ranges, as a preparation to covert all of them to TDX
memory.  Use 'e820_table', rather than 'e820_table_firmware' to honor
'mem' and 'memmap' command lines.  Following e820__memblock_setup(),
both E820_TYPE_RAM and E820_TYPE_RESERVED_KERN types are treated as TDX
memory, and contiguous ranges in the same NUMA node are merged together.

One difference is, as mentioned above, x86 legacy PMEMs (E820_TYPE_PRAM)
are also always treated as TDX memory.  They are underneath RAM, and
they could be used as TD guest memory.  Always including them as TDX
memory also avoids having to modify memory hotplug code to handle adding
them as system RAM via kmem driver.

To begin with, sanity check all memory regions found in e820 are fully
covered by any CMR and can be used as TDX memory.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/Kconfig            |   1 +
 arch/x86/virt/vmx/tdx/tdx.c | 228 +++++++++++++++++++++++++++++++++++-
 2 files changed, 228 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9113bf09f358..7414625b938f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1972,6 +1972,7 @@ config INTEL_TDX_HOST
 	default n
 	depends on CPU_SUP_INTEL
 	depends on X86_64
+	select NUMA_KEEP_MEMINFO if NUMA
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
 	  host and certain physical attacks.  This option enables necessary TDX
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index ec27350d53c1..6b0c51aaa7f2 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -14,11 +14,13 @@
 #include <linux/smp.h>
 #include <linux/atomic.h>
 #include <linux/slab.h>
+#include <linux/math.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/cpufeature.h>
 #include <asm/cpufeatures.h>
 #include <asm/virtext.h>
+#include <asm/e820/api.h>
 #include <asm/tdx.h>
 #include "tdx.h"
 
@@ -595,6 +597,222 @@ static int tdx_get_sysinfo(void)
 	return sanitize_cmrs(tdx_cmr_array, cmr_num);
 }
 
+/* Check whether one e820 entry is RAM and could be used as TDX memory */
+static bool e820_entry_is_ram(struct e820_entry *entry)
+{
+	/*
+	 * Besides E820_TYPE_RAM, E820_TYPE_RESERVED_KERN type entries
+	 * are also treated as TDX memory as they are also added to
+	 * memblock.memory in e820__memblock_setup().
+	 *
+	 * E820_TYPE_SOFT_RESERVED type entries are excluded as they are
+	 * marked as reserved and are not later freed to page allocator
+	 * (only part of kernel image, initrd, etc are freed to page
+	 * allocator).
+	 *
+	 * Also unconditionally treat x86 legacy PMEMs (E820_TYPE_PRAM)
+	 * as TDX memory since they are RAM underneath, and could be used
+	 * as TD guest memory.
+	 */
+	return (entry->type == E820_TYPE_RAM) ||
+		(entry->type == E820_TYPE_RESERVED_KERN) ||
+		(entry->type == E820_TYPE_PRAM);
+}
+
+/*
+ * The low memory below 1MB is not covered by CMRs on some TDX platforms.
+ * In practice, this range cannot be used for guest memory because it is
+ * not managed by the page allocator due to boot-time reservation.  Just
+ * skip the low 1MB so this range won't be treated as TDX memory.
+ *
+ * Return true if the e820 entry is completely skipped, in which case
+ * caller should ignore this entry.  Otherwise the actual memory range
+ * after skipping the low 1MB is returned via @start and @end.
+ */
+static bool e820_entry_skip_lowmem(struct e820_entry *entry, u64 *start,
+				   u64 *end)
+{
+	u64 _start = entry->addr;
+	u64 _end = entry->addr + entry->size;
+
+	if (_start < SZ_1M)
+		_start = SZ_1M;
+
+	*start = _start;
+	*end = _end;
+
+	return _start >= _end;
+}
+
+/*
+ * Trim away non-page-aligned memory at the beginning and the end for a
+ * given region.  Return true when there are still pages remaining after
+ * trimming, and the trimmed region is returned via @start and @end.
+ */
+static bool e820_entry_trim(u64 *start, u64 *end)
+{
+	u64 s, e;
+
+	s = round_up(*start, PAGE_SIZE);
+	e = round_down(*end, PAGE_SIZE);
+
+	if (s >= e)
+		return false;
+
+	*start = s;
+	*end = e;
+
+	return true;
+}
+
+/*
+ * Get the next memory region (excluding low 1MB) in e820.  @idx points
+ * to the entry to start to walk with.  Multiple memory regions in the
+ * same NUMA node that are contiguous are merged together (following
+ * e820__memblock_setup()).  The merged range is returned via @start and
+ * @end.  After return, @idx points to the next entry of the last RAM
+ * entry that has been walked, or table->nr_entries (indicating all
+ * entries in the e820 table have been walked).
+ */
+static void e820_next_mem(struct e820_table *table, int *idx, u64 *start,
+			  u64 *end)
+{
+	u64 rs, re;
+	int rnid, i;
+
+again:
+	rs = re = 0;
+	for (i = *idx; i < table->nr_entries; i++) {
+		struct e820_entry *entry = &table->entries[i];
+		u64 s, e;
+		int nid;
+
+		if (!e820_entry_is_ram(entry))
+			continue;
+
+		if (e820_entry_skip_lowmem(entry, &s, &e))
+			continue;
+
+		/*
+		 * Found the first RAM entry.  Record it and keep
+		 * looping to find other RAM entries that can be
+		 * merged.
+		 */
+		if (!rs) {
+			rs = s;
+			re = e;
+			rnid = phys_to_target_node(rs);
+			if (WARN_ON_ONCE(rnid == NUMA_NO_NODE))
+				rnid = 0;
+			continue;
+		}
+
+		/*
+		 * Try to merge with previous RAM entry.  E820 entries
+		 * are not necessarily page aligned.  For instance, the
+		 * setup_data elements in boot_params are marked as
+		 * E820_TYPE_RESERVED_KERN, and they may not be page
+		 * aligned.  In e820__memblock_setup() all adjancent
+		 * memory regions within the same NUMA node are merged to
+		 * a single one, and the non-page-aligned parts (at the
+		 * beginning and the end) are trimmed.  Follow the same
+		 * rule here.
+		 */
+		nid = phys_to_target_node(s);
+		if (WARN_ON_ONCE(nid == NUMA_NO_NODE))
+			nid = 0;
+		if ((nid == rnid) && (s == re)) {
+			/* Merge with previous range and update the end */
+			re = e;
+			continue;
+		}
+
+		/*
+		 * Stop if current entry cannot be merged with previous
+		 * one (or more) entries.
+		 */
+		break;
+	}
+
+	/*
+	 * @i is either the RAM entry that cannot be merged with previous
+	 * one (or more) entries, or table->nr_entries.
+	 */
+	*idx = i;
+	/*
+	 * Trim non-page-aligned parts of [@rs, @re), which is either a
+	 * valid memory region, or empty.  If there's nothing left after
+	 * trimming and there are still entries that have not been
+	 * walked, continue to walk.
+	 */
+	if (!e820_entry_trim(&rs, &re) && i < table->nr_entries)
+		goto again;
+
+	*start = rs;
+	*end = re;
+}
+
+/*
+ * Helper to loop all e820 RAM entries with low 1MB excluded
+ * in a given e820 table.
+ */
+#define _e820_for_each_mem(_table, _i, _start, _end)				\
+	for ((_i) = 0, e820_next_mem((_table), &(_i), &(_start), &(_end));	\
+		(_start) < (_end);						\
+		e820_next_mem((_table), &(_i), &(_start), &(_end)))
+
+/*
+ * Helper to loop all e820 RAM entries with low 1MB excluded
+ * in kernel modified 'e820_table' to honor 'mem' and 'memmap' kernel
+ * command lines.
+ */
+#define e820_for_each_mem(_i, _start, _end)	\
+	_e820_for_each_mem(e820_table, _i, _start, _end)
+
+/* Check whether first range is the subrange of the second */
+static bool is_subrange(u64 r1_start, u64 r1_end, u64 r2_start, u64 r2_end)
+{
+	return (r1_start >= r2_start && r1_end <= r2_end) ? true : false;
+}
+
+/* Check whether address range is covered by any CMR or not. */
+static bool range_covered_by_cmr(struct cmr_info *cmr_array, int cmr_num,
+				 u64 start, u64 end)
+{
+	int i;
+
+	for (i = 0; i < cmr_num; i++) {
+		struct cmr_info *cmr = &cmr_array[i];
+
+		if (is_subrange(start, end, cmr->base, cmr->base + cmr->size))
+			return true;
+	}
+
+	return false;
+}
+
+/* Sanity check whether all e820 RAM entries are fully covered by CMRs. */
+static int e820_check_against_cmrs(void)
+{
+	u64 start, end;
+	int i;
+
+	/*
+	 * Loop over e820_table to find all RAM entries and check
+	 * whether they are all fully covered by any CMR.
+	 */
+	e820_for_each_mem(i, start, end) {
+		if (!range_covered_by_cmr(tdx_cmr_array, tdx_cmr_num,
+					start, end)) {
+			pr_err("[0x%llx, 0x%llx) is not fully convertible memory\n",
+					start, end);
+			return -EFAULT;
+		}
+	}
+
+	return 0;
+}
+
 static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
 {
 	int i;
@@ -610,8 +828,16 @@ static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
 
 static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
 {
+	int ret;
+
+	ret = e820_check_against_cmrs();
+	if (ret)
+		goto err;
+
 	/* Return -EFAULT until constructing TDMRs is done */
-	return -EFAULT;
+	ret = -EFAULT;
+err:
+	return ret;
 }
 
 static int init_tdx_module(void)

From patchwork Wed Apr  6 04:49:24 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@intel.com>
X-Patchwork-Id: 12803712
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 8F455C433EF
	for <kvm@archiver.kernel.org>; Wed,  6 Apr 2022 16:04:12 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S237151AbiDFQGJ (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 6 Apr 2022 12:06:09 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55560 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S236990AbiDFQF5 (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 6 Apr 2022 12:05:57 -0400
Received: from mga18.intel.com (mga18.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5A7C9451D69;
        Tue,  5 Apr 2022 21:50:32 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1649220633; x=1680756633;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=q18Cb2EmfT3fu2+KbEiAYS7sKnhqfdZ2OKCafmYkaEk=;
  b=l2dRNc7chlronfN3dPjp5wxLjsYR6jF7j6ewY1IWW5I9Y/EO3gjUKoKa
   DiQGSG2GaG6um+By5k1G/qf84bGdyRrsKYR+x3m4LtGB4VQTsShv2ja/z
   4VhtPIqxMKPaBzKXeZRMKPd16HpC5PjJr+smWZtH7ogjp+jixAXiMr0DQ
   o+ybQgaxpQaeLHJa9IhjSp0PdbBHl2vfcbKRqaRriQ9ri8bOtnaYygQGB
   QvE3O1l34rsshU+wfTryaXodpLO5cyM6G5bHM0PNJizlaUhBARMbTJxjo
   sKrJqRCj3J1zvkVVHWPV3AFY7eAAa/FbYVASOI+2BA0oscDeW0i4jvb0G
   A==;
X-IronPort-AV: E=McAfee;i="6200,9189,10308"; a="243089857"
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="243089857"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:32 -0700
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="524302374"
Received: from dchang1-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.254.29.17])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:28 -0700
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
        len.brown@intel.com, tony.luck@intel.com,
        rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
        dan.j.williams@intel.com, peterz@infradead.org, ak@linux.intel.com,
        kirill.shutemov@linux.intel.com,
        sathyanarayanan.kuppuswamy@linux.intel.com,
        isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v3 12/21] x86/virt/tdx: Create TDMRs to cover all system RAM
Date: Wed,  6 Apr 2022 16:49:24 +1200
Message-Id: 
 <6cc984d5c23e06c9c87b4c7342758b29f8c8c022.1649219184.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <cover.1649219184.git.kai.huang@intel.com>
References: <cover.1649219184.git.kai.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

The kernel configures TDX usable memory regions to the TDX module via
an array of "TD Memory Region" (TDMR).  Each TDMR entry (TDMR_INFO)
contains the information of the base/size of a memory region, the
base/size of the associated Physical Address Metadata Table (PAMT) and
a list of reserved areas in the region.

Create a number of TDMRs according to the verified e820 RAM entries.
As the first step only set up the base/size information for each TDMR.

TDMR must be 1G aligned and the size must be in 1G granularity.  This
implies that one TDMR could cover multiple e820 RAM entries.  If a RAM
entry spans the 1GB boundary and the former part is already covered by
the previous TDMR, just create a new TDMR for the latter part.

TDX only supports a limited number of TDMRs (currently 64).  Abort the
TDMR construction process when the number of TDMRs exceeds this
limitation.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 138 ++++++++++++++++++++++++++++++++++++
 1 file changed, 138 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 6b0c51aaa7f2..82534e70df96 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -54,6 +54,18 @@
 		((u32)(((_keyid_part) & 0xffffffffull) + 1))
 #define TDX_KEYID_NUM(_keyid_part)	((u32)((_keyid_part) >> 32))
 
+/* TDMR must be 1gb aligned */
+#define TDMR_ALIGNMENT		BIT_ULL(30)
+#define TDMR_PFN_ALIGNMENT	(TDMR_ALIGNMENT >> PAGE_SHIFT)
+
+/* Align up and down the address to TDMR boundary */
+#define TDMR_ALIGN_DOWN(_addr)	ALIGN_DOWN((_addr), TDMR_ALIGNMENT)
+#define TDMR_ALIGN_UP(_addr)	ALIGN((_addr), TDMR_ALIGNMENT)
+
+/* TDMR's start and end address */
+#define TDMR_START(_tdmr)	((_tdmr)->base)
+#define TDMR_END(_tdmr)		((_tdmr)->base + (_tdmr)->size)
+
 /*
  * TDX module status during initialization
  */
@@ -813,6 +825,44 @@ static int e820_check_against_cmrs(void)
 	return 0;
 }
 
+/* The starting offset of reserved areas within TDMR_INFO */
+#define TDMR_RSVD_START		64
+
+static struct tdmr_info *__alloc_tdmr(void)
+{
+	int tdmr_sz;
+
+	/*
+	 * TDMR_INFO's actual size depends on maximum number of reserved
+	 * areas that one TDMR supports.
+	 */
+	tdmr_sz = TDMR_RSVD_START + tdx_sysinfo.max_reserved_per_tdmr *
+		sizeof(struct tdmr_reserved_area);
+
+	/*
+	 * TDX requires TDMR_INFO to be 512 aligned.  Always align up
+	 * TDMR_INFO size to 512 so the memory allocated via kzalloc()
+	 * can meet the alignment requirement.
+	 */
+	tdmr_sz = ALIGN(tdmr_sz, TDMR_INFO_ALIGNMENT);
+
+	return kzalloc(tdmr_sz, GFP_KERNEL);
+}
+
+/* Create a new TDMR at given index in the TDMR array */
+static struct tdmr_info *alloc_tdmr(struct tdmr_info **tdmr_array, int idx)
+{
+	struct tdmr_info *tdmr;
+
+	if (WARN_ON_ONCE(tdmr_array[idx]))
+		return NULL;
+
+	tdmr = __alloc_tdmr();
+	tdmr_array[idx] = tdmr;
+
+	return tdmr;
+}
+
 static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
 {
 	int i;
@@ -826,6 +876,89 @@ static void free_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
 	}
 }
 
+/*
+ * Create TDMRs to cover all RAM entries in e820_table.  The created
+ * TDMRs are saved to @tdmr_array and @tdmr_num is set to the actual
+ * number of TDMRs.  All entries in @tdmr_array must be initially NULL.
+ */
+static int create_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
+{
+	struct tdmr_info *tdmr;
+	u64 start, end;
+	int i, tdmr_idx;
+	int ret = 0;
+
+	tdmr_idx = 0;
+	tdmr = alloc_tdmr(tdmr_array, 0);
+	if (!tdmr)
+		return -ENOMEM;
+	/*
+	 * Loop over all RAM entries in e820 and create TDMRs to cover
+	 * them.  To keep it simple, always try to use one TDMR to cover
+	 * one RAM entry.
+	 */
+	e820_for_each_mem(i, start, end) {
+		start = TDMR_ALIGN_DOWN(start);
+		end = TDMR_ALIGN_UP(end);
+
+		/*
+		 * If the current TDMR's size hasn't been initialized, it
+		 * is a new allocated TDMR to cover the new RAM entry.
+		 * Otherwise the current TDMR already covers the previous
+		 * RAM entry.  In the latter case, check whether the
+		 * current RAM entry has been fully or partially covered
+		 * by the current TDMR, since TDMR is 1G aligned.
+		 */
+		if (tdmr->size) {
+			/*
+			 * Loop to next RAM entry if the current entry
+			 * is already fully covered by the current TDMR.
+			 */
+			if (end <= TDMR_END(tdmr))
+				continue;
+
+			/*
+			 * If part of current RAM entry has already been
+			 * covered by current TDMR, skip the already
+			 * covered part.
+			 */
+			if (start < TDMR_END(tdmr))
+				start = TDMR_END(tdmr);
+
+			/*
+			 * Create a new TDMR to cover the current RAM
+			 * entry, or the remaining part of it.
+			 */
+			tdmr_idx++;
+			if (tdmr_idx >= tdx_sysinfo.max_tdmrs) {
+				ret = -E2BIG;
+				goto err;
+			}
+			tdmr = alloc_tdmr(tdmr_array, tdmr_idx);
+			if (!tdmr) {
+				ret = -ENOMEM;
+				goto err;
+			}
+		}
+
+		tdmr->base = start;
+		tdmr->size = end - start;
+	}
+
+	/* @tdmr_idx is always the index of last valid TDMR. */
+	*tdmr_num = tdmr_idx + 1;
+
+	return 0;
+err:
+	/*
+	 * Clean up already allocated TDMRs in case of error.  @tdmr_idx
+	 * indicates the last TDMR that wasn't created successfully,
+	 * therefore only needs to free @tdmr_idx TDMRs.
+	 */
+	free_tdmrs(tdmr_array, tdmr_idx);
+	return ret;
+}
+
 static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
 {
 	int ret;
@@ -834,8 +967,13 @@ static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
 	if (ret)
 		goto err;
 
+	ret = create_tdmrs(tdmr_array, tdmr_num);
+	if (ret)
+		goto err;
+
 	/* Return -EFAULT until constructing TDMRs is done */
 	ret = -EFAULT;
+	free_tdmrs(tdmr_array, *tdmr_num);
 err:
 	return ret;
 }

From patchwork Wed Apr  6 04:49:25 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@intel.com>
X-Patchwork-Id: 12803715
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 455D0C433FE
	for <kvm@archiver.kernel.org>; Wed,  6 Apr 2022 16:04:43 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S236896AbiDFQGl (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 6 Apr 2022 12:06:41 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56898 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S236994AbiDFQGI (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 6 Apr 2022 12:06:08 -0400
Received: from mga18.intel.com (mga18.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5F459451D76;
        Tue,  5 Apr 2022 21:50:36 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1649220636; x=1680756636;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=U7JIb6jRGaOT3Ma7lXiC1ZND2BbsDZGoURUyJcHBlHw=;
  b=eEw4pvWPbXrO+SkDyCvIjELa7ITe0wLJemM30qmeL9ml4LkDBqW0u6/7
   i/auPoBewbRv7T+LgtTkGv0NFWF9ORnSInFfX4f3RyAhy0bOoEJYFZjQk
   9B4f7yT7qP/k4xRzH4Fv4gygDKAJOrFv4AtXK1Pagsa2td6mbUeYr8Zdy
   ZTD2fH3P8YisJa/XcY1vJ8qlaaFosgS0cVRxKqbRqIyELru7wxR24MaFa
   YN/sQ6dHuX5YOJUhFyCq+8mTmkKI1yf4Ge6o2+/86EvopZ+4Hf/Q/6/Vb
   18mjTrAVBg4KDFHYaxrFo1+YreLuy0/4DrfkXb46UlVn9ztx7dJ/edzH3
   w==;
X-IronPort-AV: E=McAfee;i="6200,9189,10308"; a="243089867"
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="243089867"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:36 -0700
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="524302385"
Received: from dchang1-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.254.29.17])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:32 -0700
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
        len.brown@intel.com, tony.luck@intel.com,
        rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
        dan.j.williams@intel.com, peterz@infradead.org, ak@linux.intel.com,
        kirill.shutemov@linux.intel.com,
        sathyanarayanan.kuppuswamy@linux.intel.com,
        isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v3 13/21] x86/virt/tdx: Allocate and set up PAMTs for TDMRs
Date: Wed,  6 Apr 2022 16:49:25 +1200
Message-Id: 
 <ffc2eefdd212a31278978e8bfccd571355db69b0.1649219184.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <cover.1649219184.git.kai.huang@intel.com>
References: <cover.1649219184.git.kai.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

In order to provide crypto protection to guests, the TDX module uses
additional metadata to record things like which guest "owns" a given
page of memory.  This metadata, referred as Physical Address Metadata
Table (PAMT), essentially serves as the 'struct page' for the TDX
module.  PAMTs are not reserved by hardware upfront.  They must be
allocated by the kernel and then given to the TDX module.

TDX supports 3 page sizes: 4K, 2M, and 1G.  Each "TD Memory Region"
(TDMR) has 3 PAMTs to track the 3 supported page sizes respectively.
Each PAMT must be a physically contiguous area from the Convertible
Memory Regions (CMR).  However, the PAMTs which track pages in one TDMR
do not need to reside within that TDMR but can be anywhere in CMRs.
If one PAMT overlaps with any TDMR, the overlapping part must be
reported as a reserved area in that particular TDMR.

Use alloc_contig_pages() since PAMT must be a physically contiguous area
and it may be potentially large (~1/256th of the size of the given TDMR).

The current version of TDX supports at most 16 reserved areas per TDMR
to cover both PAMTs and potential memory holes within the TDMR.  If many
PAMTs are allocated within a single TDMR, 16 reserved areas may not be
sufficient to cover all of them.

Adopt the following policies when allocating PAMTs for a given TDMR:

  - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
    the total number of reserved areas consumed for PAMTs.
  - Try to first allocate PAMT from the local node of the TDMR for better
    NUMA locality.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/Kconfig            |   1 +
 arch/x86/virt/vmx/tdx/tdx.c | 165 ++++++++++++++++++++++++++++++++++++
 2 files changed, 166 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7414625b938f..ff68d0829bd7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1973,6 +1973,7 @@ config INTEL_TDX_HOST
 	depends on CPU_SUP_INTEL
 	depends on X86_64
 	select NUMA_KEEP_MEMINFO if NUMA
+	depends on CONTIG_ALLOC
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
 	  host and certain physical attacks.  This option enables necessary TDX
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 82534e70df96..1b807dcbc101 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -21,6 +21,7 @@
 #include <asm/cpufeatures.h>
 #include <asm/virtext.h>
 #include <asm/e820/api.h>
+#include <asm/pgtable.h>
 #include <asm/tdx.h>
 #include "tdx.h"
 
@@ -66,6 +67,16 @@
 #define TDMR_START(_tdmr)	((_tdmr)->base)
 #define TDMR_END(_tdmr)		((_tdmr)->base + (_tdmr)->size)
 
+/* Page sizes supported by TDX */
+enum tdx_page_sz {
+	TDX_PG_4K = 0,
+	TDX_PG_2M,
+	TDX_PG_1G,
+	TDX_PG_MAX,
+};
+
+#define TDX_HPAGE_SHIFT	9
+
 /*
  * TDX module status during initialization
  */
@@ -959,6 +970,148 @@ static int create_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
 	return ret;
 }
 
+/* Calculate PAMT size given a TDMR and a page size */
+static unsigned long __tdmr_get_pamt_sz(struct tdmr_info *tdmr,
+					enum tdx_page_sz pgsz)
+{
+	unsigned long pamt_sz;
+
+	pamt_sz = (tdmr->size >> ((TDX_HPAGE_SHIFT * pgsz) + PAGE_SHIFT)) *
+		tdx_sysinfo.pamt_entry_size;
+	/* PAMT size must be 4K aligned */
+	pamt_sz = ALIGN(pamt_sz, PAGE_SIZE);
+
+	return pamt_sz;
+}
+
+/* Calculate the size of all PAMTs for a TDMR */
+static unsigned long tdmr_get_pamt_sz(struct tdmr_info *tdmr)
+{
+	enum tdx_page_sz pgsz;
+	unsigned long pamt_sz;
+
+	pamt_sz = 0;
+	for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++)
+		pamt_sz += __tdmr_get_pamt_sz(tdmr, pgsz);
+
+	return pamt_sz;
+}
+
+/*
+ * Locate the NUMA node containing the start of the given TDMR's first
+ * RAM entry.  The given TDMR may also cover memory in other NUMA nodes.
+ */
+static int tdmr_get_nid(struct tdmr_info *tdmr)
+{
+	u64 start, end;
+	int i;
+
+	/* Find the first RAM entry covered by the TDMR */
+	e820_for_each_mem(i, start, end)
+		if (end > TDMR_START(tdmr))
+			break;
+
+	/*
+	 * One TDMR must cover at least one (or partial) RAM entry,
+	 * otherwise it is kernel bug.  WARN_ON() in this case.
+	 */
+	if (WARN_ON_ONCE((start >= end) || start >= TDMR_END(tdmr)))
+		return 0;
+
+	/*
+	 * The first RAM entry may be partially covered by the previous
+	 * TDMR.  In this case, use TDMR's start to find the NUMA node.
+	 */
+	if (start < TDMR_START(tdmr))
+		start = TDMR_START(tdmr);
+
+	return phys_to_target_node(start);
+}
+
+static int tdmr_setup_pamt(struct tdmr_info *tdmr)
+{
+	unsigned long tdmr_pamt_base, pamt_base[TDX_PG_MAX];
+	unsigned long pamt_sz[TDX_PG_MAX];
+	unsigned long pamt_npages;
+	struct page *pamt;
+	enum tdx_page_sz pgsz;
+	int nid;
+
+	/*
+	 * Allocate one chunk of physically contiguous memory for all
+	 * PAMTs.  This helps minimize the PAMT's use of reserved areas
+	 * in overlapped TDMRs.
+	 */
+	nid = tdmr_get_nid(tdmr);
+	pamt_npages = tdmr_get_pamt_sz(tdmr) >> PAGE_SHIFT;
+	pamt = alloc_contig_pages(pamt_npages, GFP_KERNEL, nid,
+			&node_online_map);
+	if (!pamt)
+		return -ENOMEM;
+
+	/* Calculate PAMT base and size for all supported page sizes. */
+	tdmr_pamt_base = page_to_pfn(pamt) << PAGE_SHIFT;
+	for (pgsz = TDX_PG_4K; pgsz < TDX_PG_MAX; pgsz++) {
+		unsigned long sz = __tdmr_get_pamt_sz(tdmr, pgsz);
+
+		pamt_base[pgsz] = tdmr_pamt_base;
+		pamt_sz[pgsz] = sz;
+
+		tdmr_pamt_base += sz;
+	}
+
+	tdmr->pamt_4k_base = pamt_base[TDX_PG_4K];
+	tdmr->pamt_4k_size = pamt_sz[TDX_PG_4K];
+	tdmr->pamt_2m_base = pamt_base[TDX_PG_2M];
+	tdmr->pamt_2m_size = pamt_sz[TDX_PG_2M];
+	tdmr->pamt_1g_base = pamt_base[TDX_PG_1G];
+	tdmr->pamt_1g_size = pamt_sz[TDX_PG_1G];
+
+	return 0;
+}
+
+static void tdmr_free_pamt(struct tdmr_info *tdmr)
+{
+	unsigned long pamt_pfn, pamt_sz;
+
+	pamt_pfn = tdmr->pamt_4k_base >> PAGE_SHIFT;
+	pamt_sz = tdmr->pamt_4k_size + tdmr->pamt_2m_size + tdmr->pamt_1g_size;
+
+	/* Do nothing if PAMT hasn't been allocated for this TDMR */
+	if (!pamt_sz)
+		return;
+
+	if (WARN_ON(!pamt_pfn))
+		return;
+
+	free_contig_range(pamt_pfn, pamt_sz >> PAGE_SHIFT);
+}
+
+static void tdmrs_free_pamt_all(struct tdmr_info **tdmr_array, int tdmr_num)
+{
+	int i;
+
+	for (i = 0; i < tdmr_num; i++)
+		tdmr_free_pamt(tdmr_array[i]);
+}
+
+/* Allocate and set up PAMTs for all TDMRs */
+static int tdmrs_setup_pamt_all(struct tdmr_info **tdmr_array, int tdmr_num)
+{
+	int i, ret;
+
+	for (i = 0; i < tdmr_num; i++) {
+		ret = tdmr_setup_pamt(tdmr_array[i]);
+		if (ret)
+			goto err;
+	}
+
+	return 0;
+err:
+	tdmrs_free_pamt_all(tdmr_array, tdmr_num);
+	return -ENOMEM;
+}
+
 static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
 {
 	int ret;
@@ -971,8 +1124,14 @@ static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
 	if (ret)
 		goto err;
 
+	ret = tdmrs_setup_pamt_all(tdmr_array, *tdmr_num);
+	if (ret)
+		goto err_free_tdmrs;
+
 	/* Return -EFAULT until constructing TDMRs is done */
 	ret = -EFAULT;
+	tdmrs_free_pamt_all(tdmr_array, *tdmr_num);
+err_free_tdmrs:
 	free_tdmrs(tdmr_array, *tdmr_num);
 err:
 	return ret;
@@ -1022,6 +1181,12 @@ static int init_tdx_module(void)
 	 * initialization are done.
 	 */
 	ret = -EFAULT;
+	/*
+	 * Free PAMTs allocated in construct_tdmrs() when TDX module
+	 * initialization fails.
+	 */
+	if (ret)
+		tdmrs_free_pamt_all(tdmr_array, tdmr_num);
 out_free_tdmrs:
 	/*
 	 * TDMRs are only used during initializing TDX module.  Always

From patchwork Wed Apr  6 04:49:26 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@intel.com>
X-Patchwork-Id: 12803491
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id B0631C433F5
	for <kvm@archiver.kernel.org>; Wed,  6 Apr 2022 14:01:36 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233995AbiDFODd (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 6 Apr 2022 10:03:33 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54268 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S234060AbiDFODX (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 6 Apr 2022 10:03:23 -0400
Received: from mga18.intel.com (mga18.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DE48095A20;
        Tue,  5 Apr 2022 21:50:54 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1649220655; x=1680756655;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=XT+Yu0zqAvB34KyhxTSK4J2jUzqJeBdgYuJKQre9+Fs=;
  b=KCalozvRFFIAKT+i1S41qpEnELIEWtBrF+PUbel6RY3is0qHkNu1FCpT
   7ABlQOYOzBu1QQRAE2U4IVhMpkeec6V5oNO32yeVqiH5oyrynRDKgTnjS
   0dn/zqO2Gy9X6aVH70pQYH4wXzgVkSFbeopB6IJyfjnVQ5vCwFRQ/2tWe
   RzVQ4Cs7RoDAH3WhDYO3Mfm9qS+bBWzMr31Bb88z0TNKs2vWUSUCjX/HW
   5u4KXgh6jKGZKkkp10SU1Xjix24HIHQeWPJJLY+7zmhna4gPI2pjIo7e4
   yzfM1MjoUQEET6Lj+T5enKZBw5h8Na5AQgimSyYFEfx4Ki3MMTVIxWoyW
   Q==;
X-IronPort-AV: E=McAfee;i="6200,9189,10308"; a="243089873"
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="243089873"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:40 -0700
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="524302404"
Received: from dchang1-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.254.29.17])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:36 -0700
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
        len.brown@intel.com, tony.luck@intel.com,
        rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
        dan.j.williams@intel.com, peterz@infradead.org, ak@linux.intel.com,
        kirill.shutemov@linux.intel.com,
        sathyanarayanan.kuppuswamy@linux.intel.com,
        isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v3 14/21] x86/virt/tdx: Set up reserved areas for all TDMRs
Date: Wed,  6 Apr 2022 16:49:26 +1200
Message-Id: 
 <b761a1b05a87c2c291daf811c325ceb4198dfba8.1649219184.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <cover.1649219184.git.kai.huang@intel.com>
References: <cover.1649219184.git.kai.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

As the last step of constructing TDMRs, create reserved area information
for the memory region holes in each TDMR.  If any PAMT (or part of it)
resides within a particular TDMR, also mark it as reserved.

All reserved areas in each TDMR must be in address ascending order,
required by TDX architecture.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 148 +++++++++++++++++++++++++++++++++++-
 1 file changed, 146 insertions(+), 2 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 1b807dcbc101..bf0d13644898 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -15,6 +15,7 @@
 #include <linux/atomic.h>
 #include <linux/slab.h>
 #include <linux/math.h>
+#include <linux/sort.h>
 #include <asm/msr-index.h>
 #include <asm/msr.h>
 #include <asm/cpufeature.h>
@@ -1112,6 +1113,145 @@ static int tdmrs_setup_pamt_all(struct tdmr_info **tdmr_array, int tdmr_num)
 	return -ENOMEM;
 }
 
+static int tdmr_add_rsvd_area(struct tdmr_info *tdmr, int *p_idx,
+			      u64 addr, u64 size)
+{
+	struct tdmr_reserved_area *rsvd_areas = tdmr->reserved_areas;
+	int idx = *p_idx;
+
+	/* Reserved area must be 4K aligned in offset and size */
+	if (WARN_ON(addr & ~PAGE_MASK || size & ~PAGE_MASK))
+		return -EINVAL;
+
+	/* Cannot exceed maximum reserved areas supported by TDX */
+	if (idx >= tdx_sysinfo.max_reserved_per_tdmr)
+		return -E2BIG;
+
+	rsvd_areas[idx].offset = addr - tdmr->base;
+	rsvd_areas[idx].size = size;
+
+	*p_idx = idx + 1;
+
+	return 0;
+}
+
+/* Compare function called by sort() for TDMR reserved areas */
+static int rsvd_area_cmp_func(const void *a, const void *b)
+{
+	struct tdmr_reserved_area *r1 = (struct tdmr_reserved_area *)a;
+	struct tdmr_reserved_area *r2 = (struct tdmr_reserved_area *)b;
+
+	if (r1->offset + r1->size <= r2->offset)
+		return -1;
+	if (r1->offset >= r2->offset + r2->size)
+		return 1;
+
+	/* Reserved areas cannot overlap.  Caller should guarantee. */
+	WARN_ON(1);
+	return -1;
+}
+
+/* Set up reserved areas for a TDMR, including memory holes and PAMTs */
+static int tdmr_setup_rsvd_areas(struct tdmr_info *tdmr,
+				     struct tdmr_info **tdmr_array,
+				     int tdmr_num)
+{
+	u64 start, end, prev_end;
+	int rsvd_idx, i, ret = 0;
+
+	/* Mark holes between e820 RAM entries as reserved */
+	rsvd_idx = 0;
+	prev_end = TDMR_START(tdmr);
+	e820_for_each_mem(i, start, end) {
+		/* Break if this entry is after the TDMR */
+		if (start >= TDMR_END(tdmr))
+			break;
+
+		/* Exclude entries before this TDMR */
+		if (end < TDMR_START(tdmr))
+			continue;
+
+		/*
+		 * Skip if no hole exists before this entry. "<=" is
+		 * used because one e820 entry might span two TDMRs.
+		 * In that case the start address of this entry is
+		 * smaller then the start address of the second TDMR.
+		 */
+		if (start <= prev_end) {
+			prev_end = end;
+			continue;
+		}
+
+		/* Add the hole before this e820 entry */
+		ret = tdmr_add_rsvd_area(tdmr, &rsvd_idx, prev_end,
+				start - prev_end);
+		if (ret)
+			return ret;
+
+		prev_end = end;
+	}
+
+	/* Add the hole after the last RAM entry if it exists. */
+	if (prev_end < TDMR_END(tdmr)) {
+		ret = tdmr_add_rsvd_area(tdmr, &rsvd_idx, prev_end,
+				TDMR_END(tdmr) - prev_end);
+		if (ret)
+			return ret;
+	}
+
+	/*
+	 * Walk over all TDMRs to find out whether any PAMT falls into
+	 * the given TDMR. If yes, mark it as reserved too.
+	 */
+	for (i = 0; i < tdmr_num; i++) {
+		struct tdmr_info *tmp = tdmr_array[i];
+		u64 pamt_start, pamt_end;
+
+		pamt_start = tmp->pamt_4k_base;
+		pamt_end = pamt_start + tmp->pamt_4k_size +
+			tmp->pamt_2m_size + tmp->pamt_1g_size;
+
+		/* Skip PAMTs outside of the given TDMR */
+		if ((pamt_end <= TDMR_START(tdmr)) ||
+				(pamt_start >= TDMR_END(tdmr)))
+			continue;
+
+		/* Only mark the part within the TDMR as reserved */
+		if (pamt_start < TDMR_START(tdmr))
+			pamt_start = TDMR_START(tdmr);
+		if (pamt_end > TDMR_END(tdmr))
+			pamt_end = TDMR_END(tdmr);
+
+		ret = tdmr_add_rsvd_area(tdmr, &rsvd_idx, pamt_start,
+				pamt_end - pamt_start);
+		if (ret)
+			return ret;
+	}
+
+	/* TDX requires reserved areas listed in address ascending order */
+	sort(tdmr->reserved_areas, rsvd_idx, sizeof(struct tdmr_reserved_area),
+			rsvd_area_cmp_func, NULL);
+
+	return 0;
+}
+
+static int tdmrs_setup_rsvd_areas_all(struct tdmr_info **tdmr_array,
+				      int tdmr_num)
+{
+	int i;
+
+	for (i = 0; i < tdmr_num; i++) {
+		int ret;
+
+		ret = tdmr_setup_rsvd_areas(tdmr_array[i], tdmr_array,
+				tdmr_num);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
 static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
 {
 	int ret;
@@ -1128,8 +1268,12 @@ static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
 	if (ret)
 		goto err_free_tdmrs;
 
-	/* Return -EFAULT until constructing TDMRs is done */
-	ret = -EFAULT;
+	ret = tdmrs_setup_rsvd_areas_all(tdmr_array, *tdmr_num);
+	if (ret)
+		goto err_free_pamts;
+
+	return 0;
+err_free_pamts:
 	tdmrs_free_pamt_all(tdmr_array, *tdmr_num);
 err_free_tdmrs:
 	free_tdmrs(tdmr_array, *tdmr_num);

From patchwork Wed Apr  6 04:49:27 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@intel.com>
X-Patchwork-Id: 12803713
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 8D199C433F5
	for <kvm@archiver.kernel.org>; Wed,  6 Apr 2022 16:04:25 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S236847AbiDFQGY (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 6 Apr 2022 12:06:24 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51562 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S236964AbiDFQFz (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 6 Apr 2022 12:05:55 -0400
Received: from mga18.intel.com (mga18.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3A04A16249C;
        Tue,  5 Apr 2022 21:50:56 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1649220656; x=1680756656;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=eHwghcfLHXsMY9IOJjx34UuVcdIv13Ep0z+VBduV9pk=;
  b=CcSPUB7o0AZNDulqHZugGTedxNwb4TTc4vrpSxTIrjti9z6bF4ldcjV0
   sW8cfffmW0HiPgiMmphhuvn77CYyi1pDXNOj6ULgdGtzY2SwWlRVEUxVc
   uVolkxa5JFSrRFCv62lXMr0Uaf3x7cTUxqD23/DJH6gLMH7PnXxiKpf+z
   6N6/opq5GRnHFHodzKKWeC5cOQx5z9mNIsTLP+bd9Vu+dsmXk5tUgszzo
   FLCSfWlR0kJAJvNhPMehIXZdCbsqCEV/LphgTOwD1fPlH71LuWUG9P5Fc
   bFMMIeQwXYD1BdoewVpzyqO7DnEBrlMBTQFuunBs1TH+eTSvTHIAEFSGk
   g==;
X-IronPort-AV: E=McAfee;i="6200,9189,10308"; a="243089880"
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="243089880"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:44 -0700
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="524302416"
Received: from dchang1-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.254.29.17])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:40 -0700
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
        len.brown@intel.com, tony.luck@intel.com,
        rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
        dan.j.williams@intel.com, peterz@infradead.org, ak@linux.intel.com,
        kirill.shutemov@linux.intel.com,
        sathyanarayanan.kuppuswamy@linux.intel.com,
        isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v3 15/21] x86/virt/tdx: Reserve TDX module global KeyID
Date: Wed,  6 Apr 2022 16:49:27 +1200
Message-Id: 
 <3e4929500a7b5bbaaf8ed23cde088172862acae0.1649219184.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <cover.1649219184.git.kai.huang@intel.com>
References: <cover.1649219184.git.kai.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

TDX module initialization requires to use one TDX private KeyID as the
global KeyID to crypto protect TDX metadata.  The global KeyID is
configured to the TDX module along with TDMRs.

Just reserve the first TDX private KeyID as the global KeyID.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index bf0d13644898..ecd65f7014e2 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -112,6 +112,9 @@ static struct cmr_info tdx_cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMEN
 static int tdx_cmr_num;
 static struct tdsysinfo_struct tdx_sysinfo;
 
+/* TDX global KeyID to protect TDX metadata */
+static u32 tdx_global_keyid;
+
 static bool __seamrr_enabled(void)
 {
 	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
@@ -1320,6 +1323,12 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_free_tdmrs;
 
+	/*
+	 * Reserve the first TDX KeyID as global KeyID to protect
+	 * TDX module metadata.
+	 */
+	tdx_global_keyid = tdx_keyid_start;
+
 	/*
 	 * Return -EFAULT until all steps of TDX module
 	 * initialization are done.

From patchwork Wed Apr  6 04:49:28 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@intel.com>
X-Patchwork-Id: 12802973
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id ACD0EC433EF
	for <kvm@archiver.kernel.org>; Wed,  6 Apr 2022 09:30:12 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S242337AbiDFJcG (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 6 Apr 2022 05:32:06 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37822 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1344855AbiDFJZe (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 6 Apr 2022 05:25:34 -0400
Received: from mga18.intel.com (mga18.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 791CE16249D;
        Tue,  5 Apr 2022 21:50:56 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1649220656; x=1680756656;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=iC68qSmiZTNL7Prs9cAoa+JCPh9pm5fBnRuabYf0ICw=;
  b=b60JoccvlOMWt730h5IrjAiZrGKl+Sek7aldvGAslijwCsXQKA+HVBhN
   NhoYRGdJAMUQMg37Mrdy1K2Myj7YJTkjeQv1y+sONXZMrVGhXgMOSbKY/
   x3HGW1KzeWj8hedszNg2NDITcAEDFqoilmFZBXSwYPxe4SebpAUCSvuN7
   hATxNWi3eyyVuSf+a4HDtLM8JqdThDP/9lq5Fe/OtKmxAh9tPsxy3KYnM
   Zvo+/RZckXfZ+gMqWuTjlgZjgIYaFPi/UX3pJNb5AZ6Z8Er8jmPfRPvRQ
   mOrvIw7XXWOtsuZo0ABOQ5ns6hCbs6eFrwfU345GsqXO5vPbgznDF8RCo
   w==;
X-IronPort-AV: E=McAfee;i="6200,9189,10308"; a="243089887"
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="243089887"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:48 -0700
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="524302447"
Received: from dchang1-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.254.29.17])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:44 -0700
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
        len.brown@intel.com, tony.luck@intel.com,
        rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
        dan.j.williams@intel.com, peterz@infradead.org, ak@linux.intel.com,
        kirill.shutemov@linux.intel.com,
        sathyanarayanan.kuppuswamy@linux.intel.com,
        isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v3 16/21] x86/virt/tdx: Configure TDX module with TDMRs and
 global KeyID
Date: Wed,  6 Apr 2022 16:49:28 +1200
Message-Id: 
 <9cbb09c01bc145c580c084b4fc27c54ada771e72.1649219184.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <cover.1649219184.git.kai.huang@intel.com>
References: <cover.1649219184.git.kai.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

After the TDX usable memory regions are constructed in an array of TDMRs
and the global KeyID is reserved, configure them to the TDX module.  The
configuration is done via TDH.SYS.CONFIG, which is one call and can be
done on any logical cpu.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 42 +++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/tdx.h |  2 ++
 2 files changed, 44 insertions(+)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index ecd65f7014e2..2bf49d3d7cfe 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1284,6 +1284,42 @@ static int construct_tdmrs(struct tdmr_info **tdmr_array, int *tdmr_num)
 	return ret;
 }
 
+static int config_tdx_module(struct tdmr_info **tdmr_array, int tdmr_num,
+			     u64 global_keyid)
+{
+	u64 *tdmr_pa_array;
+	int i, array_sz;
+	int ret;
+
+	/*
+	 * TDMR_INFO entries are configured to the TDX module via an
+	 * array of the physical address of each TDMR_INFO.  TDX requires
+	 * the array itself must be 512 aligned.  Round up the array size
+	 * to 512 aligned so the buffer allocated by kzalloc() meets the
+	 * alignment requirement.
+	 */
+	array_sz = ALIGN(tdmr_num * sizeof(u64), TDMR_INFO_PA_ARRAY_ALIGNMENT);
+	tdmr_pa_array = kzalloc(array_sz, GFP_KERNEL);
+	if (!tdmr_pa_array)
+		return -ENOMEM;
+
+	for (i = 0; i < tdmr_num; i++)
+		tdmr_pa_array[i] = __pa(tdmr_array[i]);
+
+	/*
+	 * TDH.SYS.CONFIG fails when TDH.SYS.LP.INIT is not done on all
+	 * BIOS-enabled cpus.  tdx_init() only disables CPU hotplug but
+	 * doesn't do early check whether all BIOS-enabled cpus are
+	 * online, so TDH.SYS.CONFIG can fail here.
+	 */
+	ret = seamcall(TDH_SYS_CONFIG, __pa(tdmr_pa_array), tdmr_num,
+				global_keyid, 0, NULL, NULL);
+	/* Free the array as it is not required any more. */
+	kfree(tdmr_pa_array);
+
+	return ret;
+}
+
 static int init_tdx_module(void)
 {
 	struct tdmr_info **tdmr_array;
@@ -1329,11 +1365,17 @@ static int init_tdx_module(void)
 	 */
 	tdx_global_keyid = tdx_keyid_start;
 
+	/* Config the TDX module with TDMRs and global KeyID */
+	ret = config_tdx_module(tdmr_array, tdmr_num, tdx_global_keyid);
+	if (ret)
+		goto out_free_pamts;
+
 	/*
 	 * Return -EFAULT until all steps of TDX module
 	 * initialization are done.
 	 */
 	ret = -EFAULT;
+out_free_pamts:
 	/*
 	 * Free PAMTs allocated in construct_tdmrs() when TDX module
 	 * initialization fails.
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 05bf9fe6bd00..d8e2800397af 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -95,6 +95,7 @@ struct tdmr_reserved_area {
 } __packed;
 
 #define TDMR_INFO_ALIGNMENT	512
+#define TDMR_INFO_PA_ARRAY_ALIGNMENT	512
 
 struct tdmr_info {
 	u64 base;
@@ -125,6 +126,7 @@ struct tdmr_info {
 #define TDH_SYS_INIT		33
 #define TDH_SYS_LP_INIT		35
 #define TDH_SYS_LP_SHUTDOWN	44
+#define TDH_SYS_CONFIG		45
 
 struct tdx_module_output;
 u64 __seamcall(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,

From patchwork Wed Apr  6 04:49:29 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@intel.com>
X-Patchwork-Id: 12803122
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 32920C433EF
	for <kvm@archiver.kernel.org>; Wed,  6 Apr 2022 10:21:17 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1351080AbiDFKXI (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 6 Apr 2022 06:23:08 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38588 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S243496AbiDFKWD (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 6 Apr 2022 06:22:03 -0400
Received: from mga18.intel.com (mga18.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9445216249F;
        Tue,  5 Apr 2022 21:50:57 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1649220657; x=1680756657;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=PxZ+GxynIkoMLEUiFgS85yx4Qd2Ez7/qEzRgLcnxHFY=;
  b=ev6BxZpAiHMSpmj7Bemzw9NgL/mYTQdHLJDqKaZgTdrhgPecD5v6vX2N
   AQtCwWZZdwitVHpuyX9lJgoiuaSXrqo4hWTSShr4Qv5VbmprRzULupuXG
   ugCdqfg3Q2J+InpP7rWMHirv+DastwUQmXGZR1/fgkJqO7PdyGCyk/tmO
   285ngxAmRlQXArQbiyL+ewW+sQlxJVDH2uUdBfQ1NpudF8Npo/mu0uV9A
   DkRmrme5qpxDQo5CSIxftfT/Yv+gyz+lYZoaAm/UDngHPOX+YGxm3CZTV
   VkPhrkVZ3qjKc+Ke6EiVRTwyEqgqx1rLmVKJJkBGRY0IlJ2a2ddmTbvNg
   A==;
X-IronPort-AV: E=McAfee;i="6200,9189,10308"; a="243089893"
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="243089893"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:52 -0700
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="524302458"
Received: from dchang1-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.254.29.17])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:48 -0700
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
        len.brown@intel.com, tony.luck@intel.com,
        rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
        dan.j.williams@intel.com, peterz@infradead.org, ak@linux.intel.com,
        kirill.shutemov@linux.intel.com,
        sathyanarayanan.kuppuswamy@linux.intel.com,
        isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v3 17/21] x86/virt/tdx: Configure global KeyID on all packages
Date: Wed,  6 Apr 2022 16:49:29 +1200
Message-Id: 
 <ccf14e7fef0a3eab6ad877204ddc1113763c8a63.1649219184.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <cover.1649219184.git.kai.huang@intel.com>
References: <cover.1649219184.git.kai.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

Before TDX module can use the global KeyID to access TDX metadata, the
key of the global KeyID must be configured on all physical packages via
TDH.SYS.KEY.CONFIG.  This SEAMCALL cannot run concurrently on different
cpus since it exclusively acquires the TDX module.

Implement a helper to run SEAMCALL on one (any) cpu for all packages in
serialized way, and run TDH.SYS.KEY.CONFIG on all packages using the
helper.

The TDX module uses the global KeyID to initialize its metadata (PAMTs).
Before TDX module can do that, all cachelines of PAMTs must be flushed.
Otherwise, they may silently corrupt the PAMTs later initialized by the
TDX module.

Use wbinvd to flush cache as PAMTs can be potentially large (~1/256th of
system RAM).

Flush cache before configuring the global KeyID on all packages, as
suggested by TDX specification.  In practice, the current generation of
TDX doesn't use the global KeyID in TDH.SYS.KEY.CONFIG.  Therefore in
practice flushing cache can be done after configuring the global KeyID
is done on all packages.  But the future generation of TDX may change
this behaviour, so just follow TDX specification's suggestion to flush
cache before configuring the global KeyID on all packages.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 96 ++++++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 2 files changed, 96 insertions(+), 1 deletion(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 2bf49d3d7cfe..bb15122fb8bd 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -23,6 +23,7 @@
 #include <asm/virtext.h>
 #include <asm/e820/api.h>
 #include <asm/pgtable.h>
+#include <asm/smp.h>
 #include <asm/tdx.h>
 #include "tdx.h"
 
@@ -398,6 +399,46 @@ static int seamcall_on_each_cpu(struct seamcall_ctx *sc)
 	return atomic_read(&sc->err);
 }
 
+/*
+ * Call the SEAMCALL on one (any) cpu for each physical package in
+ * serialized way.  Return immediately in case of any error while
+ * calling SEAMCALL on any cpu.
+ *
+ * Note for serialized calls 'struct seamcall_ctx::err' doesn't have
+ * to be atomic, but for simplicity just reuse it instead of adding
+ * a new one.
+ */
+static int seamcall_on_each_package_serialized(struct seamcall_ctx *sc)
+{
+	cpumask_var_t packages;
+	int cpu, ret = 0;
+
+	if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
+		return -ENOMEM;
+
+	for_each_online_cpu(cpu) {
+		if (cpumask_test_and_set_cpu(topology_physical_package_id(cpu),
+					packages))
+			continue;
+
+		ret = smp_call_function_single(cpu, seamcall_smp_call_function,
+				sc, true);
+		if (ret)
+			break;
+
+		/*
+		 * Doesn't have to use atomic_read(), but it doesn't
+		 * hurt either.
+		 */
+		ret = atomic_read(&sc->err);
+		if (ret)
+			break;
+	}
+
+	free_cpumask_var(packages);
+	return ret;
+}
+
 static inline bool p_seamldr_ready(void)
 {
 	return !!p_seamldr_info.p_seamldr_ready;
@@ -1320,6 +1361,21 @@ static int config_tdx_module(struct tdmr_info **tdmr_array, int tdmr_num,
 	return ret;
 }
 
+static int config_global_keyid(void)
+{
+	struct seamcall_ctx sc = { .fn = TDH_SYS_KEY_CONFIG };
+
+	/*
+	 * Configure the key of the global KeyID on all packages by
+	 * calling TDH.SYS.KEY.CONFIG on all packages.
+	 *
+	 * TDH.SYS.KEY.CONFIG may fail with entropy error (which is
+	 * a recoverable error).  Assume this is exceedingly rare and
+	 * just return error if encountered instead of retrying.
+	 */
+	return seamcall_on_each_package_serialized(&sc);
+}
+
 static int init_tdx_module(void)
 {
 	struct tdmr_info **tdmr_array;
@@ -1370,6 +1426,37 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_free_pamts;
 
+	/*
+	 * The same physical address associated with different KeyIDs
+	 * has separate cachelines.  Before using the new KeyID to access
+	 * some memory, the cachelines associated with the old KeyID must
+	 * be flushed, otherwise they may later silently corrupt the data
+	 * written with the new KeyID.  After cachelines associated with
+	 * the old KeyID are flushed, CPU speculative fetch using the old
+	 * KeyID is OK since the prefetched cachelines won't be consumed
+	 * by CPU core.
+	 *
+	 * TDX module initializes PAMTs using the global KeyID to crypto
+	 * protect them from malicious host.  Before that, the PAMTs are
+	 * used by kernel (with KeyID 0) and the cachelines associated
+	 * with the PAMTs must be flushed.  Given PAMTs are potentially
+	 * large (~1/256th of system RAM), just use WBINVD on all cpus to
+	 * flush the cache.
+	 *
+	 * In practice, the current generation of TDX doesn't use the
+	 * global KeyID in TDH.SYS.KEY.CONFIG.  Therefore in practice,
+	 * the cachelines can be flushed after configuring the global
+	 * KeyID on all pkgs is done.  But the future generation of TDX
+	 * may change this, so just follow the suggestion of TDX spec to
+	 * flush cache before TDH.SYS.KEY.CONFIG.
+	 */
+	wbinvd_on_all_cpus();
+
+	/* Config the key of global KeyID on all packages */
+	ret = config_global_keyid();
+	if (ret)
+		goto out_free_pamts;
+
 	/*
 	 * Return -EFAULT until all steps of TDX module
 	 * initialization are done.
@@ -1380,8 +1467,15 @@ static int init_tdx_module(void)
 	 * Free PAMTs allocated in construct_tdmrs() when TDX module
 	 * initialization fails.
 	 */
-	if (ret)
+	if (ret) {
+		/*
+		 * Part of PAMTs may already have been initialized by
+		 * TDX module.  Flush cache before returning them back
+		 * to kernel.
+		 */
+		wbinvd_on_all_cpus();
 		tdmrs_free_pamt_all(tdmr_array, tdmr_num);
+	}
 out_free_tdmrs:
 	/*
 	 * TDMRs are only used during initializing TDX module.  Always
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index d8e2800397af..bba8cabea4bb 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -122,6 +122,7 @@ struct tdmr_info {
 /*
  * TDX module SEAMCALL leaf functions
  */
+#define TDH_SYS_KEY_CONFIG	31
 #define TDH_SYS_INFO		32
 #define TDH_SYS_INIT		33
 #define TDH_SYS_LP_INIT		35

From patchwork Wed Apr  6 04:49:30 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@intel.com>
X-Patchwork-Id: 12802657
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 78270C4332F
	for <kvm@archiver.kernel.org>; Wed,  6 Apr 2022 07:24:49 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S240958AbiDFHYM (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 6 Apr 2022 03:24:12 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56460 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1456665AbiDFGon (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 6 Apr 2022 02:44:43 -0400
Received: from mga18.intel.com (mga18.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 98F8C1624A2;
        Tue,  5 Apr 2022 21:50:58 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1649220659; x=1680756659;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=I+dWPDiE4xf63aSyJo77PiXg+mtIis/JHI1I6V04uBU=;
  b=VnYeP5a6xZf1ud7WMcSOMT3bckQN5oFJHRqp0knZRRoey2yd7tJON/46
   SYNkVS7ijNxPTFjvUxPYi6AcOp6FM/zh74VamAPn09iW6GM/Uoykah68S
   1SCUU5xJphPVAde5UTEEUJnJVNgI1WAwgfoKPwpWnMADfaEY+92MUev+p
   65FJnx+U8cQE/5wC9ESohiY+UDwXrMXqlAVNQsIJw93KVMLQaMDwBAMo0
   Qw+2Ubg97fOsJTxpJkG7YQaQj0KM5M7HTGm8Dxc4IcRGMQFgv3/hKa8B+
   cUTf/Q175MWycEGy+U5PmeujMFMRTf4nIp5gHbu+fVEwGlJCvVlnhZwRC
   g==;
X-IronPort-AV: E=McAfee;i="6200,9189,10308"; a="243089913"
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="243089913"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:56 -0700
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="524302468"
Received: from dchang1-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.254.29.17])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:52 -0700
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
        len.brown@intel.com, tony.luck@intel.com,
        rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
        dan.j.williams@intel.com, peterz@infradead.org, ak@linux.intel.com,
        kirill.shutemov@linux.intel.com,
        sathyanarayanan.kuppuswamy@linux.intel.com,
        isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v3 18/21] x86/virt/tdx: Initialize all TDMRs
Date: Wed,  6 Apr 2022 16:49:30 +1200
Message-Id: 
 <9c180b4f34956a5d43bbf7894c423fcbe75fc7a9.1649219184.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <cover.1649219184.git.kai.huang@intel.com>
References: <cover.1649219184.git.kai.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

Initialize TDMRs via TDH.SYS.TDMR.INIT as the last step to complete the
TDX initialization.

All TDMRs need to be initialized using TDH.SYS.TDMR.INIT SEAMCALL before
the TDX memory can be used to run any TD guest.  The SEAMCALL internally
uses the global KeyID to initialize PAMTs in order to crypto protect
them from malicious host kernel.  TDH.SYS.TDMR.INIT can be done any cpu.

The time of initializing TDMR is proportional to the size of the TDMR.
To avoid long latency caused in one SEAMCALL, TDH.SYS.TDMR.INIT only
initializes an (implementation-specific) subset of PAMT entries of one
TDMR in one invocation.  The caller is responsible for calling
TDH.SYS.TDMR.INIT iteratively until all PAMT entries of the requested
TDMR are initialized.

Current implementation initializes TDMRs one by one.  It takes ~100ms on
a 2-socket machine with 2.2GHz CPUs and 64GB memory when the system is
idle.  Each TDH.SYS.TDMR.INIT takes ~7us on average.

TDX does allow different TDMRs to be initialized concurrently on
multiple CPUs. This parallel scheme could be introduced later when the
total initialization time becomes a real concern, e.g. on a platform
with a much bigger memory size.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 75 ++++++++++++++++++++++++++++++++++---
 arch/x86/virt/vmx/tdx/tdx.h |  1 +
 2 files changed, 71 insertions(+), 5 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index bb15122fb8bd..11bd1daffee3 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1376,6 +1376,65 @@ static int config_global_keyid(void)
 	return seamcall_on_each_package_serialized(&sc);
 }
 
+/* Initialize one TDMR */
+static int init_tdmr(struct tdmr_info *tdmr)
+{
+	u64 next;
+
+	/*
+	 * Initializing PAMT entries might be time-consuming (in
+	 * proportion to the size of the requested TDMR).  To avoid long
+	 * latency in one SEAMCALL, TDH.SYS.TDMR.INIT only initializes
+	 * an (implementation-defined) subset of PAMT entries in one
+	 * invocation.
+	 *
+	 * Call TDH.SYS.TDMR.INIT iteratively until all PAMT entries
+	 * of the requested TDMR are initialized (if next-to-initialize
+	 * address matches the end address of the TDMR).
+	 */
+	do {
+		struct tdx_module_output out;
+		int ret;
+
+		ret = seamcall(TDH_SYS_TDMR_INIT, tdmr->base, 0, 0, 0,
+				NULL, &out);
+		if (ret)
+			return ret;
+		/*
+		 * RDX contains 'next-to-initialize' address if
+		 * TDH.SYS.TDMR.INT succeeded.
+		 */
+		next = out.rdx;
+		if (need_resched())
+			cond_resched();
+	} while (next < tdmr->base + tdmr->size);
+
+	return 0;
+}
+
+/* Initialize all TDMRs */
+static int init_tdmrs(struct tdmr_info **tdmr_array, int tdmr_num)
+{
+	int i;
+
+	/*
+	 * Initialize TDMRs one-by-one for simplicity, though the TDX
+	 * architecture does allow different TDMRs to be initialized in
+	 * parallel on multiple CPUs.  Parallel initialization could
+	 * be added later when the time spent in the serialized scheme
+	 * becomes a real concern.
+	 */
+	for (i = 0; i < tdmr_num; i++) {
+		int ret;
+
+		ret = init_tdmr(tdmr_array[i]);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
 static int init_tdx_module(void)
 {
 	struct tdmr_info **tdmr_array;
@@ -1457,11 +1516,12 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_free_pamts;
 
-	/*
-	 * Return -EFAULT until all steps of TDX module
-	 * initialization are done.
-	 */
-	ret = -EFAULT;
+	/* Initialize TDMRs to complete the TDX module initialization */
+	ret = init_tdmrs(tdmr_array, tdmr_num);
+	if (ret)
+		goto out_free_pamts;
+
+	tdx_module_status = TDX_MODULE_INITIALIZED;
 out_free_pamts:
 	/*
 	 * Free PAMTs allocated in construct_tdmrs() when TDX module
@@ -1484,6 +1544,11 @@ static int init_tdx_module(void)
 	free_tdmrs(tdmr_array, tdmr_num);
 	kfree(tdmr_array);
 out:
+	if (ret)
+		pr_info("Failed to initialize TDX module.\n");
+	else
+		pr_info("TDX module initialized.\n");
+
 	return ret;
 }
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index bba8cabea4bb..212f83374c0a 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -126,6 +126,7 @@ struct tdmr_info {
 #define TDH_SYS_INFO		32
 #define TDH_SYS_INIT		33
 #define TDH_SYS_LP_INIT		35
+#define TDH_SYS_TDMR_INIT	36
 #define TDH_SYS_LP_SHUTDOWN	44
 #define TDH_SYS_CONFIG		45
 

From patchwork Wed Apr  6 04:49:31 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@intel.com>
X-Patchwork-Id: 12802655
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 6101AC433F5
	for <kvm@archiver.kernel.org>; Wed,  6 Apr 2022 07:24:49 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1384180AbiDFHYR (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 6 Apr 2022 03:24:17 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57010 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1456650AbiDFGon (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 6 Apr 2022 02:44:43 -0400
Received: from mga18.intel.com (mga18.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2A4DE1624A7;
        Tue,  5 Apr 2022 21:51:00 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1649220661; x=1680756661;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=Zl4wuB3hvdv3gjR+XZpjE+wS0F3OX5rinqQh5T1Gs7Q=;
  b=nsNEaaPGAznUK92tiArA+QoXkZ70etVsHd8dH501Gjdp5ytgCI9C/QKe
   SvCgRfXfKlTeQ3hfB+83IToINMXcRPG8i1IuacOMZGlXlb/RLby4kDdO7
   iOZT31N7Cl8o04JDjEywXqO6qtAm4chyancpE/Pb2UggXHyLgA3zsFxJc
   y4djIutIAiDSXD4U5RO+I3LKUzzCmNw1Yg6O/x+W8ltXxPTb3ZFls2pw9
   jLZUHPdw9mSAvHo3t9mRyio6QCaQmtGJCT2o0+iJXuSJOVreAqC/VvSDG
   pHCObcX2sjRKY8Tm/heID/YC1JZibiojisAgNxxZn5zFC0/NIRkLEdUjJ
   A==;
X-IronPort-AV: E=McAfee;i="6200,9189,10308"; a="243089918"
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="243089918"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:59 -0700
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="524302481"
Received: from dchang1-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.254.29.17])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:50:56 -0700
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
        len.brown@intel.com, tony.luck@intel.com,
        rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
        dan.j.williams@intel.com, peterz@infradead.org, ak@linux.intel.com,
        kirill.shutemov@linux.intel.com,
        sathyanarayanan.kuppuswamy@linux.intel.com,
        isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v3 19/21] x86: Flush cache of TDX private memory during
 kexec()
Date: Wed,  6 Apr 2022 16:49:31 +1200
Message-Id: 
 <2a2134dede7ab456154567b01d0c8f4243d9d3d6.1649219184.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <cover.1649219184.git.kai.huang@intel.com>
References: <cover.1649219184.git.kai.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

If TDX is ever enabled and/or used to run any TD guests, the cachelines
of TDX private memory, including PAMTs, used by TDX module need to be
flushed before transiting to the new kernel otherwise they may silently
corrupt the new kernel.

TDX module can only be initialized once during its lifetime.  TDX does
not have interface to reset TDX module to an uninitialized state so it
could be initialized again.  If the old kernel has enabled TDX, the new
kernel won't be able to use TDX again.  Therefore, ideally the old
kernel should shut down the TDX module if it is ever initialized so that
no SEAMCALLs can be made to it again.

However, shutting down the TDX module requires calling SEAMCALL, which
requires cpu being in VMX operation (VMXON has been done).  Currently,
only KVM does entering/leaving VMX operation, so there's no guarantee
that all cpus are in VMX operation during kexec().  Therefore, this
implementation doesn't shut down the TDX module, but only does cache
flush and leaves the TDX module open.

And it's fine to leave the module open.  If the new kernel wants to use
TDX, it needs to go through the initialization process, and it will fail
at the first SEAMCALL due to the TDX module is not in the uninitialized
state.  If the new kernel doesn't want to use TDX, then the TDX module
won't run at all.

Following the implementation of SME support, use wbinvd() to flush cache
in stop_this_cpu().  Introduce a new function platform_has_tdx() to only
check whether the platform is TDX-capable and do wbinvd() when it is
true.  platform_has_tdx() returns true when SEAMRR is enabled and there
are enough TDX private KeyIDs to run at least one TD guest (both of
which are detected at boot time).  TDX is enabled on demand at runtime
and it has a state machine with mutex to protect multiple callers to
initialize TDX in parallel.  Getting TDX module state needs to hold the
mutex but stop_this_cpu() runs in interrupt context, so just check
whether platform supports TDX and flush cache.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/include/asm/tdx.h  |  2 ++
 arch/x86/kernel/process.c   | 15 ++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.c | 14 ++++++++++++++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index c8af2ba6bb8a..513b9ce9a870 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -94,10 +94,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 void tdx_detect_cpu(struct cpuinfo_x86 *c);
 int tdx_detect(void);
 int tdx_init(void);
+bool platform_has_tdx(void);
 #else
 static inline void tdx_detect_cpu(struct cpuinfo_x86 *c) { }
 static inline int tdx_detect(void) { return -ENODEV; }
 static inline int tdx_init(void) { return -ENODEV; }
+static inline bool platform_has_tdx(void) { return false; }
 #endif /* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index dbaf12c43fe1..0238bd29af8a 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -769,8 +769,21 @@ void __noreturn stop_this_cpu(void *dummy)
 	 *
 	 * Test the CPUID bit directly because the machine might've cleared
 	 * X86_FEATURE_SME due to cmdline options.
+	 *
+	 * In case of kexec, similar to SME, if TDX is ever enabled, the
+	 * cachelines of TDX private memory (including PAMTs) used by TDX
+	 * module need to be flushed before transiting to the new kernel,
+	 * otherwise they may silently corrupt the new kernel.
+	 *
+	 * Note TDX is enabled on demand at runtime, and enabling TDX has a
+	 * state machine protected with a mutex to prevent concurrent calls
+	 * from multiple callers.  Holding the mutex is required to get the
+	 * TDX enabling status, but this function runs in interrupt context.
+	 * So to make it simple, always flush cache when platform supports
+	 * TDX (detected at boot time), regardless whether TDX is truly
+	 * enabled by kernel.
 	 */
-	if (cpuid_eax(0x8000001f) & BIT(0))
+	if ((cpuid_eax(0x8000001f) & BIT(0)) || platform_has_tdx())
 		native_wbinvd();
 	for (;;) {
 		/*
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 11bd1daffee3..031af7b83cea 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -1687,3 +1687,17 @@ int tdx_init(void)
 	return ret;
 }
 EXPORT_SYMBOL_GPL(tdx_init);
+
+/**
+ * platform_has_tdx - Whether platform supports TDX
+ *
+ * Check whether platform supports TDX (i.e. TDX is enabled in BIOS),
+ * regardless whether TDX is truly enabled by kernel.
+ *
+ * Return true if SEAMRR is enabled, and there are sufficient TDX private
+ * KeyIDs to run TD guests.
+ */
+bool platform_has_tdx(void)
+{
+	return seamrr_enabled() && tdx_keyid_sufficient();
+}

From patchwork Wed Apr  6 04:49:32 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@intel.com>
X-Patchwork-Id: 12802656
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 881CBC433FE
	for <kvm@archiver.kernel.org>; Wed,  6 Apr 2022 07:24:49 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1385845AbiDFHYU (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 6 Apr 2022 03:24:20 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56518 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1456892AbiDFGor (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 6 Apr 2022 02:44:47 -0400
Received: from mga18.intel.com (mga18.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 33A15D39B1;
        Tue,  5 Apr 2022 21:51:04 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1649220664; x=1680756664;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=9gK6ukJipXBNl/xV9In8MOBFfm49FnqwURztoNRxISs=;
  b=WgP5884SOAmV69zoKxAFLYLOdoAGQ0pcE1NwgB2kX6cP47lqv1jDRCai
   gBt0LyaecNdTGwBKhA+PuIlRpnXcsrhl8KstAdL7+sY57hfjYdZXDHCG5
   NmeZIDLITW/iz6dj/h0iUiaW1HmvAubv+l+Lx54QAN3LB5Sv1LFj7KhP6
   ABtmoQOrTz9ZTrxEM2DP+2wjobVZweSUlmFFaXQIF0SDD+lYuxOvdvoGA
   Yo4rrMH1fqdJoS2wH5MiwNkMnrz548rJmcMceKG1q3jr5+ybeEOphjRFQ
   XQ19NO+6PIp1f/cEuuxVWv1U6oTdm61vFHi6HKnYZpc4T870DAnoqgoVr
   A==;
X-IronPort-AV: E=McAfee;i="6200,9189,10308"; a="243089923"
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="243089923"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:51:03 -0700
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="524302499"
Received: from dchang1-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.254.29.17])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:51:00 -0700
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
        len.brown@intel.com, tony.luck@intel.com,
        rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
        dan.j.williams@intel.com, peterz@infradead.org, ak@linux.intel.com,
        kirill.shutemov@linux.intel.com,
        sathyanarayanan.kuppuswamy@linux.intel.com,
        isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v3 20/21] x86/virt/tdx: Add kernel command line to opt-in TDX
 host support
Date: Wed,  6 Apr 2022 16:49:32 +1200
Message-Id: 
 <0d50d13e5f9bd590ee97ff150f1393c4d99a8fa0.1649219184.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <cover.1649219184.git.kai.huang@intel.com>
References: <cover.1649219184.git.kai.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

Enabling TDX consumes additional memory (used by TDX as metadata) and
additional initialization time.  Introduce a kernel command line to
allow to opt-in TDX host kernel support when user truly wants to use
TDX.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  6 ++++++
 arch/x86/virt/vmx/tdx/tdx.c                     | 14 ++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 3f1cc5e317ed..cfa5b36890ea 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5790,6 +5790,12 @@
 
 	tdfx=		[HW,DRM]
 
+	tdx_host=	[X86-64, TDX]
+			Format: {on|off}
+			on: Enable TDX host kernel support
+			off: Disable TDX host kernel support
+			Default is off.
+
 	test_suspend=	[SUSPEND][,N]
 			Specify "mem" (for Suspend-to-RAM) or "standby" (for
 			standby suspend) or "freeze" (for suspend type freeze)
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 031af7b83cea..fee243cd454f 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -116,6 +116,16 @@ static struct tdsysinfo_struct tdx_sysinfo;
 /* TDX global KeyID to protect TDX metadata */
 static u32 tdx_global_keyid;
 
+static bool enable_tdx_host;
+
+static int __init tdx_host_setup(char *s)
+{
+	if (!strcmp(s, "on"))
+		enable_tdx_host = true;
+	return 1;
+}
+__setup("tdx_host=", tdx_host_setup);
+
 static bool __seamrr_enabled(void)
 {
 	return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
@@ -500,6 +510,10 @@ static int detect_p_seamldr(void)
 
 static int __tdx_detect(void)
 {
+	/* Disabled by kernel command line */
+	if (!enable_tdx_host)
+		goto no_tdx_module;
+
 	/* The TDX module is not loaded if SEAMRR is disabled */
 	if (!seamrr_enabled()) {
 		pr_info("SEAMRR not enabled.\n");

From patchwork Wed Apr  6 04:49:33 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kai Huang <kai.huang@intel.com>
X-Patchwork-Id: 12802658
Return-Path: <kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 72CEFC433EF
	for <kvm@archiver.kernel.org>; Wed,  6 Apr 2022 07:24:57 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1386221AbiDFHYY (ORCPT <rfc822;kvm@archiver.kernel.org>);
        Wed, 6 Apr 2022 03:24:24 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56480 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1456759AbiDFGop (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 6 Apr 2022 02:44:45 -0400
Received: from mga18.intel.com (mga18.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E07DC163E1C;
        Tue,  5 Apr 2022 21:51:07 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1649220668; x=1680756668;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=qjcIKogOhU+tgZr1dG7TF9hG/mjII+hb5Kitd+0tJt4=;
  b=JIj3TZj6PLBeqnan/rKszpcRxIhgiH83eC0heGotrOVtIyyPdKpptwyU
   U7vt3F656ycugaY3O3XzCodGK/LgBjL+d1uApfygz8UymmKjVcQcvFbm5
   jCmG0KFya6AKAoa4QqTX2pyEaqacMMkcIITtzSKoxk2rJm8RWBVlnznAO
   1lHUil6PUgvWeeLup4onkFkYVYCFsk5F+DFXDv4cPGXHC21m08kQa5p4h
   sIR/jeSCFifseaI81oToqwSCIMcf2mF80lliOJCjSXJa4uwRBurU5NqkY
   RAOz2/YAZl+Q+Qdw6Vts3uypW2I8ngVhJ2JkUnCm0ba2Nj+rpQSPLA7OS
   g==;
X-IronPort-AV: E=McAfee;i="6200,9189,10308"; a="243089928"
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="243089928"
Received: from orsmga006.jf.intel.com ([10.7.209.51])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:51:07 -0700
X-IronPort-AV: E=Sophos;i="5.90,239,1643702400";
   d="scan'208";a="524302516"
Received: from dchang1-mobl3.amr.corp.intel.com (HELO
 khuang2-desk.gar.corp.intel.com) ([10.254.29.17])
  by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 05 Apr 2022 21:51:04 -0700
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: seanjc@google.com, pbonzini@redhat.com, dave.hansen@intel.com,
        len.brown@intel.com, tony.luck@intel.com,
        rafael.j.wysocki@intel.com, reinette.chatre@intel.com,
        dan.j.williams@intel.com, peterz@infradead.org, ak@linux.intel.com,
        kirill.shutemov@linux.intel.com,
        sathyanarayanan.kuppuswamy@linux.intel.com,
        isaku.yamahata@intel.com, kai.huang@intel.com
Subject: [PATCH v3 21/21] Documentation/x86: Add documentation for TDX host
 support
Date: Wed,  6 Apr 2022 16:49:33 +1200
Message-Id: 
 <62ca658dd87d6ccb56f4be44d077c6e47edbab02.1649219184.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <cover.1649219184.git.kai.huang@intel.com>
References: <cover.1649219184.git.kai.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

Add documentation for TDX host kernel support.  There is already one
file Documentation/x86/tdx.rst containing documentation for TDX guest
internals.  Also reuse it for TDX host kernel support.

Introduce a new level menu "TDX Guest Internals" and move existing
materials under it, and add a new menu for TDX host kernel support.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---
 Documentation/x86/tdx.rst | 326 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 313 insertions(+), 13 deletions(-)

diff --git a/Documentation/x86/tdx.rst b/Documentation/x86/tdx.rst
index 8ca60256511b..d52ba7cf982d 100644
--- a/Documentation/x86/tdx.rst
+++ b/Documentation/x86/tdx.rst
@@ -7,8 +7,308 @@ Intel Trust Domain Extensions (TDX)
 Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from
 the host and physical attacks by isolating the guest register state and by
 encrypting the guest memory. In TDX, a special TDX module sits between the
-host and the guest, and runs in a special mode and manages the guest/host
-separation.
+host and the guest, and runs in a special Secure Arbitration Mode (SEAM)
+and manages the guest/host separation.
+
+TDX Host Kernel Support
+=======================
+
+SEAM is an extension to the VMX architecture to define a new VMX root
+operation called 'SEAM VMX root' and a new VMX non-root operation called
+'VMX non-root'. Collectively, the SEAM VMX root and SEAM VMX non-root
+execution modes are called operation in SEAM.
+
+SEAM VMX root operation is designed to host a CPU-attested, software
+module called 'Intel TDX module' to manage virtual machine (VM) guests
+called Trust Domains (TD). The TDX module implements the functions to
+build, tear down, and start execution of TD VMs. SEAM VMX root is also
+designed to additionally host a CPU-attested, software module called the
+'Intel Persistent SEAMLDR (Intel P-SEAMLDR)' module to load and update
+the Intel TDX module.
+
+The software in SEAM VMX root runs in the memory region defined by the
+SEAM range register (SEAMRR). Access to this range is restricted to SEAM
+VMX root operation. Code fetches outside of SEAMRR when in SEAM VMX root
+operation are meant to be disallowed and lead to an unbreakable shutdown.
+
+TDX leverages Intel Multi-Key Total Memory Encryption (MKTME) to crypto
+protect TD guests. TDX reserves part of MKTME KeyID space as TDX private
+KeyIDs, which can only be used by software runs in SEAM. The physical
+address bits reserved for encoding TDX private KeyID are treated as
+reserved bits when not in SEAM operation. The partitioning of MKTME
+KeyIDs and TDX private KeyIDs is configured by BIOS.
+
+Host kernel transits to either the P-SEAMLDR or the TDX module via the
+new SEAMCALL instruction. SEAMCALL leaf functions are host-side interface
+functions defined by the P-SEAMLDR and the TDX module around the new
+SEAMCALL instruction. They are similar to a hypercall, except they are
+made by host kernel to the SEAM software modules.
+
+Before being able to manage TD guests, the TDX module must be loaded
+into SEAMRR and properly initialized using SEAMCALLs defined by TDX
+architecture. The current implementation assumes both P-SEAMLDR and
+TDX module are loaded by BIOS before the kernel boots.
+
+Detection and Initialization
+----------------------------
+
+The presence of SEAMRR is reported via a new SEAMRR bit (15) of the
+IA32_MTRRCAP MSR. The SEAMRR range registers consist of a pair of MSRs:
+IA32_SEAMRR_PHYS_BASE (0x1400) and IA32_SEAMRR_PHYS_MASK (0x1401).
+SEAMRR is enabled when bit 3 of IA32_SEAMRR_PHYS_BASE is set and
+bit 10/11 of IA32_SEAMRR_PHYS_MASK are set.
+
+However, there is no CPUID or MSR for querying the presence of the TDX
+module or the P-SEAMLDR. SEAMCALL fails with VMfailInvalid when SEAM
+software is not loaded, so SEAMCALL can be used to detect P-SEAMLDR and
+TDX module. SEAMLDR.INFO SEAMCALL is used to detect both P-SEAMLDR and
+TDX module.  Success of the SEAMCALL means P-SEAMLDR is loaded, and the
+P-SEAMLDR information returned by the SEAMCALL further tells whether TDX
+module is loaded or not.
+
+User can check whether the TDX module is initialized via dmesg:
+
+|  [..] tdx: P-SEAMLDR: version 0x0, vendor_id: 0x8086, build_date: 20211209, build_num 160, major 1, minor 0
+|  [..] tdx: TDX module detected.
+|  [..] tdx: TDX module: vendor_id 0x8086, major_version 1, minor_version 0, build_date 20211209, build_num 160
+|  [..] tdx: TDX module initialized.
+
+Initializing TDX takes time (in seconds) and additional memory space (for
+metadata). Both are affected by the size of total usable memory which the
+TDX module is configured with. In particular, the TDX metadata consumes
+~1/256 of TDX usable memory. This leads to a non-negligible burden as the
+current implementation simply treats all E820 RAM ranges as TDX usable
+memory (all system RAM meets the security requirements on the first
+generation of TDX-capable platforms).
+
+Therefore, kernel uses lazy TDX initialization to avoid such burden for
+all users on a TDX-capable platform. The software component (e.g. KVM)
+which wants to use TDX is expected to call two helpers below to detect
+and initialize the TDX module until TDX is truly needed:
+
+        if (tdx_detect())
+                goto no_tdx;
+        if (tdx_init())
+                goto no_tdx;
+
+TDX detection and initialization are done via SEAMCALLs which require the
+CPU in VMX operation. The caller of the above two helpers should ensure
+that condition.
+
+Currently, only KVM is the only user of TDX and KVM already handles
+entering/leaving VMX operation. Letting KVM initialize TDX on demand
+avoids handling entering/leaving VMX operation, which isn't trivial, in
+core-kernel.
+
+In addition, a new kernel parameter 'tdx_host={on/off}' can be used to
+force disabling the TDX capability by the admin.
+
+TDX initialization includes a step where certain SEAMCALL must be called
+on every BIOS-enabled CPU (with a ACPI MADT entry marked as enabled).  As
+a result, CPU hotplug is temporarily disabled during initializing the TDX
+module.  Also, user should avoid using kernel command lines which impact
+kernel usable cpus and/or online cpus (such as 'maxcpus', 'nr_cpus' and
+'possible_cpus'), or offlining CPUs before initializing TDX. Doing so
+will lead to the mismatch between online CPUs and BIOS-enabled CPUs,
+resulting TDX module initialization failure.
+
+TDX Memory Management
+---------------------
+
+TDX architecture manages TDX memory via below data structures:
+
+- Convertible Memory Regions (CMRs)
+
+TDX provides increased levels of memory confidentiality and integrity.
+This requires special hardware support for features like memory
+encryption and storage of memory integrity checksums. A CMR represents a
+memory range that meets those requirements and can be used as TDX memory.
+The list of CMRs can be queried from TDX module.
+
+- TD Memory Regions (TDMRs)
+
+The TDX module manages TDX usable memory via TD Memory Regions (TDMR).
+Each TDMR has information of its base and size, its metadata (PAMT)'s
+base and size, and an array of reserved areas to hold the memory region
+address holes and PAMTs. TDMR must be 1G aligned and in 1G granularity.
+
+Host kernel is responsible for choosing which convertible memory regions
+(reside in CMRs) to use as TDX memory, and constructing a list of TDMRs
+to cover all those memory regions, and configure the TDMRs to TDX module.
+
+- Physical Address Metadata Tables (PAMTs)
+
+This metadata essentially serves as the 'struct page' for the TDX module,
+recording things like which TD guest 'owns' a given page of memory. Each
+TDMR has a dedicated PAMT.
+
+PAMT is not reserved by the hardware upfront and must be allocated by the
+kernel and given to the TDX module. PAMT for a given TDMR doesn't have
+to be within that TDMR, but a PAMT must be within one CMR.  Additionally,
+if a PAMT overlaps with a TDMR, the overlapping part must be marked as
+reserved in that particular TDMR.
+
+Kernel Policy of TDX Memory
+---------------------------
+
+The first generation of TDX essentially guarantees that all system RAM
+memory regions (excluding the memory below 1MB) are covered by CMRs.
+Currently, to avoid having to modify the page allocator to support both
+TDX and non-TDX allocation, the kernel choose to use all system RAM as
+TDX memory. A list of TDMRs are constructed based on all RAM entries in
+e820 table and configured to the TDX module.
+
+Limitations
+-----------
+
+Constructing TDMRs
+~~~~~~~~~~~~~~~~~~
+
+Currently, the kernel tries to create one TDMR for each RAM entry in
+e820. 'e820_table' is used to find all RAM entries to honor 'mem' and
+'memmap' kernel command line. However, 'memmap' command line may also
+result in many discrete RAM entries. TDX architecturally only supports a
+limited number of TDMRs (currently 64). In this case, constructing TDMRs
+may fail due to exceeding the maximum number of TDMRs. The user is
+responsible for not doing so otherwise TDX may not be available. This
+can be further enhanced by supporting merging adjacent TDMRs.
+
+PAMT allocation
+~~~~~~~~~~~~~~~
+
+Currently, the kernel allocates PAMT for each TDMR separately using
+alloc_contig_pages(). alloc_contig_pages() only guarantees the PAMT is
+allocated from a given NUMA node, but doesn't have control over
+allocating PAMT from a given TDMR range. This may result in all PAMTs
+on one NUMA node being within one single TDMR. PAMTs overlapping with
+a given TDMR must be put into the TDMR's reserved areas too. However TDX
+only supports a limited number of reserved areas per TDMR (currently 16),
+thus too many PAMTs in one NUMA node may result in constructing TDMR
+failure due to exceeding TDMR's maximum reserved areas.
+
+The user is responsible for not creating too many discrete RAM entries
+on one NUMA node, which may result in having too many TDMRs on one node,
+which eventually results in constructing TDMR failure due to exceeding
+the maximum reserved areas. This can be further enhanced to support
+per-NUMA-node PAMT allocation, which could reduce the number of PAMT to
+1 for each node.
+
+TDMR initialization
+~~~~~~~~~~~~~~~~~~~
+
+Currently, the kernel initialize TDMRs one by one. This may take couple
+of seconds to finish on large memory systems (TBs). This can be further
+enhanced by allowing initializing different TDMRs in parallel on multiple
+cpus.
+
+CPU hotplug
+~~~~~~~~~~~
+
+The first generation of TDX architecturally doesn't support ACPI CPU
+hotplug. All logical cpus are enabled by BIOS in MADT table. Also, the
+first generation of TDX-capable platforms don't support ACPI CPU hotplug
+either. Since this physically cannot happen, currently kernel doesn't
+have any check in ACPI CPU hotplug code path to disable it.
+
+Also, only TDX module initialization requires all BIOS-enabled cpus are
+online. After the initialization, any logical cpu can be brought down
+and brought up to online again later. Therefore this series doesn't
+change logical CPU hotplug either.
+
+This can be enhanced when any future generation of TDX starts to support
+ACPI cpu hotplug.
+
+Memory hotplug
+~~~~~~~~~~~~~~
+
+The first generation of TDX architecturally doesn't support memory
+hotplug. The CMRs are generated by BIOS during boot and it is fixed
+during machine's runtime.
+
+However, the first generation of TDX-capable platforms don't support ACPI
+memory hotplug. Since it physically cannot happen, currently kernel
+doesn't have any check in ACPI memory hotplug code path to disable it.
+
+A special case of memory hotplug is adding NVDIMM as system RAM using
+kmem driver. However the first generation of TDX-capable platforms
+cannot turn on TDX and NVDIMM simultaneously, so in practice this cannot
+happen either.
+
+Another case is admin can use 'memmap' kernel command line to create
+legacy PMEMs and use them as TD guest memory, or theoretically, can use
+kmem driver to add them as system RAM. Current implementation always
+includes legacy PMEMs when constructing TDMRs so they are also TDX memory.
+So legacy PMEMs can either be used as TD guest memory directly or can be
+converted to system RAM via kmem driver.
+
+This can be enhanced when future generation of TDX starts to support ACPI
+memory hotplug, or NVDIMM and TDX can be enabled simultaneously on the
+same platform.
+
+Kexec interaction
+~~~~~~~~~~~~~~~~~
+
+The TDX module can be initialized only once during its lifetime. The
+first generation of TDX doesn't have interface to reset TDX module to
+uninitialized state so it can be initialized again.
+
+This implies:
+
+  - If the old kernel fails to initialize TDX, the new kernel cannot
+    use TDX too unless the new kernel fixes the bug which leads to
+    initialization failure in the old kernel and can resume from where
+    the old kernel stops. This requires certain coordination between
+    the two kernels.
+
+  - If the old kernel has initialized TDX successfully, the new kernel
+    may be able to use TDX if the two kernels have exactly the same
+    configurations on the TDX module. It further requires the new kernel
+    to reserve the TDX metadata pages (allocated by the old kernel) in
+    its page allocator. It also requires coordination between the two
+    kernels. Furthermore, if kexec() is done when there are active TD
+    guests running, the new kernel cannot use TDX because it's extremely
+    hard for the old kernel to pass all TDX private pages to the new
+    kernel.
+
+Given that, the current implementation doesn't support TDX after kexec()
+(except the old kernel hasn't initialized TDX at all).
+
+The current implementation doesn't shut down TDX module but leaves it
+open during kexec().  This is because shutting down TDX module requires
+CPU being in VMX operation but there's no guarantee of this during
+kexec(). Leaving the TDX module open is not the best case, but it is OK
+since the new kernel won't be able to use TDX anyway (therefore TDX
+module won't run at all).
+
+This can be further enhanced when core-kernele (non-KVM) can handle
+VMXON.
+
+If TDX is ever enabled and/or used to run any TD guests, the cachelines
+of TDX private memory, including PAMTs, used by TDX module need to be
+flushed before transiting to the new kernel otherwise they may silently
+corrupt the new kernel. Similar to SME, the current implementation
+flushes cache in stop_this_cpu().
+
+Initialization errors
+~~~~~~~~~~~~~~~~~~~~~
+
+Currently, any error happened during TDX initialization moves the TDX
+module to the SHUTDOWN state. No SEAMCALL is allowed in this state, and
+the TDX module cannot be re-initialized without a hard reset.
+
+This can be further enhanced to treat some errors as recoverable errors
+and let the caller retry later. A more detailed state machine can be
+added to record the internal state of TDX module, and the initialization
+can resume from that state in the next try.
+
+Specifically, there are three cases that can be treated as recoverable
+error: 1) -ENOMEM (i.e. due to PAMT allocation); 2) TDH.SYS.CONFIG error
+due to TDH.SYS.LP.INIT is not called on all cpus (i.e. due to offline
+cpus); 3) -EPERM when the caller doesn't guarantee all cpus are in VMX
+operation.
+
+TDX Guest Internals
+===================
 
 Since the host cannot directly access guest registers or memory, much
 normal functionality of a hypervisor must be moved into the guest. This is
@@ -20,7 +320,7 @@ TDX includes new hypercall-like mechanisms for communicating from the
 guest to the hypervisor or the TDX module.
 
 New TDX Exceptions
-==================
+------------------
 
 TDX guests behave differently from bare-metal and traditional VMX guests.
 In TDX guests, otherwise normal instructions or memory accesses can cause
@@ -30,7 +330,7 @@ Instructions marked with an '*' conditionally cause exceptions.  The
 details for these instructions are discussed below.
 
 Instruction-based #VE
----------------------
+~~~~~~~~~~~~~~~~~~~~~
 
 - Port I/O (INS, OUTS, IN, OUT)
 - HLT
@@ -41,7 +341,7 @@ Instruction-based #VE
 - CPUID*
 
 Instruction-based #GP
----------------------
+~~~~~~~~~~~~~~~~~~~~~
 
 - All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
   VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
@@ -52,7 +352,7 @@ Instruction-based #GP
 - RDMSR*,WRMSR*
 
 RDMSR/WRMSR Behavior
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 MSR access behavior falls into three categories:
 
@@ -73,7 +373,7 @@ trapping and handling in the TDX module.  Other than possibly being slow,
 these MSRs appear to function just as they would on bare metal.
 
 CPUID Behavior
---------------
+~~~~~~~~~~~~~~
 
 For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
 return values (in guest EAX/EBX/ECX/EDX) are configurable by the
@@ -91,7 +391,7 @@ how to handle. The guest kernel may ask the hypervisor for the value with
 a hypercall.
 
 #VE on Memory Accesses
-======================
+----------------------
 
 There are essentially two classes of TDX memory: private and shared.
 Private memory receives full TDX protections.  Its content is protected
@@ -104,7 +404,7 @@ entries.  This helps ensure that a guest does not place sensitive
 information in shared memory, exposing it to the untrusted hypervisor.
 
 #VE on Shared Memory
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 Access to shared mappings can cause a #VE.  The hypervisor ultimately
 controls whether a shared memory access causes a #VE, so the guest must be
@@ -124,7 +424,7 @@ be careful not to access device MMIO regions unless it is also prepared to
 handle a #VE.
 
 #VE on Private Pages
---------------------
+~~~~~~~~~~~~~~~~~~~~
 
 Accesses to private mappings can also cause #VEs.  Since all kernel memory
 is also private memory, the kernel might theoretically need to handle a
@@ -142,7 +442,7 @@ The hypervisor is permitted to unilaterally move accepted pages to a
 to handle the exception.
 
 Linux #VE handler
-=================
+-----------------
 
 Just like page faults or #GP's, #VE exceptions can be either handled or be
 fatal.  Typically, unhandled userspace #VE's result in a SIGSEGV.
@@ -163,7 +463,7 @@ While the block is in place, #VE's are elevated to double faults (#DF)
 which are not recoverable.
 
 MMIO handling
-=============
+-------------
 
 In non-TDX VMs, MMIO is usually implemented by giving a guest access to
 a mapping which will cause a VMEXIT on access, and then the hypervisor emulates
@@ -185,7 +485,7 @@ MMIO access via other means (like structure overlays) may result in an
 oops.
 
 Shared Memory Conversions
-=========================
+-------------------------
 
 All TDX guest memory starts out as private at boot.  This memory can not
 be accessed by the hypervisor.  However some kernel users like device