From patchwork Fri Dec 9 06:52:28 2022
From: Kai Huang <kai.huang@intel.com>
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: linux-mm@kvack.org, dave.hansen@intel.com, peterz@infradead.org,
	tglx@linutronix.de, seanjc@google.com, pbonzini@redhat.com,
	dan.j.williams@intel.com, rafael.j.wysocki@intel.com,
	kirill.shutemov@linux.intel.com, ying.huang@intel.com,
	reinette.chatre@intel.com, len.brown@intel.com, tony.luck@intel.com,
	ak@linux.intel.com, isaku.yamahata@intel.com, chao.gao@intel.com,
	sathyanarayanan.kuppuswamy@linux.intel.com, bagasdotme@gmail.com,
	sagis@google.com, imammedo@redhat.com, kai.huang@intel.com
Subject: [PATCH v8 07/16] x86/virt/tdx: Use all system memory when
 initializing TDX module as TDX memory
Date: Fri, 9 Dec 2022 19:52:28 +1300
Message-Id: <8aab33a7db7a408beb403950e21f693b0b0f1f2b.1670566861.git.kai.huang@intel.com>
X-Mailer: git-send-email 2.38.1
As a step of initializing the TDX module, the kernel needs to tell the
TDX module which memory regions can be used by the TDX module as TDX
guest memory.

TDX reports a list of "Convertible Memory Regions" (CMRs) to tell the
kernel which memory is TDX compatible.  The kernel needs to build a list
of memory regions (out of CMRs) as "TDX-usable" memory and pass them to
the TDX module.  Once this is done, those "TDX-usable" memory regions
are fixed during the module's lifetime.

The initial support of TDX guests will only allocate TDX guest memory
from the global page allocator.  To keep things simple, just make sure
all pages in the page allocator are TDX memory.

To guarantee that, stash off the memblock memory regions at the time of
initializing the TDX module as TDX's own usable memory regions, and in
the meantime, register a TDX memory notifier to reject onlining any new
memory in memory hotplug.
This approach works as in practice all boot-time present DIMMs are TDX
convertible memory.  However, if any non-TDX-convertible memory has been
hot-added (e.g. CXL memory via the kmem driver) before initializing the
TDX module, the module initialization will fail.

This can also be enhanced in the future, e.g. by allowing the addition
of non-TDX memory to a separate NUMA node.  In this case, the
"TDX-capable" nodes and the "non-TDX-capable" nodes can co-exist, but
the kernel/userspace needs to guarantee that memory pages for TDX guests
are always allocated from the "TDX-capable" nodes.

Signed-off-by: Kai Huang <kai.huang@intel.com>
---

v7 -> v8:
 - Trimmed down changelog (Dave).
 - Changed to use PHYS_PFN() and PFN_PHYS() throughout this series
   (Ying).
 - Moved memory hotplug handling from arch_add_memory() to
   memory_notifier (Dan/David).
 - Removed 'nid' from 'struct tdx_memblock' to later patch (Dave).
 - {build|free}_tdx_memory() -> {build|free}_tdx_memlist() (Dave).
 - Removed pfn_covered_by_cmr() check as no code to trim CMRs now.
 - Improved the comment around the first 1MB (Dave).
 - Added a comment around reserve_real_mode() to point out TDX code
   relies on the first 1MB being reserved (Ying).
 - Added a comment to explain why the new online memory range cannot
   cross multiple TDX memory blocks (Dave).
 - Improved other comments (Dave).

---
 arch/x86/Kconfig            |   1 +
 arch/x86/kernel/setup.c     |   2 +
 arch/x86/virt/vmx/tdx/tdx.c | 160 +++++++++++++++++++++++++++++++++++-
 3 files changed, 162 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index dd333b46fafb..b36129183035 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1959,6 +1959,7 @@ config INTEL_TDX_HOST
 	depends on X86_64
 	depends on KVM_INTEL
 	depends on X86_X2APIC
+	select ARCH_KEEP_MEMBLOCK
 	help
 	  Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
 	  host and certain physical attacks.  This option enables necessary TDX

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 216fee7144ee..3a841a77fda4 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1174,6 +1174,8 @@ void __init setup_arch(char **cmdline_p)
 	 *
 	 * Moreover, on machines with SandyBridge graphics or in setups that use
 	 * crashkernel the entire 1M is reserved anyway.
+	 *
+	 * Note the host kernel TDX also requires the first 1MB being reserved.
 	 */
 	reserve_real_mode();
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 6fe505c32599..f010402f443d 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -13,6 +13,13 @@
 #include
 #include
 #include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
 #include
 #include
 #include
@@ -25,6 +32,12 @@ enum tdx_module_status_t {
 	TDX_MODULE_ERROR
 };
 
+struct tdx_memblock {
+	struct list_head list;
+	unsigned long start_pfn;
+	unsigned long end_pfn;
+};
+
 static u32 tdx_keyid_start __ro_after_init;
 static u32 nr_tdx_keyids __ro_after_init;
 
@@ -32,6 +45,9 @@ static enum tdx_module_status_t tdx_module_status;
 /* Prevent concurrent attempts on TDX detection and initialization */
 static DEFINE_MUTEX(tdx_module_lock);
 
+/* All TDX-usable memory regions */
+static LIST_HEAD(tdx_memlist);
+
 /*
  * tdx_keyid_start and nr_tdx_keyids indicate that TDX is uninitialized.
  * This is used in TDX initialization error paths to take it from
@@ -69,6 +85,50 @@ static int __init record_keyid_partitioning(void)
 	return 0;
 }
 
+static bool is_tdx_memory(unsigned long start_pfn, unsigned long end_pfn)
+{
+	struct tdx_memblock *tmb;
+
+	/* Empty list means TDX isn't enabled. */
+	if (list_empty(&tdx_memlist))
+		return true;
+
+	list_for_each_entry(tmb, &tdx_memlist, list) {
+		/*
+		 * The new range is TDX memory if it is fully covered by
+		 * any TDX memory block.
+		 *
+		 * Note TDX memory blocks originate from memblock memory
+		 * regions, which can only be contiguous when two regions
+		 * have different NUMA nodes or flags.  Therefore the new
+		 * range cannot cross multiple TDX memory blocks.
+		 */
+		if (start_pfn >= tmb->start_pfn && end_pfn <= tmb->end_pfn)
+			return true;
+	}
+	return false;
+}
+
+static int tdx_memory_notifier(struct notifier_block *nb, unsigned long action,
+			       void *v)
+{
+	struct memory_notify *mn = v;
+
+	if (action != MEM_GOING_ONLINE)
+		return NOTIFY_OK;
+
+	/*
+	 * Not all memory is compatible with TDX.  Reject
+	 * to online any incompatible memory.
+	 */
+	return is_tdx_memory(mn->start_pfn, mn->start_pfn + mn->nr_pages) ?
+		NOTIFY_OK : NOTIFY_BAD;
+}
+
+static struct notifier_block tdx_memory_nb = {
+	.notifier_call = tdx_memory_notifier,
+};
+
 static int __init tdx_init(void)
 {
 	int err;
@@ -89,6 +149,13 @@ static int __init tdx_init(void)
 		goto no_tdx;
 	}
 
+	err = register_memory_notifier(&tdx_memory_nb);
+	if (err) {
+		pr_info("initialization failed: register_memory_notifier() failed (%d)\n",
+				err);
+		goto no_tdx;
+	}
+
 	return 0;
 no_tdx:
 	clear_tdx();
@@ -209,6 +276,77 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
 	return 0;
 }
 
+/*
+ * Add a memory region as a TDX memory block.  The caller must make sure
+ * all memory regions are added in address ascending order and don't
+ * overlap.
+ */
+static int add_tdx_memblock(struct list_head *tmb_list, unsigned long start_pfn,
+			    unsigned long end_pfn)
+{
+	struct tdx_memblock *tmb;
+
+	tmb = kmalloc(sizeof(*tmb), GFP_KERNEL);
+	if (!tmb)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&tmb->list);
+	tmb->start_pfn = start_pfn;
+	tmb->end_pfn = end_pfn;
+
+	list_add_tail(&tmb->list, tmb_list);
+	return 0;
+}
+
+static void free_tdx_memlist(struct list_head *tmb_list)
+{
+	while (!list_empty(tmb_list)) {
+		struct tdx_memblock *tmb = list_first_entry(tmb_list,
+				struct tdx_memblock, list);
+
+		list_del(&tmb->list);
+		kfree(tmb);
+	}
+}
+
+/*
+ * Ensure that all memblock memory regions are convertible to TDX
+ * memory.  Once this has been established, stash the memblock
+ * ranges off in a secondary structure because memblock is modified
+ * in memory hotplug while TDX memory regions are fixed.
+ */
+static int build_tdx_memlist(struct list_head *tmb_list)
+{
+	unsigned long start_pfn, end_pfn;
+	int i, ret;
+
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
+		/*
+		 * The first 1MB is not reported as TDX convertible memory.
+		 * Although the first 1MB is always reserved and won't end up
+		 * in the page allocator, it is still in memblock's memory
+		 * regions.  Skip them manually to exclude them as TDX memory.
+		 */
+		start_pfn = max(start_pfn, PHYS_PFN(SZ_1M));
+		if (start_pfn >= end_pfn)
+			continue;
+
+		/*
+		 * Add the memory regions as TDX memory.  The regions in
+		 * memblock have already been guaranteed to be in address
+		 * ascending order and don't overlap.
+		 */
+		ret = add_tdx_memblock(tmb_list, start_pfn, end_pfn);
+		if (ret)
+			goto err;
+	}
+
+	return 0;
+err:
+	free_tdx_memlist(tmb_list);
+	return ret;
+}
+
 static int init_tdx_module(void)
 {
 	/*
@@ -226,10 +364,25 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out;
 
+	/*
+	 * The initial support of TDX guests only allocates memory from
+	 * the global page allocator.  To keep things simple, just make
+	 * sure all pages in the page allocator are TDX memory.
+	 *
+	 * Build the list of "TDX-usable" memory regions which cover all
+	 * pages in the page allocator to guarantee that.  Do it while
+	 * holding mem_hotplug_lock read-lock as the memory hotplug code
+	 * path reads the @tdx_memlist to reject any new memory.
+	 */
+	get_online_mems();
+
+	ret = build_tdx_memlist(&tdx_memlist);
+	if (ret)
+		goto out;
+
 	/*
 	 * TODO:
 	 *
-	 *  - Build the list of TDX-usable memory regions.
 	 *  - Construct a list of TDMRs to cover all TDX-usable memory
 	 *    regions.
 	 *  - Pick up one TDX private KeyID as the global KeyID.
@@ -241,6 +394,11 @@ static int init_tdx_module(void)
 	 */
 	ret = -EINVAL;
 out:
+	/*
+	 * @tdx_memlist is written here and read at memory hotplug time.
+	 * Lock out memory hotplug code while building it.
+	 */
 	put_online_mems();
 	return ret;
 }