From patchwork Thu Dec  8 19:38:35 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Matlack <dmatlack@google.com>
X-Patchwork-Id: 13068786
Return-Path: 
 <linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from bombadil.infradead.org (bombadil.infradead.org
 [198.137.202.133])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.lore.kernel.org (Postfix) with ESMTPS id 51DB7C4332F
	for <linux-riscv@archiver.kernel.org>; Thu,  8 Dec 2022 19:44:53 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=lists.infradead.org; s=bombadil.20210309; h=Sender:
	Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post:
	List-Archive:List-Unsubscribe:List-Id:Cc:To:From:Subject:Message-ID:
	References:Mime-Version:In-Reply-To:Date:Reply-To:Content-ID:
	Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc
	:Resent-Message-ID:List-Owner;
	bh=uQWTxZyqVUyc4WGoXon/SVN9i5kxXD9ch/kBp7pUsSI=; b=0jW9rc7LDFwQg4G/8ZYM5iRJPU
	W7X1QErb1wheN8Ohu7k02tArjUsjkf8ft06b0hlw+XmxTO/owa3DI1ubP7LWk+G8/7c2xXd0XeCAB
	jY/ieuCIfhq8DsbY7nIluHnJlkiDDOUqBlPo5ZYJ9j0sGZnsS4AqlNphFEWPteuw+Fg4XKWAlecNq
	ULXrDzPs3zEQm2bTD650rNlNXT5VfIFK9g2JB1s1o3Gs6nKykK1NuM1/EXrZZy86ETeAOX3G+jahy
	mPOCFAcnLL27uwDJ+2sF5hR5enOgm0Z10lNgjAZxG6APLXw/38VPVocfYVDKrohEzDfDLz1IQDaIp
	4jOhAUQQ==;
Received: from localhost ([::1] helo=bombadil.infradead.org)
	by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux))
	id 1p3Mp1-009zdu-Fn; Thu, 08 Dec 2022 19:44:43 +0000
Received: from mail-yw1-x1149.google.com ([2607:f8b0:4864:20::1149])
	by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux))
	id 1p3Mox-009zW3-7N
	for linux-riscv@lists.infradead.org; Thu, 08 Dec 2022 19:44:42 +0000
Received: by mail-yw1-x1149.google.com with SMTP id
 00721157ae682-3d0465d32deso25149137b3.20
        for <linux-riscv@lists.infradead.org>;
 Thu, 08 Dec 2022 11:44:38 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
         :date:from:to:cc:subject:date:message-id:reply-to;
        bh=7lWdJkZrv9HoLXNU6i6MkXoxSJstT9R2oUrEFqSahss=;
        b=sg2xkDze7H4yV2lU/1/FpCU1cOSTt2gM7RO9hSQj+EGldzLsfkQyRpFJO1Dbgvo3W9
         PfOPbmNzoC3iMvnIaOngSfZ5LH9xb9dgFCEw9aEBhfSkfZa0p2LGFJNWmPsP0YUfDno9
         CQwSUJBX0kvzRAUmX2Me/KJgguCLZ+SfH2HWLEdPYDNzfjltbTnDARaVTFS5Gq443Bmi
         +QopmUV51BioUs/4b9KMy4gkmjkOujFVUjs6U8RTkqsv/HoGkTY5Ees6a9Iry1tmsH00
         Td9EkW5zbDJ4OlpNw/cjDC4i3Hi3OifKlsJcAx0502DGt1TlOCLOeHM0BRmhVJInz4qJ
         qvxA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
         :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=7lWdJkZrv9HoLXNU6i6MkXoxSJstT9R2oUrEFqSahss=;
        b=CbeWkAVt+/jNrbjYYDcEU+4PBv1XNf3ry6gU12KM7Nqa5OnMdMGcoMV9tdf+qMhjr2
         lq/+PPKoMqvCZ1f3lmdl4fBIopBncA0Rq0pVk+t4v/VPR1b6Yicp9wWAFYKtv/7c7lGC
         IvnkjPif2HJ5I1jLHL+2NiXqtTdHSSNPrvbFrKbWK1ZXulbYbZMTAr4gTUsSfH66A/It
         2HR+ZpP7+R8o3tp1qnIqzXRitIM+ChxV/T0WMza7RBxy4qhV2Zfo9AQuPOEb1quf+9LN
         8wcNCL4XSf3roycTe7C4P7JOElOTQfuBBm/F2Zmt+pjS+1PLv/OiQHXQ/Y4v50d9C42p
         WnoA==
X-Gm-Message-State: ANoB5pkBWhRGgKUqCGV/jRsrqkrE2MZCntN5iylcYR+UaHGIIwlyUM77
	pNy7IuJGVuUAPOHb7d0DwbZlEwqOc5ZeMg==
X-Google-Smtp-Source: 
 AA0mqf63KmdX+eJrD5nf7+2lHNPi/RXEFfkP5c0ill5ei/9dZE+wqXzEkU41M9xqISk+TkkYSOsMnafoSWus4A==
X-Received: from dmatlack-n2d-128.c.googlers.com
 ([fda3:e722:ac3:cc00:20:ed76:c0a8:1309])
 (user=dmatlack job=sendgmr) by 2002:a25:1884:0:b0:6ef:c66f:e0fa with SMTP id
 126-20020a251884000000b006efc66fe0famr68123962yby.3.1670528371737; Thu, 08
 Dec 2022 11:39:31 -0800 (PST)
Date: Thu,  8 Dec 2022 11:38:35 -0800
In-Reply-To: <20221208193857.4090582-1-dmatlack@google.com>
Mime-Version: 1.0
References: <20221208193857.4090582-1-dmatlack@google.com>
X-Mailer: git-send-email 2.39.0.rc1.256.g54fd8350bd-goog
Message-ID: <20221208193857.4090582-16-dmatlack@google.com>
Subject: [RFC PATCH 15/37] KVM: x86/mmu: Add a common API for
 inspecting/modifying TDP PTEs
From: David Matlack <dmatlack@google.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: Marc Zyngier <maz@kernel.org>, James Morse <james.morse@arm.com>,
	Alexandru Elisei <alexandru.elisei@arm.com>,
 Suzuki K Poulose <suzuki.poulose@arm.com>,
	Oliver Upton <oliver.upton@linux.dev>, Huacai Chen <chenhuacai@kernel.org>,
	Aleksandar Markovic <aleksandar.qemu.devel@gmail.com>,
 Anup Patel <anup@brainfault.org>,
	Atish Patra <atishp@atishpatra.org>,
 Paul Walmsley <paul.walmsley@sifive.com>,
	Palmer Dabbelt <palmer@dabbelt.com>, Albert Ou <aou@eecs.berkeley.edu>,
	Sean Christopherson <seanjc@google.com>,
 Andrew Morton <akpm@linux-foundation.org>,
	David Matlack <dmatlack@google.com>,
 Anshuman Khandual <anshuman.khandual@arm.com>,
	Nadav Amit <namit@vmware.com>,
 "Matthew Wilcox (Oracle)" <willy@infradead.org>,
 Vlastimil Babka <vbabka@suse.cz>,
	"Liam R. Howlett" <Liam.Howlett@Oracle.com>,
 Suren Baghdasaryan <surenb@google.com>,
	Peter Xu <peterx@redhat.com>, xu xin <cgel.zte@gmail.com>,
 Arnd Bergmann <arnd@arndb.de>,
	Yu Zhao <yuzhao@google.com>, Colin Cross <ccross@google.com>,
 Hugh Dickins <hughd@google.com>,
	Ben Gardon <bgardon@google.com>, Mingwei Zhang <mizhang@google.com>,
	Krish Sadhukhan <krish.sadhukhan@oracle.com>,
 Ricardo Koller <ricarkol@google.com>,
	Jing Zhang <jingzhangos@google.com>, linux-arm-kernel@lists.infradead.org,
	kvmarm@lists.linux.dev, kvmarm@lists.cs.columbia.edu,
	linux-mips@vger.kernel.org, kvm@vger.kernel.org,
	kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org
X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 
X-CRM114-CacheID: sfid-20221208_114439_295364_EB095902 
X-CRM114-Status: GOOD (  28.90  )
X-BeenThere: linux-riscv@lists.infradead.org
X-Mailman-Version: 2.1.34
Precedence: list
List-Id: <linux-riscv.lists.infradead.org>
List-Unsubscribe: <http://lists.infradead.org/mailman/options/linux-riscv>,
 <mailto:linux-riscv-request@lists.infradead.org?subject=unsubscribe>
List-Archive: <http://lists.infradead.org/pipermail/linux-riscv/>
List-Post: <mailto:linux-riscv@lists.infradead.org>
List-Help: <mailto:linux-riscv-request@lists.infradead.org?subject=help>
List-Subscribe: <http://lists.infradead.org/mailman/listinfo/linux-riscv>,
 <mailto:linux-riscv-request@lists.infradead.org?subject=subscribe>
Sender: "linux-riscv" <linux-riscv-bounces@lists.infradead.org>
Errors-To: 
 linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org

Introduce an API for inspecting and modifying TDP PTEs from common code.
This will be used in future commits to move the TDP MMU to common code.

Specifically, introduce the following API that can be used in common
code:

  /* Inspection API */
  tdp_pte_is_present()
  tdp_pte_is_writable()
  tdp_pte_is_huge()
  tdp_pte_is_leaf()
  tdp_pte_is_accessed()
  tdp_pte_is_dirty()
  tdp_pte_is_mmio()
  tdp_pte_is_volatile()
  tdp_pte_to_pfn()
  tdp_pte_check_leaf_invariants()

  /* Modification API */
  tdp_pte_clear_writable()
  tdp_pte_clear_mmu_writable()
  tdp_pte_clear_dirty()
  tdp_pte_clear_accessed()

Note that this does not cover constructing PTEs from scratch (e.g.
during page fault handling). This will be added in a subsequent commit.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/include/asm/kvm/tdp_pgtable.h |  58 +++++++++
 arch/x86/kvm/Makefile                  |   2 +-
 arch/x86/kvm/mmu/spte.c                |   3 +-
 arch/x86/kvm/mmu/spte.h                |  22 ----
 arch/x86/kvm/mmu/tdp_iter.c            |   4 +-
 arch/x86/kvm/mmu/tdp_iter.h            |   5 +-
 arch/x86/kvm/mmu/tdp_mmu.c             | 171 +++++++++++--------------
 arch/x86/kvm/mmu/tdp_pgtable.c         |  72 +++++++++++
 include/kvm/tdp_pgtable.h              |  18 +++
 9 files changed, 231 insertions(+), 124 deletions(-)
 create mode 100644 arch/x86/include/asm/kvm/tdp_pgtable.h
 create mode 100644 arch/x86/kvm/mmu/tdp_pgtable.c

diff --git a/arch/x86/include/asm/kvm/tdp_pgtable.h b/arch/x86/include/asm/kvm/tdp_pgtable.h
new file mode 100644
index 000000000000..cebc4bc44b49
--- /dev/null
+++ b/arch/x86/include/asm/kvm/tdp_pgtable.h
@@ -0,0 +1,58 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_KVM_TDP_PGTABLE_H
+#define __ASM_KVM_TDP_PGTABLE_H
+
+#include <linux/types.h>
+#include <linux/kvm_types.h>
+
+/*
+ * Use a semi-arbitrary value that doesn't set RWX bits, i.e. is not-present on
+ * both AMD and Intel CPUs, and doesn't set PFN bits, i.e. doesn't create a L1TF
+ * vulnerability.  Use only low bits to avoid 64-bit immediates.
+ */
+#define REMOVED_TDP_PTE		0x5a0ULL
+
+#define TDP_PTE_WRITABLE_MASK	BIT_ULL(1)
+#define TDP_PTE_HUGE_PAGE_MASK	BIT_ULL(7)
+#define TDP_PTE_PRESENT_MASK	BIT_ULL(11)
+
+static inline bool tdp_pte_is_writable(u64 pte)
+{
+	return pte & TDP_PTE_WRITABLE_MASK;
+}
+
+static inline bool tdp_pte_is_huge(u64 pte)
+{
+	return pte & TDP_PTE_HUGE_PAGE_MASK;
+}
+
+static inline bool tdp_pte_is_present(u64 pte)
+{
+	return pte & TDP_PTE_PRESENT_MASK;
+}
+
+bool tdp_pte_is_accessed(u64 pte);
+bool tdp_pte_is_dirty(u64 pte);
+bool tdp_pte_is_mmio(u64 pte);
+bool tdp_pte_is_volatile(u64 pte);
+
+static inline u64 tdp_pte_clear_writable(u64 pte)
+{
+	return pte & ~TDP_PTE_WRITABLE_MASK;
+}
+
+static inline u64 tdp_pte_clear_mmu_writable(u64 pte)
+{
+	extern u64 __read_mostly shadow_mmu_writable_mask;
+
+	return pte & ~(TDP_PTE_WRITABLE_MASK | shadow_mmu_writable_mask);
+}
+
+u64 tdp_pte_clear_dirty(u64 pte, bool force_wrprot);
+u64 tdp_pte_clear_accessed(u64 pte);
+
+kvm_pfn_t tdp_pte_to_pfn(u64 pte);
+
+void tdp_pte_check_leaf_invariants(u64 pte);
+
+#endif /* !__ASM_KVM_TDP_PGTABLE_H */
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 80e3fe184d17..c294ae51caba 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -18,7 +18,7 @@ ifdef CONFIG_HYPERV
 kvm-y			+= kvm_onhyperv.o
 endif
 
-kvm-$(CONFIG_X86_64) += mmu/tdp_iter.o mmu/tdp_mmu.o
+kvm-$(CONFIG_X86_64) += mmu/tdp_pgtable.o mmu/tdp_iter.o mmu/tdp_mmu.o
 kvm-$(CONFIG_KVM_XEN)	+= xen.o
 kvm-$(CONFIG_KVM_SMM)	+= smm.o
 
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index fe4b626cb431..493e109f1105 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -10,6 +10,7 @@
 
 
 #include <linux/kvm_host.h>
+#include <kvm/tdp_pgtable.h>
 #include "mmu.h"
 #include "mmu_internal.h"
 #include "x86.h"
@@ -401,7 +402,7 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
 	 * not set any RWX bits.
 	 */
 	if (WARN_ON((mmio_value & mmio_mask) != mmio_value) ||
-	    WARN_ON(mmio_value && (REMOVED_SPTE & mmio_mask) == mmio_value))
+	    WARN_ON(mmio_value && (REMOVED_TDP_PTE & mmio_mask) == mmio_value))
 		mmio_value = 0;
 
 	if (!mmio_value)
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 4c5d518e3ac6..a1b7d7730583 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -183,28 +183,6 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
  */
 #define SHADOW_NONPRESENT_OR_RSVD_MASK_LEN 5
 
-/*
- * If a thread running without exclusive control of the MMU lock must perform a
- * multi-part operation on an SPTE, it can set the SPTE to REMOVED_SPTE as a
- * non-present intermediate value. Other threads which encounter this value
- * should not modify the SPTE.
- *
- * Use a semi-arbitrary value that doesn't set RWX bits, i.e. is not-present on
- * both AMD and Intel CPUs, and doesn't set PFN bits, i.e. doesn't create a L1TF
- * vulnerability.  Use only low bits to avoid 64-bit immediates.
- *
- * Only used by the TDP MMU.
- */
-#define REMOVED_SPTE	0x5a0ULL
-
-/* Removed SPTEs must not be misconstrued as shadow present PTEs. */
-static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
-
-static inline bool is_removed_spte(u64 spte)
-{
-	return spte == REMOVED_SPTE;
-}
-
 /* Get an SPTE's index into its parent's page table (and the spt array). */
 static inline int spte_index(u64 *sptep)
 {
diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index d6328dac9cd3..d5f024b7f6e4 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -69,10 +69,10 @@ tdp_ptep_t spte_to_child_pt(u64 spte, int level)
 	 * There's no child entry if this entry isn't present or is a
 	 * last-level entry.
 	 */
-	if (!is_shadow_present_pte(spte) || is_last_spte(spte, level))
+	if (!tdp_pte_is_present(spte) || tdp_pte_is_leaf(spte, level))
 		return NULL;
 
-	return (tdp_ptep_t)__va(spte_to_pfn(spte) << PAGE_SHIFT);
+	return (tdp_ptep_t)__va(tdp_pte_to_pfn(spte) << PAGE_SHIFT);
 }
 
 /*
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index bfac83ab52db..6e3c38532d1d 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -45,8 +45,9 @@ static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
 	 * logic needs to be reassessed if KVM were to use non-leaf Accessed
 	 * bits, e.g. to skip stepping down into child SPTEs when aging SPTEs.
 	 */
-	if (is_shadow_present_pte(old_spte) && is_last_spte(old_spte, level) &&
-	    spte_has_volatile_bits(old_spte))
+	if (tdp_pte_is_present(old_spte) &&
+	    tdp_pte_is_leaf(old_spte, level) &&
+	    tdp_pte_is_volatile(old_spte))
 		return kvm_tdp_mmu_write_spte_atomic(sptep, new_spte);
 
 	__kvm_tdp_mmu_write_spte(sptep, new_spte);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a6d6e393c009..fea42bbac984 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -334,13 +334,13 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 
 static void handle_changed_spte_acc_track(u64 old_spte, u64 new_spte, int level)
 {
-	if (!is_shadow_present_pte(old_spte) || !is_last_spte(old_spte, level))
+	if (!tdp_pte_is_present(old_spte) || !tdp_pte_is_leaf(old_spte, level))
 		return;
 
-	if (is_accessed_spte(old_spte) &&
-	    (!is_shadow_present_pte(new_spte) || !is_accessed_spte(new_spte) ||
-	     spte_to_pfn(old_spte) != spte_to_pfn(new_spte)))
-		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+	if (tdp_pte_is_accessed(old_spte) &&
+	    (!tdp_pte_is_present(new_spte) || !tdp_pte_is_accessed(new_spte) ||
+	     tdp_pte_to_pfn(old_spte) != tdp_pte_to_pfn(new_spte)))
+		kvm_set_pfn_accessed(tdp_pte_to_pfn(old_spte));
 }
 
 static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
@@ -352,10 +352,10 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
 	if (level > PG_LEVEL_PTE)
 		return;
 
-	pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+	pfn_changed = tdp_pte_to_pfn(old_spte) != tdp_pte_to_pfn(new_spte);
 
-	if ((!is_writable_pte(old_spte) || pfn_changed) &&
-	    is_writable_pte(new_spte)) {
+	if ((!tdp_pte_is_writable(old_spte) || pfn_changed) &&
+	    tdp_pte_is_writable(new_spte)) {
 		slot = __gfn_to_memslot(__kvm_memslots(kvm, as_id), gfn);
 		mark_page_dirty_in_slot(kvm, slot, gfn);
 	}
@@ -445,8 +445,8 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 			 * value to the removed SPTE value.
 			 */
 			for (;;) {
-				old_spte = kvm_tdp_mmu_write_spte_atomic(sptep, REMOVED_SPTE);
-				if (!is_removed_spte(old_spte))
+				old_spte = kvm_tdp_mmu_write_spte_atomic(sptep, REMOVED_TDP_PTE);
+				if (!tdp_pte_is_removed(old_spte))
 					break;
 				cpu_relax();
 			}
@@ -461,7 +461,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 			 * unreachable.
 			 */
 			old_spte = kvm_tdp_mmu_read_spte(sptep);
-			if (!is_shadow_present_pte(old_spte))
+			if (!tdp_pte_is_present(old_spte))
 				continue;
 
 			/*
@@ -481,7 +481,8 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 			 * strictly necessary for the same reason, but using
 			 * the remove SPTE value keeps the shared/exclusive
 			 * paths consistent and allows the handle_changed_spte()
-			 * call below to hardcode the new value to REMOVED_SPTE.
+			 * call below to hardcode the new value to
+			 * REMOVED_TDP_PTE.
 			 *
 			 * Note, even though dropping a Dirty bit is the only
 			 * scenario where a non-atomic update could result in a
@@ -493,10 +494,11 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 			 * it here.
 			 */
 			old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte,
-							  REMOVED_SPTE, level);
+							  REMOVED_TDP_PTE,
+							  level);
 		}
 		handle_changed_spte(kvm, sp->role.as_id, gfn, old_spte,
-				    REMOVED_SPTE, level, shared);
+				    REMOVED_TDP_PTE, level, shared);
 	}
 
 	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
@@ -521,11 +523,11 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 				  u64 old_spte, u64 new_spte, int level,
 				  bool shared)
 {
-	bool was_present = is_shadow_present_pte(old_spte);
-	bool is_present = is_shadow_present_pte(new_spte);
-	bool was_leaf = was_present && is_last_spte(old_spte, level);
-	bool is_leaf = is_present && is_last_spte(new_spte, level);
-	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+	bool was_present = tdp_pte_is_present(old_spte);
+	bool is_present = tdp_pte_is_present(new_spte);
+	bool was_leaf = was_present && tdp_pte_is_leaf(old_spte, level);
+	bool is_leaf = is_present && tdp_pte_is_leaf(new_spte, level);
+	bool pfn_changed = tdp_pte_to_pfn(old_spte) != tdp_pte_to_pfn(new_spte);
 
 	WARN_ON(level > TDP_ROOT_MAX_LEVEL);
 	WARN_ON(level < PG_LEVEL_PTE);
@@ -560,7 +562,7 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
 
 	if (is_leaf)
-		check_spte_writable_invariants(new_spte);
+		tdp_pte_check_leaf_invariants(new_spte);
 
 	/*
 	 * The only times a SPTE should be changed from a non-present to
@@ -574,9 +576,9 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 		 * impact the guest since both the former and current SPTEs
 		 * are nonpresent.
 		 */
-		if (WARN_ON(!is_mmio_spte(old_spte) &&
-			    !is_mmio_spte(new_spte) &&
-			    !is_removed_spte(new_spte)))
+		if (WARN_ON(!tdp_pte_is_mmio(old_spte) &&
+			    !tdp_pte_is_mmio(new_spte) &&
+			    !tdp_pte_is_removed(new_spte)))
 			pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
 			       "should not be replaced with another,\n"
 			       "different nonpresent SPTE, unless one or both\n"
@@ -590,9 +592,9 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	if (is_leaf != was_leaf)
 		kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
 
-	if (was_leaf && is_dirty_spte(old_spte) &&
-	    (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
-		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+	if (was_leaf && tdp_pte_is_dirty(old_spte) &&
+	    (!is_present || !tdp_pte_is_dirty(new_spte) || pfn_changed))
+		kvm_set_pfn_dirty(tdp_pte_to_pfn(old_spte));
 
 	/*
 	 * Recursively handle child PTs if the change removed a subtree from
@@ -645,7 +647,7 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
 	 * and pre-checking before inserting a new SPTE is advantageous as it
 	 * avoids unnecessary work.
 	 */
-	WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte));
+	WARN_ON_ONCE(iter->yielded || tdp_pte_is_removed(iter->old_spte));
 
 	lockdep_assert_held_read(&kvm->mmu_lock);
 
@@ -674,7 +676,7 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 	 * immediately installing a present entry in its place
 	 * before the TLBs are flushed.
 	 */
-	ret = tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE);
+	ret = tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_TDP_PTE);
 	if (ret)
 		return ret;
 
@@ -730,7 +732,7 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 	 * should be used. If operating under the MMU lock in write mode, the
 	 * use of the removed SPTE should not be necessary.
 	 */
-	WARN_ON(is_removed_spte(old_spte) || is_removed_spte(new_spte));
+	WARN_ON(tdp_pte_is_removed(old_spte) || tdp_pte_is_removed(new_spte));
 
 	old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
 
@@ -781,8 +783,8 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
 
 #define tdp_root_for_each_leaf_pte(_iter, _root, _start, _end)	\
 	tdp_root_for_each_pte(_iter, _root, _start, _end)		\
-		if (!is_shadow_present_pte(_iter.old_spte) ||		\
-		    !is_last_spte(_iter.old_spte, _iter.level))		\
+		if (!tdp_pte_is_present(_iter.old_spte) ||		\
+		    !tdp_pte_is_leaf(_iter.old_spte, _iter.level))		\
 			continue;					\
 		else
 
@@ -858,7 +860,7 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
 			continue;
 
-		if (!is_shadow_present_pte(iter.old_spte))
+		if (!tdp_pte_is_present(iter.old_spte))
 			continue;
 
 		if (iter.level > zap_level)
@@ -919,7 +921,7 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 		return false;
 
 	old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
-	if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
+	if (WARN_ON_ONCE(!tdp_pte_is_present(old_spte)))
 		return false;
 
 	__tdp_mmu_set_spte(kvm, sp->role.as_id, sp->ptep, old_spte, 0,
@@ -953,8 +955,8 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
 			continue;
 		}
 
-		if (!is_shadow_present_pte(iter.old_spte) ||
-		    !is_last_spte(iter.old_spte, iter.level))
+		if (!tdp_pte_is_present(iter.old_spte) ||
+		    !tdp_pte_is_leaf(iter.old_spte, iter.level))
 			continue;
 
 		tdp_mmu_set_spte(kvm, &iter, 0);
@@ -1074,8 +1076,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 		ret = RET_PF_SPURIOUS;
 	else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
 		return RET_PF_RETRY;
-	else if (is_shadow_present_pte(iter->old_spte) &&
-		 !is_last_spte(iter->old_spte, iter->level))
+	else if (tdp_pte_is_present(iter->old_spte) &&
+		 !tdp_pte_is_leaf(iter->old_spte, iter->level))
 		kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
 						   TDP_PAGES_PER_LEVEL(iter->level + 1));
 
@@ -1090,7 +1092,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 	}
 
 	/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
-	if (unlikely(is_mmio_spte(new_spte))) {
+	if (unlikely(tdp_pte_is_mmio(new_spte))) {
 		vcpu->stat.pf_mmio_spte_created++;
 		trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
 				     new_spte);
@@ -1168,12 +1170,12 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		 * If SPTE has been frozen by another thread, just give up and
 		 * retry, avoiding unnecessary page table allocation and free.
 		 */
-		if (is_removed_spte(iter.old_spte))
+		if (tdp_pte_is_removed(iter.old_spte))
 			goto retry;
 
 		/* Step down into the lower level page table if it exists. */
-		if (is_shadow_present_pte(iter.old_spte) &&
-		    !is_large_pte(iter.old_spte))
+		if (tdp_pte_is_present(iter.old_spte) &&
+		    !tdp_pte_is_huge(iter.old_spte))
 			continue;
 
 		/*
@@ -1185,7 +1187,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 
 		sp->arch.nx_huge_page_disallowed = fault->arch.huge_page_disallowed;
 
-		if (is_shadow_present_pte(iter.old_spte))
+		if (tdp_pte_is_present(iter.old_spte))
 			r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
 		else
 			r = tdp_mmu_link_sp(kvm, &iter, sp, true);
@@ -1207,6 +1209,15 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		}
 	}
 
+	/*
+	 * Force the guest to retry the access if the upper level SPTEs aren't
+	 * in place, or if the target leaf SPTE is frozen by another CPU.
+	 */
+	if (iter.level != fault->goal_level || tdp_pte_is_removed(iter.old_spte)) {
+		rcu_read_unlock();
+		return RET_PF_RETRY;
+	}
+
 	ret = tdp_mmu_map_handle_target_level(vcpu, fault, &iter);
 
 retry:
@@ -1255,27 +1266,13 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
 static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
 			  struct kvm_gfn_range *range)
 {
-	u64 new_spte = 0;
+	u64 new_spte;
 
 	/* If we have a non-accessed entry we don't need to change the pte. */
-	if (!is_accessed_spte(iter->old_spte))
+	if (!tdp_pte_is_accessed(iter->old_spte))
 		return false;
 
-	new_spte = iter->old_spte;
-
-	if (spte_ad_enabled(new_spte)) {
-		new_spte &= ~shadow_accessed_mask;
-	} else {
-		/*
-		 * Capture the dirty status of the page, so that it doesn't get
-		 * lost when the SPTE is marked for access tracking.
-		 */
-		if (is_writable_pte(new_spte))
-			kvm_set_pfn_dirty(spte_to_pfn(new_spte));
-
-		new_spte = mark_spte_for_access_track(new_spte);
-	}
-
+	new_spte = tdp_pte_clear_accessed(iter->old_spte);
 	tdp_mmu_set_spte_no_acc_track(kvm, iter, new_spte);
 
 	return true;
@@ -1289,7 +1286,7 @@ bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,
 			 struct kvm_gfn_range *range)
 {
-	return is_accessed_spte(iter->old_spte);
+	return tdp_pte_is_accessed(iter->old_spte);
 }
 
 bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
@@ -1306,7 +1303,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
 	WARN_ON(pte_huge(range->pte) || range->start + 1 != range->end);
 
 	if (iter->level != PG_LEVEL_PTE ||
-	    !is_shadow_present_pte(iter->old_spte))
+	    !tdp_pte_is_present(iter->old_spte))
 		return false;
 
 	/*
@@ -1364,12 +1361,12 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
 			continue;
 
-		if (!is_shadow_present_pte(iter.old_spte) ||
-		    !is_last_spte(iter.old_spte, iter.level) ||
-		    !(iter.old_spte & PT_WRITABLE_MASK))
+		if (!tdp_pte_is_present(iter.old_spte) ||
+		    !tdp_pte_is_leaf(iter.old_spte, iter.level) ||
+		    !tdp_pte_is_writable(iter.old_spte))
 			continue;
 
-		new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
+		new_spte = tdp_pte_clear_writable(iter.old_spte);
 
 		if (tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
 			goto retry;
@@ -1525,7 +1522,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
 			continue;
 
-		if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
+		if (!tdp_pte_is_present(iter.old_spte) || !tdp_pte_is_huge(iter.old_spte))
 			continue;
 
 		if (!sp) {
@@ -1607,20 +1604,12 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
 			continue;
 
-		if (!is_shadow_present_pte(iter.old_spte))
+		if (!tdp_pte_is_present(iter.old_spte))
 			continue;
 
-		if (spte_ad_need_write_protect(iter.old_spte)) {
-			if (is_writable_pte(iter.old_spte))
-				new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
-			else
-				continue;
-		} else {
-			if (iter.old_spte & shadow_dirty_mask)
-				new_spte = iter.old_spte & ~shadow_dirty_mask;
-			else
-				continue;
-		}
+		new_spte = tdp_pte_clear_dirty(iter.old_spte, false);
+		if (new_spte == iter.old_spte)
+			continue;
 
 		if (tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
 			goto retry;
@@ -1680,17 +1669,9 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 
 		mask &= ~(1UL << (iter.gfn - gfn));
 
-		if (wrprot || spte_ad_need_write_protect(iter.old_spte)) {
-			if (is_writable_pte(iter.old_spte))
-				new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
-			else
-				continue;
-		} else {
-			if (iter.old_spte & shadow_dirty_mask)
-				new_spte = iter.old_spte & ~shadow_dirty_mask;
-			else
-				continue;
-		}
+		new_spte = tdp_pte_clear_dirty(iter.old_spte, wrprot);
+		if (new_spte == iter.old_spte)
+			continue;
 
 		tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
 	}
@@ -1734,7 +1715,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 			continue;
 
 		if (iter.level > TDP_MAX_HUGEPAGE_LEVEL ||
-		    !is_shadow_present_pte(iter.old_spte))
+		    !tdp_pte_is_present(iter.old_spte))
 			continue;
 
 		/*
@@ -1742,7 +1723,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 		 * a large page size, then its parent would have been zapped
 		 * instead of stepping down.
 		 */
-		if (is_last_spte(iter.old_spte, iter.level))
+		if (tdp_pte_is_leaf(iter.old_spte, iter.level))
 			continue;
 
 		/*
@@ -1800,13 +1781,11 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
 	rcu_read_lock();
 
 	for_each_tdp_pte_min_level(iter, root, min_level, gfn, gfn + 1) {
-		if (!is_shadow_present_pte(iter.old_spte) ||
-		    !is_last_spte(iter.old_spte, iter.level))
+		if (!tdp_pte_is_present(iter.old_spte) ||
+		    !tdp_pte_is_leaf(iter.old_spte, iter.level))
 			continue;
 
-		new_spte = iter.old_spte &
-			~(PT_WRITABLE_MASK | shadow_mmu_writable_mask);
-
+		new_spte = tdp_pte_clear_mmu_writable(iter.old_spte);
 		if (new_spte == iter.old_spte)
 			break;
 
diff --git a/arch/x86/kvm/mmu/tdp_pgtable.c b/arch/x86/kvm/mmu/tdp_pgtable.c
new file mode 100644
index 000000000000..cf3b692d8e21
--- /dev/null
+++ b/arch/x86/kvm/mmu/tdp_pgtable.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/kvm_types.h>
+#include <kvm/tdp_pgtable.h>
+
+#include "mmu.h"
+#include "spte.h"
+
+/* Removed SPTEs must not be misconstrued as shadow present PTEs. */
+static_assert(!(REMOVED_TDP_PTE & SPTE_MMU_PRESENT_MASK));
+
+static_assert(TDP_PTE_WRITABLE_MASK == PT_WRITABLE_MASK);
+static_assert(TDP_PTE_HUGE_PAGE_MASK == PT_PAGE_SIZE_MASK);
+static_assert(TDP_PTE_PRESENT_MASK == SPTE_MMU_PRESENT_MASK);
+
+bool tdp_pte_is_accessed(u64 pte)
+{
+	return is_accessed_spte(pte);
+}
+
+bool tdp_pte_is_dirty(u64 pte)
+{
+	return is_dirty_spte(pte);
+}
+
+bool tdp_pte_is_mmio(u64 pte)
+{
+	return is_mmio_spte(pte);
+}
+
+bool tdp_pte_is_volatile(u64 pte)
+{
+	return spte_has_volatile_bits(pte);
+}
+
+u64 tdp_pte_clear_dirty(u64 pte, bool force_wrprot)
+{
+	if (force_wrprot || spte_ad_need_write_protect(pte)) {
+		if (tdp_pte_is_writable(pte))
+			pte &= ~PT_WRITABLE_MASK;
+	} else if (pte & shadow_dirty_mask) {
+		pte &= ~shadow_dirty_mask;
+	}
+
+	return pte;
+}
+
+u64 tdp_pte_clear_accessed(u64 old_spte)
+{
+	if (spte_ad_enabled(old_spte))
+		return old_spte & ~shadow_accessed_mask;
+
+	/*
+	 * Capture the dirty status of the page, so that it doesn't get lost
+	 * when the SPTE is marked for access tracking.
+	 */
+	if (tdp_pte_is_writable(old_spte))
+		kvm_set_pfn_dirty(tdp_pte_to_pfn(old_spte));
+
+	return mark_spte_for_access_track(old_spte);
+}
+
+kvm_pfn_t tdp_pte_to_pfn(u64 pte)
+{
+	return spte_to_pfn(pte);
+}
+
+void tdp_pte_check_leaf_invariants(u64 pte)
+{
+	check_spte_writable_invariants(pte);
+}
+
diff --git a/include/kvm/tdp_pgtable.h b/include/kvm/tdp_pgtable.h
index 968be8d92350..a24c45ac7765 100644
--- a/include/kvm/tdp_pgtable.h
+++ b/include/kvm/tdp_pgtable.h
@@ -5,6 +5,8 @@
 #include <linux/log2.h>
 #include <linux/mm_types.h>
 
+#include <asm/kvm/tdp_pgtable.h>
+
 #define TDP_ROOT_MAX_LEVEL	5
 #define TDP_MAX_HUGEPAGE_LEVEL	PG_LEVEL_PUD
 #define TDP_PTES_PER_PAGE	(PAGE_SIZE / sizeof(u64))
@@ -18,4 +20,20 @@
 #define TDP_PTE_INDEX(gfn, level) \
 	(((gfn) >> TDP_LEVEL_SHIFT(level)) & TDP_LEVEL_MASK)
 
+/*
+ * If a thread running without exclusive control of the MMU lock must perform a
+ * multi-part operation on a PTE, it can set the PTE to REMOVED_TDP_PTE as a
+ * non-present intermediate value. Other threads which encounter this value
+ * should not modify the PTE.
+ */
+static inline bool tdp_pte_is_removed(u64 pte)
+{
+	return pte == REMOVED_TDP_PTE;
+}
+
+static inline bool tdp_pte_is_leaf(u64 pte, int level)
+{
+	return tdp_pte_is_huge(pte) || level == PG_LEVEL_PTE;
+}
+
 #endif /* !__KVM_TDP_PGTABLE_H */