
[RFC,07/73] KVM: x86/mmu: Adapt shadow MMU for PVM

Message ID 20240226143630.33643-8-jiangshanlai@gmail.com
State New, archived
Series KVM: x86/PVM: Introduce a new hypervisor

Commit Message

Lai Jiangshan Feb. 26, 2024, 2:35 p.m. UTC
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

In PVM, the shadow MMU is used for guest MMU virtualization. However, some
changes are needed to adapt it for PVM:

1. In PVM, the hardware CR4.LA57 setting is not changed, so the paging level
   of the shadow MMU should be the same as the host's. If the guest paging
   level is 4 and the host paging level is 5, the shadow MMU behaves like a
   shadow NPT MMU and 'root_role.passthrough' is set to true.

2. A PVM guest needs to access the host switcher, so some host PGD entries
   are cloned into the guest shadow page table when the root SP is
   allocated. These cloned host PGD entries are not marked MMU-present, so
   they cannot be cleared by write protection. Additionally, to avoid
   modifying those cloned host PGD entries in the #PF handling path, a new
   callback is introduced to check the faulting guest virtual address
   before walking the guest page table (see the sketch after this list).
   This ensures that the guest cannot overwrite the host entries in the
   root SP.

3. If the guest paging level is 4 and the host paging level is 5, the last
   PGD entry in the root SP is allowed to be overwritten when the guest
   tries to build a new allowed mapping under that PGD entry. In this case,
   the host P4D entries in the table pointed to by the last PGD entry must
   also be cloned when the new P4D SP is allocated. These cloned P4D
   entries are likewise not marked MMU-present. A new bit in
   'kvm_mmu_page_role' is used to mark this special SP. When zapping this
   SP, its parent PTE is restored to the original host PGD entry instead of
   being cleared.

4. The user bit in the SPTE of a guest mapping is forced to be set for PVM,
   as the guest always runs in hardware CPL3.

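For reference, this patch only adds the ->disallowed_va() hook used in
point 2; the PVM implementation itself comes later in the series. A minimal
sketch of what it might look like, with placeholder range macros, is:

/*
 * Sketch only (not part of this patch): a possible PVM implementation of
 * the new ->disallowed_va() hook.  PVM_GUEST_ALLOWED_VA_START/END are
 * placeholders for the guest-allowed virtual address range.
 */
static bool pvm_disallowed_va(struct kvm_vcpu *vcpu, u64 la)
{
	/*
	 * Fail the walk for any guest virtual address outside the allowed
	 * range, so the #PF path never builds a mapping under the cloned
	 * host PGD entries in the root SP.
	 */
	return la < PVM_GUEST_ALLOWED_VA_START || la >= PVM_GUEST_ALLOWED_VA_END;
}

Likewise, kvm->arch.host_mmu_root_pgd is only consumed here; a rough sketch
of how a PVM module might wire both up (all names below are placeholders) is:

static int pvm_vm_init(struct kvm *kvm)
{
	/*
	 * Point the shadow MMU at the page holding the host PGD entries
	 * that kvm_mmu_alloc_shadow_page() clones into every root SP.
	 */
	kvm->arch.host_mmu_root_pgd = pvm_host_pgd_page;
	return 0;
}

static struct kvm_x86_ops pvm_x86_ops __initdata = {
	/* ... other callbacks ... */
	.vm_init	= pvm_vm_init,
	.disallowed_va	= pvm_disallowed_va,
};
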
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  6 ++++-
 arch/x86/kvm/mmu/mmu.c             | 35 +++++++++++++++++++++++++++++-
 arch/x86/kvm/mmu/paging_tmpl.h     |  3 +++
 arch/x86/kvm/mmu/spte.c            |  4 ++++
 5 files changed, 47 insertions(+), 2 deletions(-)

Patch

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 26b628d84594..32e5473b499d 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -93,6 +93,7 @@  KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
 KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
 KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
 KVM_X86_OP(load_mmu_pgd)
+KVM_X86_OP_OPTIONAL_RET0(disallowed_va)
 KVM_X86_OP(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
 KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d7036982332e..c76bafe9c7e2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -346,7 +346,8 @@  union kvm_mmu_page_role {
 		unsigned ad_disabled:1;
 		unsigned guest_mode:1;
 		unsigned passthrough:1;
-		unsigned :5;
+		unsigned host_mmu_la57_top_p4d:1;
+		unsigned :4;
 
 		/*
 		 * This is left at the top of the word so that
@@ -1429,6 +1430,7 @@  struct kvm_arch {
 	 * the thread holds the MMU lock in write mode.
 	 */
 	spinlock_t tdp_mmu_pages_lock;
+	u64 *host_mmu_root_pgd;
 #endif /* CONFIG_X86_64 */
 
 	/*
@@ -1679,6 +1681,8 @@  struct kvm_x86_ops {
 	void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			     int root_level);
 
+	bool (*disallowed_va)(struct kvm_vcpu *vcpu, u64 la);
+
 	bool (*has_wbinvd_exit)(void);
 
 	u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c57e181bba21..80406666d7da 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1745,6 +1745,18 @@  static unsigned kvm_page_table_hashfn(gfn_t gfn)
 	return hash_64(gfn, KVM_MMU_HASH_SHIFT);
 }
 
+#define HOST_ROOT_LEVEL (pgtable_l5_enabled() ? PT64_ROOT_5LEVEL : PT64_ROOT_4LEVEL)
+
+static inline bool pvm_mmu_p4d_at_la57_pgd511(struct kvm *kvm, u64 *sptep)
+{
+	if (!pgtable_l5_enabled())
+		return false;
+	if (!kvm->arch.host_mmu_root_pgd)
+		return false;
+
+	return sptep_to_sp(sptep)->role.level == 5 && spte_index(sptep) == 511;
+}
+
 static void mmu_page_add_parent_pte(struct kvm_mmu_memory_cache *cache,
 				    struct kvm_mmu_page *sp, u64 *parent_pte)
 {
@@ -1764,7 +1776,10 @@  static void drop_parent_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 			    u64 *parent_pte)
 {
 	mmu_page_remove_parent_pte(kvm, sp, parent_pte);
-	mmu_spte_clear_no_track(parent_pte);
+	if (!unlikely(sp->role.host_mmu_la57_top_p4d))
+		mmu_spte_clear_no_track(parent_pte);
+	else
+		__update_clear_spte_fast(parent_pte, kvm->arch.host_mmu_root_pgd[511]);
 }
 
 static void mark_unsync(u64 *spte);
@@ -2253,6 +2268,15 @@  static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
 	list_add(&sp->link, &kvm->arch.active_mmu_pages);
 	kvm_account_mmu_page(kvm, sp);
 
+	/* install host mmu entries when PVM */
+	if (kvm->arch.host_mmu_root_pgd && role.level == HOST_ROOT_LEVEL) {
+		memcpy(sp->spt, kvm->arch.host_mmu_root_pgd, PAGE_SIZE);
+	} else if (role.host_mmu_la57_top_p4d) {
+		u64 *p4d = __va(kvm->arch.host_mmu_root_pgd[511] & SPTE_BASE_ADDR_MASK);
+
+		memcpy(sp->spt, p4d, PAGE_SIZE);
+	}
+
 	sp->gfn = gfn;
 	sp->role = role;
 	hlist_add_head(&sp->hash_link, sp_list);
@@ -2354,6 +2378,9 @@  static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
 		return ERR_PTR(-EEXIST);
 
 	role = kvm_mmu_child_role(sptep, direct, access);
+	if (unlikely(pvm_mmu_p4d_at_la57_pgd511(vcpu->kvm, sptep)))
+		role.host_mmu_la57_top_p4d = 1;
+
 	return kvm_mmu_get_shadow_page(vcpu, gfn, role);
 }
 
@@ -5271,6 +5298,12 @@  static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
 	/* KVM uses PAE paging whenever the guest isn't using 64-bit paging. */
 	root_role.level = max_t(u32, root_role.level, PT32E_ROOT_LEVEL);
 
+	/* Shadow MMU level should be the same as host for PVM */
+	if (vcpu->kvm->arch.host_mmu_root_pgd && root_role.level != HOST_ROOT_LEVEL) {
+		root_role.level = HOST_ROOT_LEVEL;
+		root_role.passthrough = 1;
+	}
+
 	/*
 	 * KVM forces EFER.NX=1 when TDP is disabled, reflect it in the MMU role.
 	 * KVM uses NX when TDP is disabled to handle a variety of scenarios,
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index c85255073f67..8ea3dca940ad 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -336,6 +336,9 @@  static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 			goto error;
 		--walker->level;
 	}
+
+	if (static_call(kvm_x86_disallowed_va)(vcpu, addr))
+		goto error;
 #endif
 	walker->max_level = walker->level;
 
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 4a599130e9c9..e302f7b5c696 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -186,6 +186,10 @@  bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	if (pte_access & ACC_USER_MASK)
 		spte |= shadow_user_mask;
 
+	/* PVM guest is always running in hardware CPL3. */
+	if (vcpu->kvm->arch.host_mmu_root_pgd)
+		spte |= shadow_user_mask;
+
 	if (level > PG_LEVEL_4K)
 		spte |= PT_PAGE_SIZE_MASK;