[v11,8/8] KVM: x86/mmu: Handle non-refcounted pages

Message ID 20240229025759.1187910-9-stevensd@google.com (mailing list archive)
State New, archived
Series KVM: allow mapping non-refcounted pages

Commit Message

David Stevens Feb. 29, 2024, 2:57 a.m. UTC
From: David Stevens <stevensd@chromium.org>

Handle non-refcounted pages in __kvm_faultin_pfn. This allows the
host to map memory into the guest that is backed by non-refcounted
struct pages - for example, the tail pages of higher order non-compound
pages allocated by the amdgpu driver via ttm_pool_alloc_page.
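
For example (an illustrative snippet, not code from this series), an
order-2 allocation made without __GFP_COMP has a refcount only on its
head page:

	struct page *head = alloc_pages(GFP_KERNEL, 2); /* no __GFP_COMP */
	struct page *tail = head + 1;

	/*
	 * page_count(head) == 1, but page_count(tail) == 0. Taking a
	 * reference on such a tail page (e.g. via get_user_pages) is a
	 * refcounting bug, so KVM must map these pfns without grabbing
	 * a reference.
	 */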

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 arch/x86/kvm/mmu/mmu.c          | 24 +++++++++++++++++-------
 arch/x86/kvm/mmu/mmu_internal.h |  2 ++
 arch/x86/kvm/mmu/paging_tmpl.h  |  2 +-
 arch/x86/kvm/mmu/tdp_mmu.c      |  3 ++-
 include/linux/kvm_host.h        |  6 ++++--
 virt/kvm/guest_memfd.c          |  8 ++++----
 virt/kvm/kvm_main.c             | 10 ++++++++--
 7 files changed, 38 insertions(+), 17 deletions(-)

Comments

Dmitry Osipenko April 4, 2024, 4:03 p.m. UTC | #1
Hi David,

On 2/29/24 05:57, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> Handle non-refcounted pages in __kvm_faultin_pfn. This allows the
> host to map memory into the guest that is backed by non-refcounted
> struct pages - for example, the tail pages of higher order non-compound
> pages allocated by the amdgpu driver via ttm_pool_alloc_page.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>

This patch has a problem on the v6.8 kernel. Pierre-Eric of AMD found
that Qemu crashes with a "kvm bad address" error when booting the
Ubuntu 23.10 ISO with virtio-gpu disabled, and I was able to reproduce
it. Pierre-Eric said this problem didn't exist on the v6.7 kernel with
the v10 kvm patches. Could you please take a look at this issue?

To reproduce the bug, run Qemu like this and load the Ubuntu installer:

  qemu-system-x86_64 -boot d -cdrom ubuntu-23.10.1-desktop-amd64.iso -m 4G --enable-kvm -display gtk -smp 1 -machine q35

Qemu fails with "error: kvm run failed Bad address"

On the host, the kernel logs this warning:

 ------------[ cut here ]------------
 WARNING: CPU: 19 PID: 11696 at mm/gup.c:229 try_grab_page+0x64/0x100
 Call Trace:
  <TASK>
  ? try_grab_page+0x64/0x100
  ? __warn+0x81/0x130
  ? try_grab_page+0x64/0x100
  ? report_bug+0x171/0x1a0
  ? handle_bug+0x3c/0x80
  ? exc_invalid_op+0x17/0x70
  ? asm_exc_invalid_op+0x1a/0x20
  ? try_grab_page+0x64/0x100
  follow_page_pte+0xfa/0x4b0
  __get_user_pages+0xe5/0x6e0
  get_user_pages_unlocked+0xe7/0x370
  hva_to_pfn+0xa2/0x760 [kvm]
  ? free_unref_page+0xf9/0x180
  kvm_faultin_pfn+0x112/0x610 [kvm]
  kvm_tdp_page_fault+0x104/0x150 [kvm]
  kvm_mmu_page_fault+0x298/0x860 [kvm]
  kvm_arch_vcpu_ioctl_run+0xc7d/0x16b0 [kvm]
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? kvm_arch_vcpu_put+0x128/0x190 [kvm]
  ? srso_alias_return_thunk+0x5/0xfbef5
  kvm_vcpu_ioctl+0x199/0x700 [kvm]
  __x64_sys_ioctl+0x94/0xd0
  do_syscall_64+0x86/0x170
  ? kvm_on_user_return+0x64/0x90 [kvm]
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? fire_user_return_notifiers+0x37/0x70
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? syscall_exit_to_user_mode+0x80/0x230
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? do_syscall_64+0x96/0x170
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? do_syscall_64+0x96/0x170
  ? do_syscall_64+0x96/0x170
  ? do_syscall_64+0x96/0x170
  ? srso_alias_return_thunk+0x5/0xfbef5
  ? do_syscall_64+0x96/0x170
  ? srso_alias_return_thunk+0x5/0xfbef5
  entry_SYSCALL_64_after_hwframe+0x6e/0x76
 ---[ end trace 0000000000000000 ]---
David Stevens April 15, 2024, 7:28 a.m. UTC | #2
On Fri, Apr 5, 2024 at 1:03 AM Dmitry Osipenko
<dmitry.osipenko@collabora.com> wrote:
>
> Hi David,
>
> On 2/29/24 05:57, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> >
> > Handle non-refcounted pages in __kvm_faultin_pfn. This allows the
> > host to map memory into the guest that is backed by non-refcounted
> > struct pages - for example, the tail pages of higher order non-compound
> > pages allocated by the amdgpu driver via ttm_pool_alloc_page.
> >
> > Signed-off-by: David Stevens <stevensd@chromium.org>
>
> This patch has a problem on the v6.8 kernel. Pierre-Eric of AMD found
> that Qemu crashes with a "kvm bad address" error when booting the
> Ubuntu 23.10 ISO with virtio-gpu disabled, and I was able to reproduce
> it. Pierre-Eric said this problem didn't exist on the v6.7 kernel with
> the v10 kvm patches. Could you please take a look at this issue?

This failure is due to a minor conflict with commit d02c357e5bfa
("KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is
changing").

My patch series makes __kvm_faultin_pfn no longer take a reference to
the page associated with the returned pfn. That conflicts with the
kvm_release_pfn_clean call that commit added to kvm_faultin_pfn, since
there is no longer a reference to release. Replacing that call with
kvm_set_page_accessed fixes the failure.
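
Something like the following should be all that's needed (an untested
sketch, written from memory against the retry path that commit added
to kvm_faultin_pfn):

 	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq,
 					    fault->gfn)) {
-		kvm_release_pfn_clean(fault->pfn);
+		/*
+		 * No reference was taken, so there is nothing to release;
+		 * kvm_set_page_accessed ignores a NULL accessed_page.
+		 */
+		kvm_set_page_accessed(fault->accessed_page);
 		return RET_PF_RETRY;
 	}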

Sean, is there any path towards getting this series merged, or is it
blocked on cleaning up the issues in KVM code raised by Christoph? I'm
no longer working on the same projects I was when I first started
trying to upstream this code 3-ish years ago, so if there is a
significant amount of work left to upstream this, I need to pass
things on to someone else.

-David
Paolo Bonzini April 15, 2024, 9:36 a.m. UTC | #3
On Mon, Apr 15, 2024 at 9:29 AM David Stevens <stevensd@chromium.org> wrote:
> Sean, is there any path towards getting this series merged, or is it
> blocked on cleaning up the issues in KVM code raised by Christoph?

I think we can proceed after that discussion. The series makes things
_more_ consistent by not using refcounts at all for the secondary page
tables, and Sean even had ideas on how to avoid the difference between
the 32- and 64-bit versions.

Paolo

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4936a8c5829b..f9046912bb43 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2924,6 +2924,11 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 	bool host_writable = !fault || fault->map_writable;
 	bool prefetch = !fault || fault->prefetch;
 	bool write_fault = fault && fault->write;
+	/*
+	 * Prefetching uses gfn_to_page_many_atomic, which never gets
+	 * non-refcounted pages.
+	 */
+	bool is_refcounted = !fault || !!fault->accessed_page;
 
 	if (unlikely(is_noslot_pfn(pfn))) {
 		vcpu->stat.pf_mmio_spte_created++;
@@ -2951,7 +2956,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 	}
 
 	wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
-			   true, host_writable, true, &spte);
+			   true, host_writable, is_refcounted, &spte);
 
 	if (*sptep == spte) {
 		ret = RET_PF_SPURIOUS;
@@ -4319,8 +4324,8 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
 		return -EFAULT;
 	}
 
-	r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
-			     &max_order);
+	r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn,
+			     &fault->pfn, &fault->accessed_page, &max_order);
 	if (r) {
 		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
 		return r;
@@ -4330,6 +4335,9 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
 			       fault->max_level);
 	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
 
+	/* kvm_gmem_get_pfn takes a refcount, but accessed_page doesn't need it. */
+	put_page(fault->accessed_page);
+
 	return RET_PF_CONTINUE;
 }
 
@@ -4339,10 +4347,10 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	struct kvm_follow_pfn kfp = {
 		.slot = slot,
 		.gfn = fault->gfn,
-		.flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0),
+		.flags = fault->write ? FOLL_WRITE : 0,
 		.try_map_writable = true,
 		.guarded_by_mmu_notifier = true,
-		.allow_non_refcounted_struct_page = false,
+		.allow_non_refcounted_struct_page = shadow_refcounted_mask,
 	};
 
 	/*
@@ -4359,6 +4367,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 			fault->slot = NULL;
 			fault->pfn = KVM_PFN_NOSLOT;
 			fault->map_writable = false;
+			fault->accessed_page = NULL;
 			return RET_PF_CONTINUE;
 		}
 		/*
@@ -4422,6 +4431,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 success:
 	fault->hva = kfp.hva;
 	fault->map_writable = kfp.writable;
+	fault->accessed_page = kfp.refcounted_page;
 	return RET_PF_CONTINUE;
 }
 
@@ -4510,8 +4520,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	r = direct_map(vcpu, fault);
 
 out_unlock:
+	kvm_set_page_accessed(fault->accessed_page);
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
 	return r;
 }
 
@@ -4586,8 +4596,8 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 	r = kvm_tdp_mmu_map(vcpu, fault);
 
 out_unlock:
+	kvm_set_page_accessed(fault->accessed_page);
 	read_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
 	return r;
 }
 #endif
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 0669a8a668ca..0b05183600af 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -240,6 +240,8 @@ struct kvm_page_fault {
 	kvm_pfn_t pfn;
 	hva_t hva;
 	bool map_writable;
+	/* Does NOT have an elevated refcount */
+	struct page *accessed_page;
 
 	/*
 	 * Indicates the guest is trying to write a gfn that contains one or
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index c965f77ac4d5..b39dce802394 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -847,8 +847,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	r = FNAME(fetch)(vcpu, fault, &walker);
 
 out_unlock:
+	kvm_set_page_accessed(fault->accessed_page);
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
 	return r;
 }
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ee497fb78d90..0524be7c0796 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -958,7 +958,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 	else
 		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
 				   fault->pfn, iter->old_spte, fault->prefetch, true,
-				   fault->map_writable, true, &new_spte);
+				   fault->map_writable, !!fault->accessed_page,
+				   &new_spte);
 
 	if (new_spte == iter->old_spte)
 		ret = RET_PF_SPURIOUS;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d19a418df04b..ea34eae6cfa4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2426,11 +2426,13 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 
 #ifdef CONFIG_KVM_PRIVATE_MEM
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
-		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
+		     gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
+		     int *max_order);
 #else
 static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn,
-				   kvm_pfn_t *pfn, int *max_order)
+				   kvm_pfn_t *pfn, struct page **page,
+				   int *max_order)
 {
 	KVM_BUG_ON(1, kvm);
 	return -EIO;
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 0f4e0cf4f158..dabcca2ecc37 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -483,12 +483,12 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
 }
 
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
-		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+		     gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
+		     int *max_order)
 {
 	pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
 	struct kvm_gmem *gmem;
 	struct folio *folio;
-	struct page *page;
 	struct file *file;
 	int r;
 
@@ -514,9 +514,9 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 		goto out_unlock;
 	}
 
-	page = folio_file_page(folio, index);
+	*page = folio_file_page(folio, index);
 
-	*pfn = page_to_pfn(page);
+	*pfn = page_to_pfn(*page);
 	if (max_order)
 		*max_order = 0;
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 235c92830cdc..1f5d2a1e63a9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3284,11 +3284,17 @@ void kvm_set_page_dirty(struct page *page)
 }
 EXPORT_SYMBOL_GPL(kvm_set_page_dirty);
 
-void kvm_set_page_accessed(struct page *page)
+static void __kvm_set_page_accessed(struct page *page)
 {
 	if (kvm_is_ad_tracked_page(page))
 		mark_page_accessed(page);
 }
+
+void kvm_set_page_accessed(struct page *page)
+{
+	if (page)
+		__kvm_set_page_accessed(page);
+}
 EXPORT_SYMBOL_GPL(kvm_set_page_accessed);
 
 void kvm_release_page_clean(struct page *page)
@@ -3298,7 +3304,7 @@ void kvm_release_page_clean(struct page *page)
 	if (!page)
 		return;
 
-	kvm_set_page_accessed(page);
+	__kvm_set_page_accessed(page);
 	put_page(page);
 }
 EXPORT_SYMBOL_GPL(kvm_release_page_clean);