Message ID | 20240229025759.1187910-9-stevensd@google.com |
---|---|
State | New, archived |
Series | KVM: allow mapping non-refcounted pages |
Hi David,

On 2/29/24 05:57, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
>
> Handle non-refcounted pages in __kvm_faultin_pfn. This allows the
> host to map memory into the guest that is backed by non-refcounted
> struct pages - for example, the tail pages of higher order non-compound
> pages allocated by the amdgpu driver via ttm_pool_alloc_page.
>
> Signed-off-by: David Stevens <stevensd@chromium.org>

This patch has a problem on the v6.8 kernel. Pierre-Eric of AMD found
that Qemu crashes with a "kvm bad address" error when booting the Ubuntu
23.10 ISO with virtio-gpu disabled, and I was able to reproduce it.
Pierre-Eric said this problem didn't exist with the v6.7 kernel and v10
of the kvm patches. Could you please take a look at this issue?

To reproduce the bug, run Qemu like this and load the Ubuntu installer:

  qemu-system-x86_64 -boot d -cdrom ubuntu-23.10.1-desktop-amd64.iso -m 4G --enable-kvm -display gtk -smp 1 -machine q35

Qemu fails with "error: kvm run failed Bad address".

On the host kernel there is this warning:

------------[ cut here ]------------
WARNING: CPU: 19 PID: 11696 at mm/gup.c:229 try_grab_page+0x64/0x100
Call Trace:
 <TASK>
 ? try_grab_page+0x64/0x100
 ? __warn+0x81/0x130
 ? try_grab_page+0x64/0x100
 ? report_bug+0x171/0x1a0
 ? handle_bug+0x3c/0x80
 ? exc_invalid_op+0x17/0x70
 ? asm_exc_invalid_op+0x1a/0x20
 ? try_grab_page+0x64/0x100
 follow_page_pte+0xfa/0x4b0
 __get_user_pages+0xe5/0x6e0
 get_user_pages_unlocked+0xe7/0x370
 hva_to_pfn+0xa2/0x760 [kvm]
 ? free_unref_page+0xf9/0x180
 kvm_faultin_pfn+0x112/0x610 [kvm]
 kvm_tdp_page_fault+0x104/0x150 [kvm]
 kvm_mmu_page_fault+0x298/0x860 [kvm]
 kvm_arch_vcpu_ioctl_run+0xc7d/0x16b0 [kvm]
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? kvm_arch_vcpu_put+0x128/0x190 [kvm]
 ? srso_alias_return_thunk+0x5/0xfbef5
 kvm_vcpu_ioctl+0x199/0x700 [kvm]
 __x64_sys_ioctl+0x94/0xd0
 do_syscall_64+0x86/0x170
 ? kvm_on_user_return+0x64/0x90 [kvm]
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? fire_user_return_notifiers+0x37/0x70
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? syscall_exit_to_user_mode+0x80/0x230
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? do_syscall_64+0x96/0x170
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? do_syscall_64+0x96/0x170
 ? do_syscall_64+0x96/0x170
 ? do_syscall_64+0x96/0x170
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? do_syscall_64+0x96/0x170
 ? srso_alias_return_thunk+0x5/0xfbef5
 entry_SYSCALL_64_after_hwframe+0x6e/0x76
---[ end trace 0000000000000000 ]---
On Fri, Apr 5, 2024 at 1:03 AM Dmitry Osipenko
<dmitry.osipenko@collabora.com> wrote:
>
> Hi David,
>
> On 2/29/24 05:57, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> >
> > Handle non-refcounted pages in __kvm_faultin_pfn. This allows the
> > host to map memory into the guest that is backed by non-refcounted
> > struct pages - for example, the tail pages of higher order non-compound
> > pages allocated by the amdgpu driver via ttm_pool_alloc_page.
> >
> > Signed-off-by: David Stevens <stevensd@chromium.org>
>
> This patch has a problem on the v6.8 kernel. Pierre-Eric of AMD found
> that Qemu crashes with a "kvm bad address" error when booting the
> Ubuntu 23.10 ISO with virtio-gpu disabled, and I was able to reproduce
> it. Pierre-Eric said this problem didn't exist with the v6.7 kernel
> and v10 of the kvm patches. Could you please take a look at this issue?

This failure is due to a minor conflict with commit d02c357e5bfa
("KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is
changing").

My patch series makes __kvm_faultin_pfn no longer take a reference to
the page associated with the returned pfn. That conflicts with the call
to kvm_release_pfn_clean that the commit added to kvm_faultin_pfn,
since there is no longer a reference to release. Replacing that call
with kvm_set_page_accessed fixes the failure.

Sean, is there any path towards getting this series merged, or is it
blocked on cleaning up the issues in KVM code raised by Christoph? I'm
no longer working on the same projects I was when I first started
trying to upstream this code 3-ish years ago, so if there is a
significant amount of work left to upstream this, I need to pass things
on to someone else.

-David
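For concreteness, the fix described above amounts to a change along these
lines in kvm_faultin_pfn (a sketch only; the surrounding context lines are
reconstructed from commit d02c357e5bfa as an assumption and may not match
any given tree exactly):

@@ kvm_faultin_pfn @@
 	if (fault->slot &&
 	    mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn)) {
-		kvm_release_pfn_clean(fault->pfn);
+		kvm_set_page_accessed(fault->accessed_page);
 		return RET_PF_RETRY;
 	}

With the series applied, no reference is held at this point, so
kvm_release_pfn_clean would drop a reference that was never taken;
kvm_set_page_accessed tolerates a NULL accessed_page and releases nothing.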
On Mon, Apr 15, 2024 at 9:29 AM David Stevens <stevensd@chromium.org> wrote:
> Sean, is there any path towards getting this series merged, or is it
> blocked on cleaning up the issues in KVM code raised by Christoph?

I think after discussing that we can proceed. The series makes things
_more_ consistent by not using refcounts at all for the secondary page
tables, and Sean even had ideas on how to avoid the difference between
the 32- and 64-bit versions.

Paolo
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4936a8c5829b..f9046912bb43 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2924,6 +2924,11 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 	bool host_writable = !fault || fault->map_writable;
 	bool prefetch = !fault || fault->prefetch;
 	bool write_fault = fault && fault->write;
+	/*
+	 * Prefetching uses gfn_to_page_many_atomic, which never gets
+	 * non-refcounted pages.
+	 */
+	bool is_refcounted = !fault || !!fault->accessed_page;
 
 	if (unlikely(is_noslot_pfn(pfn))) {
 		vcpu->stat.pf_mmio_spte_created++;
@@ -2951,7 +2956,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 	}
 
 	wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
-			   true, host_writable, true, &spte);
+			   true, host_writable, is_refcounted, &spte);
 
 	if (*sptep == spte) {
 		ret = RET_PF_SPURIOUS;
@@ -4319,8 +4324,8 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
 		return -EFAULT;
 	}
 
-	r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
-			     &max_order);
+	r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn,
+			     &fault->pfn, &fault->accessed_page, &max_order);
 	if (r) {
 		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
 		return r;
 	}
@@ -4330,6 +4335,9 @@ static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
 			       fault->max_level);
 	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
 
+	/* kvm_gmem_get_pfn takes a refcount, but accessed_page doesn't need it. */
+	put_page(fault->accessed_page);
+
 	return RET_PF_CONTINUE;
 }
@@ -4339,10 +4347,10 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	struct kvm_follow_pfn kfp = {
 		.slot = slot,
 		.gfn = fault->gfn,
-		.flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0),
+		.flags = fault->write ? FOLL_WRITE : 0,
 		.try_map_writable = true,
 		.guarded_by_mmu_notifier = true,
-		.allow_non_refcounted_struct_page = false,
+		.allow_non_refcounted_struct_page = shadow_refcounted_mask,
 	};
 
 	/*
@@ -4359,6 +4367,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 		fault->slot = NULL;
 		fault->pfn = KVM_PFN_NOSLOT;
 		fault->map_writable = false;
+		fault->accessed_page = NULL;
 		return RET_PF_CONTINUE;
 	}
 	/*
@@ -4422,6 +4431,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 success:
 	fault->hva = kfp.hva;
 	fault->map_writable = kfp.writable;
+	fault->accessed_page = kfp.refcounted_page;
 	return RET_PF_CONTINUE;
 }
@@ -4510,8 +4520,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	r = direct_map(vcpu, fault);
 
 out_unlock:
+	kvm_set_page_accessed(fault->accessed_page);
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
 	return r;
 }
@@ -4586,8 +4596,8 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 	r = kvm_tdp_mmu_map(vcpu, fault);
 
 out_unlock:
+	kvm_set_page_accessed(fault->accessed_page);
 	read_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
 	return r;
 }
 #endif
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 0669a8a668ca..0b05183600af 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -240,6 +240,8 @@ struct kvm_page_fault {
 	kvm_pfn_t pfn;
 	hva_t hva;
 	bool map_writable;
+	/* Does NOT have an elevated refcount */
+	struct page *accessed_page;
 
 	/*
 	 * Indicates the guest is trying to write a gfn that contains one or
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index c965f77ac4d5..b39dce802394 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -847,8 +847,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	r = FNAME(fetch)(vcpu, fault, &walker);
 
 out_unlock:
+	kvm_set_page_accessed(fault->accessed_page);
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(fault->pfn);
 	return r;
 }
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ee497fb78d90..0524be7c0796 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -958,7 +958,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 	else
 		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
 				   fault->pfn, iter->old_spte, fault->prefetch, true,
-				   fault->map_writable, true, &new_spte);
+				   fault->map_writable, !!fault->accessed_page,
+				   &new_spte);
 
 	if (new_spte == iter->old_spte)
 		ret = RET_PF_SPURIOUS;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d19a418df04b..ea34eae6cfa4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2426,11 +2426,13 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 
 #ifdef CONFIG_KVM_PRIVATE_MEM
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
-		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
+		     gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
+		     int *max_order);
 #else
 static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn,
-				   kvm_pfn_t *pfn, int *max_order)
+				   kvm_pfn_t *pfn, struct page **page,
+				   int *max_order)
 {
 	KVM_BUG_ON(1, kvm);
 	return -EIO;
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 0f4e0cf4f158..dabcca2ecc37 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -483,12 +483,12 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
 }
 
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
-		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
+		     gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
+		     int *max_order)
 {
 	pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
 	struct kvm_gmem *gmem;
 	struct folio *folio;
-	struct page *page;
 	struct file *file;
 	int r;
 
@@ -514,9 +514,9 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 		goto out_unlock;
 	}
 
-	page = folio_file_page(folio, index);
+	*page = folio_file_page(folio, index);
 
-	*pfn = page_to_pfn(page);
+	*pfn = page_to_pfn(*page);
 	if (max_order)
 		*max_order = 0;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 235c92830cdc..1f5d2a1e63a9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3284,11 +3284,17 @@ void kvm_set_page_dirty(struct page *page)
 }
 EXPORT_SYMBOL_GPL(kvm_set_page_dirty);
 
-void kvm_set_page_accessed(struct page *page)
+static void __kvm_set_page_accessed(struct page *page)
 {
 	if (kvm_is_ad_tracked_page(page))
 		mark_page_accessed(page);
 }
+
+void kvm_set_page_accessed(struct page *page)
+{
+	if (page)
+		__kvm_set_page_accessed(page);
+}
 EXPORT_SYMBOL_GPL(kvm_set_page_accessed);
 
 void kvm_release_page_clean(struct page *page)
@@ -3298,7 +3304,7 @@ void kvm_release_page_clean(struct page *page)
 	if (!page)
 		return;
 
-	kvm_set_page_accessed(page);
+	__kvm_set_page_accessed(page);
 	put_page(page);
 }
 EXPORT_SYMBOL_GPL(kvm_release_page_clean);
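To make the ownership contract the diff establishes concrete: after this
change, __kvm_faultin_pfn takes no page reference, and fault->accessed_page
is non-NULL only when the pfn is backed by a refcounted struct page. The
following caller-side sketch is an illustration distilled from the
direct_page_fault hunk above, not verbatim kernel code (the function name
example_direct_page_fault is hypothetical):

/* Illustration only: a fault handler under the new no-refcount contract. */
static int example_direct_page_fault(struct kvm_vcpu *vcpu,
				     struct kvm_page_fault *fault)
{
	int r;

	/* Resolves fault->pfn and fault->accessed_page; takes NO reference. */
	r = __kvm_faultin_pfn(vcpu, fault);
	if (r != RET_PF_CONTINUE)
		return r;

	write_lock(&vcpu->kvm->mmu_lock);
	r = direct_map(vcpu, fault);

	/*
	 * Mark the page accessed before dropping mmu_lock; the series relies
	 * on the mmu notifier (kfp.guarded_by_mmu_notifier), not a refcount,
	 * to keep the page from being freed underneath us. A NULL
	 * accessed_page - i.e. a non-refcounted page - is a no-op.
	 */
	kvm_set_page_accessed(fault->accessed_page);
	write_unlock(&vcpu->kvm->mmu_lock);

	/* No kvm_release_pfn_clean(): there is no reference to release. */
	return r;
}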