Message ID | 20230718234512.1690985-2-seanjc@google.com (mailing list archive)
---|---
State | New, archived
Series | KVM: guest_memfd() and per-page attributes
On Wed Jul 19, 2023 at 2:44 AM EEST, Sean Christopherson wrote:
> 	/* Huge pages aren't expected to be modified without first being zapped. */
> -	WARN_ON(pte_huge(range->pte) || range->start + 1 != range->end);
> +	WARN_ON(pte_huge(range->arg.pte) || range->start + 1 != range->end);

Not familiar with this code. Just checking whether instead pr_{warn,err}()
combined with return false would be a more graceful option?

BR, Jarkko
On Wed, Jul 19, 2023, Jarkko Sakkinen wrote:
> On Wed Jul 19, 2023 at 2:44 AM EEST, Sean Christopherson wrote:
> > 	/* Huge pages aren't expected to be modified without first being zapped. */
> > -	WARN_ON(pte_huge(range->pte) || range->start + 1 != range->end);
> > +	WARN_ON(pte_huge(range->arg.pte) || range->start + 1 != range->end);
>
> Not familiar with this code. Just checking whether instead pr_{warn,err}()

The "full" WARN is desirable, as this is effectively an assert on the contract
between the primary MMU, generic KVM code, and x86's TDP MMU. The .change_pte()
mmu_notifier callback doesn't allow for hugepages, i.e. it's a (likely fatal)
kernel bug if a hugepage is encountered at this point. Ditto for the
"start + 1 == end" check: if that fails, then generic KVM likely has a fatal bug.

> combined with return false would be a more graceful option?

The return value communicates whether or not a TLB flush is needed, not whether
or not the operation was successful, i.e. there is no way to cancel the
unexpected PTE change.
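[Editor's note] A minimal, self-contained sketch of the contract Sean describes
(not kernel code; set_spte_gfn_stub() and the printf stand in for the real
handler and for kvm_flush_remote_tlbs()): the handler's bool return value only
tells the caller whether a TLB flush is needed, so returning false could not
roll back the PTE change.

#include <stdbool.h>
#include <stdio.h>

static bool set_spte_gfn_stub(void)
{
	/* The SPTE update itself happens unconditionally here ... */
	return true;	/* ... and "true" only means "caller must flush TLBs". */
}

int main(void)
{
	bool flush = false;

	flush |= set_spte_gfn_stub();
	if (flush)
		printf("flush remote TLBs\n");	/* stand-in for kvm_flush_remote_tlbs() */
	return 0;
}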
On 7/19/23 01:44, Sean Christopherson wrote:
> +	BUILD_BUG_ON(sizeof(gfn_range.arg) != sizeof(gfn_range.arg.raw));
> +	BUILD_BUG_ON(sizeof(range->arg) != sizeof(range->arg.raw));
> +	BUILD_BUG_ON(sizeof(gfn_range.arg) != sizeof(range->arg));

I think these should be static assertions near the definition of the structs.
However, another possibility is to remove 'raw' and just assign the whole union.

Apart from this,

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Paolo
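[Editor's note] A compilable sketch of the two alternatives Paolo mentions,
using deliberately simplified stand-in types (the pte_t and kvm_gfn_range below
are not the real kernel definitions): a static assertion placed next to the
struct, and assigning the whole union so that no 'raw' member is needed.

#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;
typedef struct { u64 val; } pte_t;	/* stand-in for the kernel's pte_t */

struct kvm_gfn_range {
	union {
		pte_t pte;
		u64 raw;
	} arg;
};

/* Option 1: assert the size once, next to the definition, not in the handler. */
static_assert(sizeof(((struct kvm_gfn_range *)0)->arg) == sizeof(u64),
	      "arg union must fit in a single u64");

int main(void)
{
	struct kvm_gfn_range src = {0}, dst;

	/* Option 2: drop 'raw' and copy the union as a whole. */
	dst.arg = src.arg;
	(void)dst;
	return 0;
}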
On Tue, Jul 18, 2023 at 04:44:44PM -0700, Sean Christopherson wrote:

May I know why KVM now needs to register the .change_pte() callback?

As also commented in kvm_mmu_notifier_change_pte(), .change_pte() must be
surrounded by .invalidate_range_{start,end}().

While kvm_mmu_notifier_invalidate_range_start() has called kvm_unmap_gfn_range()
to zap all leaf SPTEs, and the page fault path will not install new SPTEs
successfully before kvm_mmu_notifier_invalidate_range_end(), kvm_set_spte_gfn()
should not be able to find any shadow-present leaf entries whose PFN it could
update.

Or could we just delete
"kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);"
from kvm_mmu_notifier_change_pte() completely?

> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 6db9ef288ec3..55f03a68f1cd 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1721,7 +1721,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>  
>  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
> -	kvm_pfn_t pfn = pte_pfn(range->pte);
> +	kvm_pfn_t pfn = pte_pfn(range->arg.pte);
>  
>  	if (!kvm->arch.mmu.pgt)
>  		return false;
> diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
> index e8c08988ed37..7b2ac1319d70 100644
> --- a/arch/mips/kvm/mmu.c
> +++ b/arch/mips/kvm/mmu.c
> @@ -447,7 +447,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>  	gpa_t gpa = range->start << PAGE_SHIFT;
> -	pte_t hva_pte = range->pte;
> +	pte_t hva_pte = range->arg.pte;
>  	pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa);
>  	pte_t old_pte;
>  
> diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
> index f2eb47925806..857f4312b0f8 100644
> --- a/arch/riscv/kvm/mmu.c
> +++ b/arch/riscv/kvm/mmu.c
> @@ -559,7 +559,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>  	int ret;
> -	kvm_pfn_t pfn = pte_pfn(range->pte);
> +	kvm_pfn_t pfn = pte_pfn(range->arg.pte);
>  
>  	if (!kvm->arch.pgd)
>  		return false;
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index ec169f5c7dce..d72f2b20f430 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1588,7 +1588,7 @@ static __always_inline bool kvm_handle_gfn_range(struct kvm *kvm,
>  	for_each_slot_rmap_range(range->slot, PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
>  				 range->start, range->end - 1, &iterator)
>  		ret |= handler(kvm, iterator.rmap, range->slot, iterator.gfn,
> -			       iterator.level, range->pte);
> +			       iterator.level, range->arg.pte);
>  
>  	return ret;
>  }
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 512163d52194..6250bd3d20c1 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1241,7 +1241,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
>  	u64 new_spte;
>  
>  	/* Huge pages aren't expected to be modified without first being zapped. */
> -	WARN_ON(pte_huge(range->pte) || range->start + 1 != range->end);
> +	WARN_ON(pte_huge(range->arg.pte) || range->start + 1 != range->end);
>  
>  	if (iter->level != PG_LEVEL_4K ||
>  	    !is_shadow_present_pte(iter->old_spte))
> @@ -1255,9 +1255,9 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
>  	 */
>  	tdp_mmu_iter_set_spte(kvm, iter, 0);
>  
> -	if (!pte_write(range->pte)) {
> +	if (!pte_write(range->arg.pte)) {
>  		new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
> -								  pte_pfn(range->pte));
> +								  pte_pfn(range->arg.pte));
>  
>  		tdp_mmu_iter_set_spte(kvm, iter, new_spte);
>  	}
On 2023-07-21 at 14:26:11 +0800, Yan Zhao wrote:
> On Tue, Jul 18, 2023 at 04:44:44PM -0700, Sean Christopherson wrote:
>
> May I know why KVM now needs to register the .change_pte() callback?

I can see the original purpose was "setting a pte in the shadow page table
directly, instead of flushing the shadow page table entry and then getting
vmexit to set it" [1]. IIUC, KVM is expected to directly make the new pte
present for new pages in this callback, e.g. for COW.

> As also commented in kvm_mmu_notifier_change_pte(), .change_pte() must be
> surrounded by .invalidate_range_{start,end}().
>
> While kvm_mmu_notifier_invalidate_range_start() has called kvm_unmap_gfn_range()
> to zap all leaf SPTEs, and the page fault path will not install new SPTEs
> successfully before kvm_mmu_notifier_invalidate_range_end(), kvm_set_spte_gfn()
> should not be able to find any shadow-present leaf entries whose PFN it could
> update.

I also failed to figure out how kvm_set_spte_gfn() could pass the
!is_shadow_present_pte(iter.old_spte) checks and then write the new pte.

[1] https://lore.kernel.org/all/200909222039.n8MKd4TL002696@imap1.linux-foundation.org/

Thanks,
Yilun

> Or could we just delete
> "kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);"
> from kvm_mmu_notifier_change_pte() completely?
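[Editor's note] A toy model of the ordering Yan and Yilun describe; it is only
an illustration (the functions below are stand-ins, not the real mmu_notifier
or KVM code). Because .change_pte() always runs inside the
invalidate_range_start()/end() window, every leaf SPTE has already been zapped
and the fault path cannot reinstall one yet, so the change_pte handler finds
nothing shadow-present to update.

#include <stdbool.h>
#include <stdio.h>

static bool spte_present = true;	/* one pretend leaf SPTE */
static bool in_invalidate_window;

static void invalidate_range_start(void)
{
	spte_present = false;		/* kvm_unmap_gfn_range() zaps leaf SPTEs */
	in_invalidate_window = true;
}

static void invalidate_range_end(void)
{
	in_invalidate_window = false;
}

static void change_pte(void)
{
	if (!spte_present)		/* always the case inside the window */
		printf("change_pte: no shadow-present SPTE to update\n");
}

static void page_fault(void)
{
	if (!in_invalidate_window)	/* faults re-install SPTEs only after end() */
		spte_present = true;
}

int main(void)
{
	invalidate_range_start();
	change_pte();			/* finds nothing, matching Yan's observation */
	invalidate_range_end();
	page_fault();			/* the SPTE comes back via the fault path instead */
	return 0;
}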
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 6db9ef288ec3..55f03a68f1cd 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1721,7 +1721,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	kvm_pfn_t pfn = pte_pfn(range->pte);
+	kvm_pfn_t pfn = pte_pfn(range->arg.pte);
 
 	if (!kvm->arch.mmu.pgt)
 		return false;
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index e8c08988ed37..7b2ac1319d70 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -447,7 +447,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	gpa_t gpa = range->start << PAGE_SHIFT;
-	pte_t hva_pte = range->pte;
+	pte_t hva_pte = range->arg.pte;
 	pte_t *gpa_pte = kvm_mips_pte_for_gpa(kvm, NULL, gpa);
 	pte_t old_pte;
 
diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c
index f2eb47925806..857f4312b0f8 100644
--- a/arch/riscv/kvm/mmu.c
+++ b/arch/riscv/kvm/mmu.c
@@ -559,7 +559,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	int ret;
-	kvm_pfn_t pfn = pte_pfn(range->pte);
+	kvm_pfn_t pfn = pte_pfn(range->arg.pte);
 
 	if (!kvm->arch.pgd)
 		return false;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ec169f5c7dce..d72f2b20f430 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1588,7 +1588,7 @@ static __always_inline bool kvm_handle_gfn_range(struct kvm *kvm,
 	for_each_slot_rmap_range(range->slot, PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
 				 range->start, range->end - 1, &iterator)
 		ret |= handler(kvm, iterator.rmap, range->slot, iterator.gfn,
-			       iterator.level, range->pte);
+			       iterator.level, range->arg.pte);
 
 	return ret;
 }
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 512163d52194..6250bd3d20c1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1241,7 +1241,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
 	u64 new_spte;
 
 	/* Huge pages aren't expected to be modified without first being zapped. */
-	WARN_ON(pte_huge(range->pte) || range->start + 1 != range->end);
+	WARN_ON(pte_huge(range->arg.pte) || range->start + 1 != range->end);
 
 	if (iter->level != PG_LEVEL_4K ||
 	    !is_shadow_present_pte(iter->old_spte))
@@ -1255,9 +1255,9 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
 	 */
 	tdp_mmu_iter_set_spte(kvm, iter, 0);
 
-	if (!pte_write(range->pte)) {
+	if (!pte_write(range->arg.pte)) {
 		new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
-								  pte_pfn(range->pte));
+								  pte_pfn(range->arg.pte));
 
 		tdp_mmu_iter_set_spte(kvm, iter, new_spte);
 	}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9d3ac7720da9..b901571ab61e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -260,7 +260,10 @@ struct kvm_gfn_range {
 	struct kvm_memory_slot *slot;
 	gfn_t start;
 	gfn_t end;
-	pte_t pte;
+	union {
+		pte_t pte;
+		u64 raw;
+	} arg;
 	bool may_block;
 };
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dfbaafbe3a00..d58b7a506d27 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -526,7 +526,10 @@ typedef void (*on_unlock_fn_t)(struct kvm *kvm);
 struct kvm_hva_range {
 	unsigned long start;
 	unsigned long end;
-	pte_t pte;
+	union {
+		pte_t pte;
+		u64 raw;
+	} arg;
 	hva_handler_t handler;
 	on_lock_fn_t on_lock;
 	on_unlock_fn_t on_unlock;
@@ -562,6 +565,10 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 	struct kvm_memslots *slots;
 	int i, idx;
 
+	BUILD_BUG_ON(sizeof(gfn_range.arg) != sizeof(gfn_range.arg.raw));
+	BUILD_BUG_ON(sizeof(range->arg) != sizeof(range->arg.raw));
+	BUILD_BUG_ON(sizeof(gfn_range.arg) != sizeof(range->arg));
+
 	if (WARN_ON_ONCE(range->end <= range->start))
 		return 0;
 
@@ -591,7 +598,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			 * bother making these conditional (to avoid writes on
 			 * the second or later invocation of the handler).
 			 */
-			gfn_range.pte = range->pte;
+			gfn_range.arg.raw = range->arg.raw;
 			gfn_range.may_block = range->may_block;
 
 			/*
@@ -639,7 +646,7 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
 	const struct kvm_hva_range range = {
 		.start		= start,
 		.end		= end,
-		.pte		= pte,
+		.arg.pte	= pte,
 		.handler	= handler,
 		.on_lock	= (void *)kvm_null_fn,
 		.on_unlock	= (void *)kvm_null_fn,
@@ -659,7 +666,6 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
 	const struct kvm_hva_range range = {
 		.start		= start,
 		.end		= end,
-		.pte		= __pte(0),
 		.handler	= handler,
 		.on_lock	= (void *)kvm_null_fn,
 		.on_unlock	= (void *)kvm_null_fn,
@@ -747,7 +753,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	const struct kvm_hva_range hva_range = {
 		.start		= range->start,
 		.end		= range->end,
-		.pte		= __pte(0),
 		.handler	= kvm_unmap_gfn_range,
 		.on_lock	= kvm_mmu_invalidate_begin,
 		.on_unlock	= kvm_arch_guest_memory_reclaimed,
@@ -812,7 +817,6 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	const struct kvm_hva_range hva_range = {
 		.start		= range->start,
 		.end		= range->end,
-		.pte		= __pte(0),
 		.handler	= (void *)kvm_null_fn,
 		.on_lock	= kvm_mmu_invalidate_end,
 		.on_unlock	= (void *)kvm_null_fn,
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm64/kvm/mmu.c       |  2 +-
 arch/mips/kvm/mmu.c        |  2 +-
 arch/riscv/kvm/mmu.c       |  2 +-
 arch/x86/kvm/mmu/mmu.c     |  2 +-
 arch/x86/kvm/mmu/tdp_mmu.c |  6 +++---
 include/linux/kvm_host.h   |  5 ++++-
 virt/kvm/kvm_main.c        | 16 ++++++++++------
 7 files changed, 21 insertions(+), 14 deletions(-)
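[Editor's note] On the design choice in the patch above: copying the argument
through the 'raw' member, as __kvm_handle_hva_range() does with
"gfn_range.arg.raw = range->arg.raw", forwards whichever union member the
caller filled in without the generic code having to know which one it was;
this only works if 'raw' covers the whole union, which is exactly what the
BUILD_BUG_ONs assert. A small stand-alone sketch of that pattern, with
simplified stand-in types rather than the kernel's definitions:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t u64;
typedef struct { u64 val; } pte_t;	/* stand-in for the kernel's pte_t */

union arg {
	pte_t pte;
	u64 raw;	/* spans the whole union, so copying it copies any member */
};

/* Generic code forwards the argument without knowing which member is in use. */
static union arg forward(union arg in)
{
	union arg out;

	out.raw = in.raw;
	return out;
}

int main(void)
{
	union arg a = { .pte = { .val = 0x1234 } };
	union arg b = forward(a);

	printf("pte.val = 0x%llx\n", (unsigned long long)b.pte.val);
	return 0;
}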