Message ID | 20240805233114.4060019-8-dmatlack@google.com (mailing list archive)
---|---
State | New, archived
Series | KVM: x86/mmu: Optimize TDP MMU huge page recovery during disable-dirty-log
On Mon, Aug 05, 2024, David Matlack wrote:
> Recheck that iter.old_spte still points to a page table when recovering
> huge pages. Since mmu_lock is held for read and tdp_iter_step_up()
> re-reads iter.sptep, it's possible the SPTE was zapped or recovered by
> another CPU in between stepping down and back up.
>
> This avoids a useless cmpxchg (and possibly a remote TLB flush) if
> another CPU is recovering huge SPTEs in parallel (e.g. the NX huge page
> recovery worker, or vCPUs taking faults on the huge page region).
>
> This also makes it clear that tdp_iter_step_up() re-reads the SPTE and
> thus can see a different value, which is not immediately obvious when
> reading the code.
>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 07d5363c9db7..bdc7fd476721 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1619,6 +1619,17 @@ static void recover_huge_pages_range(struct kvm *kvm,
>  		while (max_mapping_level > iter.level)
>  			tdp_iter_step_up(&iter);
>
> +		/*
> +		 * Re-check that iter.old_spte still points to a page table.
> +		 * Since mmu_lock is held for read and tdp_iter_step_up()
> +		 * re-reads iter.sptep, it's possible the SPTE was zapped or
> +		 * recovered by another CPU in between stepping down and
> +		 * stepping back up.
> +		 */
> +		if (!is_shadow_present_pte(iter.old_spte) ||
> +		    is_last_spte(iter.old_spte, iter.level))
> +			continue;

This is the part of the step-up logic that I do not like. Even this check doesn't
guarantee that the SPTE that is being replaced is the same non-leaf SPTE that was
used to reach the leaf SPTE. E.g. in an absurdly theoretical situation, the SPTE
could be zapped and then re-set with another non-leaf SPTE. Which is fine, but
only because of several very subtle mechanisms.

kvm_mmu_max_mapping_level() ensures that there are no write-tracked gfns anywhere
in the huge range, so it's safe to propagate any and all WRITABLE bits. This
requires knowing/remembering that KVM disallows huge pages when a gfn is
write-tracked, and relies on that never changing (which is a fairly safe bet, but
the behavior isn't fully set in stone).

And the presence of a shadow-present leaf SPTE ensures there are no in-flight
mmu_notifier invalidations, as kvm_mmu_notifier_invalidate_range_start() won't
return until all relevant leaf SPTEs have been zapped.

And even more subtly, recover_huge_pages_range() can install a huge SPTE while
tdp_mmu_zap_leafs() is running, e.g. if tdp_mmu_zap_leafs() is processing 4KiB
SPTEs because the greater 2MiB page is being unmapped by the primary MMU, and
tdp_mmu_zap_leafs() yields. That's again safe only because upon regaining
control, tdp_mmu_zap_leafs() will restart at the root and thus observe and zap
the new huge SPTE.

So while I'm pretty sure this logic is functionally ok, its safety is extremely
dependent on a number of behaviors in KVM.

That said, I can't tell which option I dislike less. E.g. we could do something
like this, where kvm_mmu_name_tbd() grabs the pfn+writable information from the
primary MMU's PTE/PMD/PUD. Ideally, KVM would use GUP, but then KVM wouldn't
be able to create huge SPTEs for non-GUP-able memory, e.g. PFNMAP'd memory.

I don't love this either, primarily because not using GUP makes this yet another
custom flow, i.e. isn't any less tricky than reusing a child SPTE. It does have
the advantage of not having to find a shadow-present child though, i.e. is likely
the most performant option. I agree that that likely doesn't matter in practice,
especially since the raw latency of disabling dirty logging isn't terribly
important.

	rcu_read_lock();

	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
retry:
		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
			continue;

		if (iter.level > KVM_MAX_HUGEPAGE_LEVEL ||
		    !is_shadow_present_pte(iter.old_spte))
			continue;

		/*
		 * TODO: this should skip to the end of the parent, because if
		 * the first SPTE can't be made huger, then no SPTEs at this
		 * level can be huger.
		 */
		if (is_last_spte(iter.old_spte, iter.level))
			continue;

		/*
		 * If iter.gfn resides outside of the slot, i.e. the page for
		 * the current level overlaps but is not contained by the slot,
		 * then the SPTE can't be made huge.  More importantly, trying
		 * to query that info from slot->arch.lpage_info will cause an
		 * out-of-bounds access.
		 */
		if (iter.gfn < start || iter.gfn >= end)
			continue;

		if (kvm_mmu_name_tbd(kvm, slot, iter.gfn, &pfn, &writable,
				     &max_mapping_level))
			continue;

		/*
		 * If the range is being invalidated, do not install a SPTE as
		 * KVM may have already zapped this specific gfn, e.g. if KVM's
		 * unmapping has run to completion, but the primary MMU hasn't
		 * zapped its PTEs.  There is no need to check for *past*
		 * invalidations, because all information is gathered while
		 * holding mmu_lock, i.e. it can't have become stale due to an
		 * entire mmu_notifier invalidation sequence completing.
		 */
		if (mmu_invalidate_retry_gfn(kvm, kvm->mmu_invalidate_seq, iter.gfn))
			continue;

		/*
		 * KVM disallows huge pages for write-protected gfns, so it
		 * should be impossible for make_spte() to encounter such a gfn
		 * since write-protecting a gfn requires holding mmu_lock for
		 * write.
		 */
		flush |= __tdp_mmu_map_gfn(...);
		WARN_ON_ONCE(r == RET_PF_EMULATE);
	}

	rcu_read_unlock();

Assuming you don't like the above idea (I'm not sure I do), what if instead of
doing the step-up, KVM starts a second iteration using the shadow page it wants
to replace as the root of the walk?

This has the same subtle dependencies on kvm_mmu_max_mapping_level() and the
ordering with respect to an mmu_notifier invalidation, but it at least avoids
having to reason about the correctness of re-reading a SPTE and modifying the
iteration level within the body of an iteration loop.

It should also yield smaller diffs overall, e.g. no revert, no separate commit
to recheck the SPTE, etc. And I believe it's more performant than the step-up
approach when there are SPTEs that _can't_ be huge, as KVM won't traverse down
into the leafs in that case.

An alternative to the tdp_mmu_iter_needs_reched() logic would be to pass in
&flush, but I think that ends up being more confusing and harder to follow.

static int tdp_mmu_make_huge_spte(struct kvm *kvm, struct tdp_iter *parent,
				  u64 *huge_spte)
{
	struct kvm_mmu_page *root = sptep_to_sp(parent->sptep);
	gfn_t start = parent->gfn;
	gfn_t end = start + ???; /* use parent's level */
	struct tdp_iter iter;

	tdp_root_for_each_leaf_pte(iter, root, start, end) {
		/*
		 * Use the parent iterator when checking for forward progress,
		 * so that KVM doesn't get stuck due to always yielding while
		 * walking child SPTEs.
		 */
		if (tdp_mmu_iter_needs_reched(kvm, parent))
			return -EAGAIN;

		*huge_spte = make_huge_spte(kvm, iter.old_spte);
		return 0;
	}

	return -ENOENT;
}

static void recover_huge_pages_range(struct kvm *kvm,
				     struct kvm_mmu_page *root,
				     const struct kvm_memory_slot *slot)
{
	gfn_t start = slot->base_gfn;
	gfn_t end = start + slot->npages;
	struct tdp_iter iter;
	int max_mapping_level;
	bool flush = false;
	u64 huge_spte;

	if (WARN_ON_ONCE(kvm_slot_dirty_track_enabled(slot)))
		return;

	rcu_read_lock();

	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
restart:
		if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true)) {
			flush = false;
			continue;
		}

		if (iter.level > KVM_MAX_HUGEPAGE_LEVEL ||
		    !is_shadow_present_pte(iter.old_spte))
			continue;

		/*
		 * If iter.gfn resides outside of the slot, i.e. the page for
		 * the current level overlaps but is not contained by the slot,
		 * then the SPTE can't be made huge.  More importantly, trying
		 * to query that info from slot->arch.lpage_info will cause an
		 * out-of-bounds access.
		 */
		if (iter.gfn < start || iter.gfn >= end)
			continue;

		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, iter.gfn);
		if (max_mapping_level < iter.level)
			continue;

		r = tdp_mmu_make_huge_spte(kvm, &iter, &huge_spte);
		if (r == -EAGAIN)
			goto restart;
		else if (r)
			continue;

		/*
		 * If setting the SPTE fails, e.g. because it has been modified
		 * by a different task, iteration will naturally continue with
		 * the next SPTE.  Don't bother retrying this SPTE, races are
		 * uncommon and odds are good the SPTE
		 */
		if (!tdp_mmu_set_spte_atomic(kvm, &iter, huge_spte))
			flush = true;
	}

	if (flush)
		kvm_flush_remote_tlbs_memslot(kvm, slot);

	rcu_read_unlock();
}

static inline bool tdp_mmu_iter_needs_reched(struct kvm *kvm,
					     struct tdp_iter *iter)
{
	/* Ensure forward progress has been made before yielding. */
	return iter->next_last_level_gfn != iter->yielded_gfn &&
	       (need_resched() || rwlock_needbreak(&kvm->mmu_lock));
}

/*
 * Yield if the MMU lock is contended or this thread needs to return control
 * to the scheduler.
 *
 * If this function should yield and flush is set, it will perform a remote
 * TLB flush before yielding.
 *
 * If this function yields, iter->yielded is set and the caller must skip to
 * the next iteration, where tdp_iter_next() will reset the tdp_iter's walk
 * over the paging structures to allow the iterator to continue its traversal
 * from the paging structure root.
 *
 * Returns true if this function yielded.
 */
static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
							  struct tdp_iter *iter,
							  bool flush, bool shared)
{
	WARN_ON_ONCE(iter->yielded);

	if (!tdp_mmu_iter_needs_reched(kvm, iter))
		return false;

	if (flush)
		kvm_flush_remote_tlbs(kvm);

	rcu_read_unlock();

	if (shared)
		cond_resched_rwlock_read(&kvm->mmu_lock);
	else
		cond_resched_rwlock_write(&kvm->mmu_lock);

	rcu_read_lock();

	WARN_ON_ONCE(iter->gfn > iter->next_last_level_gfn);

	iter->yielded = true;
	return true;
}
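For context on the write-tracking dependency called out above, here is a
minimal illustrative sketch of the bookkeeping a max-mapping-level query is
assumed to rely on, modeled on the lpage_info_slot()/disallow_lpage handling in
arch/x86/kvm/mmu/mmu.c; the helper name below is hypothetical, not actual KVM
code:

/*
 * Illustrative sketch only (hypothetical helper): write-tracking a gfn is
 * assumed to bump disallow_lpage for every huge level containing that gfn,
 * so walking down from the largest candidate level never returns a huge
 * level that overlaps a write-tracked gfn.
 */
static int max_level_for_gfn(const struct kvm_memory_slot *slot, gfn_t gfn,
			     int max_level)
{
	for ( ; max_level > PG_LEVEL_4K; max_level--) {
		/* Stop at the first level nothing disallows. */
		if (!lpage_info_slot(gfn, slot, max_level)->disallow_lpage)
			break;
	}
	return max_level;
}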
On Thu, Aug 22, 2024 at 9:50 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Aug 05, 2024, David Matlack wrote:
> > Recheck that iter.old_spte still points to a page table when recovering
> > huge pages. Since mmu_lock is held for read and tdp_iter_step_up()
> > re-reads iter.sptep, it's possible the SPTE was zapped or recovered by
> > another CPU in between stepping down and back up.
> >
> > This avoids a useless cmpxchg (and possibly a remote TLB flush) if
> > another CPU is recovering huge SPTEs in parallel (e.g. the NX huge page
> > recovery worker, or vCPUs taking faults on the huge page region).
> >
> > This also makes it clear that tdp_iter_step_up() re-reads the SPTE and
> > thus can see a different value, which is not immediately obvious when
> > reading the code.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 07d5363c9db7..bdc7fd476721 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1619,6 +1619,17 @@ static void recover_huge_pages_range(struct kvm *kvm,
> >  		while (max_mapping_level > iter.level)
> >  			tdp_iter_step_up(&iter);
> >
> > +		/*
> > +		 * Re-check that iter.old_spte still points to a page table.
> > +		 * Since mmu_lock is held for read and tdp_iter_step_up()
> > +		 * re-reads iter.sptep, it's possible the SPTE was zapped or
> > +		 * recovered by another CPU in between stepping down and
> > +		 * stepping back up.
> > +		 */
> > +		if (!is_shadow_present_pte(iter.old_spte) ||
> > +		    is_last_spte(iter.old_spte, iter.level))
> > +			continue;
>
> This is the part of the step-up logic that I do not like. Even this check doesn't
> guarantee that the SPTE that is being replaced is the same non-leaf SPTE that was
> used to reach the leaf SPTE. E.g. in an absurdly theoretical situation, the SPTE
> could be zapped and then re-set with another non-leaf SPTE. Which is fine, but
> only because of several very subtle mechanisms.

I'm not sure why that matters. The only thing that matters is that the
GFN->PFN and permissions cannot change, and that is guaranteed by
holding mmu_lock for read.

At the end of the day, we never actually care about the value of the
SPTE we are replacing. We only care that it's a non-leaf SPTE.

> kvm_mmu_max_mapping_level() ensures that there are no write-tracked gfns anywhere
> in the huge range, so it's safe to propagate any and all WRITABLE bits. This
> requires knowing/remembering that KVM disallows huge pages when a gfn is
> write-tracked, and relies on that never changing (which is a fairly safe bet, but
> the behavior isn't fully set in stone).
>
> And the presence of a shadow-present leaf SPTE ensures there are no in-flight
> mmu_notifier invalidations, as kvm_mmu_notifier_invalidate_range_start() won't
> return until all relevant leaf SPTEs have been zapped.

As you point out in the next paragraph there could be an inflight invalidate
that yielded, no?

> And even more subtly, recover_huge_pages_range() can install a huge SPTE while
> tdp_mmu_zap_leafs() is running, e.g. if tdp_mmu_zap_leafs() is processing 4KiB
> SPTEs because the greater 2MiB page is being unmapped by the primary MMU, and
> tdp_mmu_zap_leafs() yields. That's again safe only because upon regaining
> control, tdp_mmu_zap_leafs() will restart at the root and thus observe and zap
> the new huge SPTE.

I agree it's subtle, but only in the sense that the TDP MMU is subtle.
Restarting at the root after dropping mmu_lock is a fundamental concept in the
TDP MMU.

> So while I'm pretty sure this logic is functionally ok, its safety is extremely
> dependent on a number of behaviors in KVM.
>
> That said, I can't tell which option I dislike less. E.g. we could do something
> like this, where kvm_mmu_name_tbd() grabs the pfn+writable information from the
> primary MMU's PTE/PMD/PUD. Ideally, KVM would use GUP, but then KVM wouldn't
> be able to create huge SPTEs for non-GUP-able memory, e.g. PFNMAP'd memory.
>
> I don't love this either, primarily because not using GUP makes this yet another
> custom flow

Yeah. I don't like having the huge page recovery path needing its own special
way to construct SPTEs from scratch. e.g. I could see this approach becoming a
problem if KVM gains support for R/W/X GFN attributes.

> i.e. isn't any less tricky than reusing a child SPTE. It does have
> the advantage of not having to find a shadow-present child though, i.e. is likely
> the most performant option. I agree that that likely doesn't matter in practice,
> especially since the raw latency of disabling dirty logging isn't terribly
> important.
>
> [...]
>
> Assuming you don't like the above idea (I'm not sure I do), what if instead of
> doing the step-up, KVM starts a second iteration using the shadow page it wants
> to replace as the root of the walk?
>
> This has the same subtle dependencies on kvm_mmu_max_mapping_level() and the
> ordering with respect to an mmu_notifier invalidation, but it at least avoids
> having to reason about the correctness of re-reading a SPTE and modifying the
> iteration level within the body of an iteration loop.
>
> It should also yield smaller diffs overall, e.g. no revert, no separate commit
> to recheck the SPTE, etc. And I believe it's more performant than the step-up
> approach when there are SPTEs that _can't_ be huge, as KVM won't traverse down
> into the leafs in that case.

This approach looks good to me. I'll try it out and see if I run into any
issues.

> An alternative to the tdp_mmu_iter_needs_reched() logic would be to pass in
> &flush, but I think that ends up being more confusing and harder to follow.

Yeah I think that would be more complicated. If we drop mmu_lock then we need
to re-check kvm_mmu_max_mapping_level() and restart at the root.

> [...]
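For context on the restart-at-root behavior referenced above, here is a
simplified sketch of the TDP iterator's yield handling, assumed from
tdp_iter.c with the non-yield path elided; it is an illustration, not the
exact code in this series:

/*
 * Simplified sketch of how the iterator recovers after a yield: once a
 * walk has yielded (and mmu_lock was potentially dropped), the next step
 * re-walks from the root instead of trusting any previously cached state.
 */
void tdp_iter_next(struct tdp_iter *iter)
{
	if (iter->yielded) {
		tdp_iter_restart(iter);		/* start over from the root */
		return;
	}
	/* ... otherwise advance to the next SPTE, stepping down/side/up ... */
}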
On Thu, Aug 22, 2024, David Matlack wrote:
> On Thu, Aug 22, 2024 at 9:50 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Aug 05, 2024, David Matlack wrote:
> > > Recheck that iter.old_spte still points to a page table when recovering
> > > huge pages. Since mmu_lock is held for read and tdp_iter_step_up()
> > > re-reads iter.sptep, it's possible the SPTE was zapped or recovered by
> > > another CPU in between stepping down and back up.
> > >
> > > This avoids a useless cmpxchg (and possibly a remote TLB flush) if
> > > another CPU is recovering huge SPTEs in parallel (e.g. the NX huge page
> > > recovery worker, or vCPUs taking faults on the huge page region).
> > >
> > > This also makes it clear that tdp_iter_step_up() re-reads the SPTE and
> > > thus can see a different value, which is not immediately obvious when
> > > reading the code.
> > >
> > > Signed-off-by: David Matlack <dmatlack@google.com>
> > > ---
> > >  arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++++
> > >  1 file changed, 11 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 07d5363c9db7..bdc7fd476721 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -1619,6 +1619,17 @@ static void recover_huge_pages_range(struct kvm *kvm,
> > >  		while (max_mapping_level > iter.level)
> > >  			tdp_iter_step_up(&iter);
> > >
> > > +		/*
> > > +		 * Re-check that iter.old_spte still points to a page table.
> > > +		 * Since mmu_lock is held for read and tdp_iter_step_up()
> > > +		 * re-reads iter.sptep, it's possible the SPTE was zapped or
> > > +		 * recovered by another CPU in between stepping down and
> > > +		 * stepping back up.
> > > +		 */
> > > +		if (!is_shadow_present_pte(iter.old_spte) ||
> > > +		    is_last_spte(iter.old_spte, iter.level))
> > > +			continue;
> >
> > This is the part of the step-up logic that I do not like. Even this check doesn't
> > guarantee that the SPTE that is being replaced is the same non-leaf SPTE that was
> > used to reach the leaf SPTE. E.g. in an absurdly theoretical situation, the SPTE
> > could be zapped and then re-set with another non-leaf SPTE. Which is fine, but
> > only because of several very subtle mechanisms.
>
> I'm not sure why that matters. The only thing that matters is that the
> GFN->PFN and permissions cannot change, and that is guaranteed by
> holding mmu_lock for read.

Because it introduces possible edge cases that, while benign, require the reader
to think about and reason through. E.g. if _this_ task is trying to replace a
4KiB page with a 1GiB page, what happens if some other task replaces the parent
2MiB page? It's not immediately obvious that looping on tdp_iter_step_up() will
happily blaze past a !PRESENT SPTE, which might not even be the actual SPTE in
the tree at this point.

> At the end of the day, we never actually care about the value of the
> SPTE we are replacing. We only care that it's a non-leaf SPTE.
>
> > kvm_mmu_max_mapping_level() ensures that there are no write-tracked gfns anywhere
> > in the huge range, so it's safe to propagate any and all WRITABLE bits. This
> > requires knowing/remembering that KVM disallows huge pages when a gfn is
> > write-tracked, and relies on that never changing (which is a fairly safe bet, but
> > the behavior isn't fully set in stone).
> >
> > And the presence of a shadow-present leaf SPTE ensures there are no in-flight
> > mmu_notifier invalidations, as kvm_mmu_notifier_invalidate_range_start() won't
> > return until all relevant leaf SPTEs have been zapped.
>
> As you point out in the next paragraph there could be an inflight invalidate
> that yielded, no?

Yeah, ignore this, I forgot to amend it after remembering that invalidation can
yield.
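For context on the in-flight invalidation ordering discussed above, here is a
simplified, hypothetical helper in the spirit of the mmu_invalidate_retry_gfn()
check used in the first sketch; field names are assumed from struct kvm in
include/linux/kvm_host.h, and the real function also compares the invalidation
sequence count:

/*
 * Hypothetical, simplified helper (not actual KVM code): while an
 * mmu_notifier invalidation is in progress, refuse to install SPTEs for
 * any gfn that falls inside the range being invalidated.
 */
static bool gfn_in_active_invalidation(struct kvm *kvm, gfn_t gfn)
{
	lockdep_assert_held(&kvm->mmu_lock);

	return kvm->mmu_invalidate_in_progress &&
	       gfn >= kvm->mmu_invalidate_range_start &&
	       gfn < kvm->mmu_invalidate_range_end;
}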
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 07d5363c9db7..bdc7fd476721 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1619,6 +1619,17 @@ static void recover_huge_pages_range(struct kvm *kvm,
 		while (max_mapping_level > iter.level)
 			tdp_iter_step_up(&iter);
 
+		/*
+		 * Re-check that iter.old_spte still points to a page table.
+		 * Since mmu_lock is held for read and tdp_iter_step_up()
+		 * re-reads iter.sptep, it's possible the SPTE was zapped or
+		 * recovered by another CPU in between stepping down and
+		 * stepping back up.
+		 */
+		if (!is_shadow_present_pte(iter.old_spte) ||
+		    is_last_spte(iter.old_spte, iter.level))
+			continue;
+
 		if (!tdp_mmu_set_spte_atomic(kvm, &iter, huge_spte))
 			flush = true;
Recheck that iter.old_spte still points to a page table when recovering
huge pages. Since mmu_lock is held for read and tdp_iter_step_up()
re-reads iter.sptep, it's possible the SPTE was zapped or recovered by
another CPU in between stepping down and back up.

This avoids a useless cmpxchg (and possibly a remote TLB flush) if
another CPU is recovering huge SPTEs in parallel (e.g. the NX huge page
recovery worker, or vCPUs taking faults on the huge page region).

This also makes it clear that tdp_iter_step_up() re-reads the SPTE and
thus can see a different value, which is not immediately obvious when
reading the code.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 11 +++++++++++
 1 file changed, 11 insertions(+)
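To illustrate the re-read the changelog refers to, here is a simplified sketch
of stepping up that folds in the sptep/old_spte refresh; it is assumed from
tdp_iter.c and is not the exact implementation (note the _sketch suffix):

/*
 * Simplified sketch: stepping up recomputes sptep from the saved
 * page-table path and re-loads old_spte from memory, so the value seen
 * after stepping up can differ from the one observed while walking down.
 */
static void tdp_iter_step_up_sketch(struct tdp_iter *iter)
{
	iter->level++;
	/* Round the gfn down to the granularity of the new, higher level. */
	iter->gfn &= ~(KVM_PAGES_PER_HPAGE(iter->level) - 1);
	iter->sptep = iter->pt_path[iter->level - 1] +
		      SPTE_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
	iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);	/* fresh read */
}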