Message ID: 20230808071329.19995-1-yan.y.zhao@intel.com
Series: Reduce NUMA balance caused TLB-shootdowns in a VM
On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> >                      !is_last_spte(iter.old_spte, iter.level))
> >                          continue;
> > 
> > +                if (skip_pinned) {
> > +                        kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > +                        struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > +                        struct folio *folio;
> > +
> > +                        if (!page)
> > +                                continue;
> > +
> > +                        folio = page_folio(page);
> > +
> > +                        if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > +                            folio_maybe_dma_pinned(folio))
> > +                                continue;
> > +                }
> > +
> 
> I don't get it..
> 
> The last patch made it so that the NUMA balancing code doesn't change
> page_maybe_dma_pinned() pages to PROT_NONE
> 
> So why doesn't KVM just check if the current and new SPTE are the same
> and refrain from invalidating if nothing changed?

Because KVM doesn't have visibility into the current and new PTEs when the zapping
occurs.  The contract for invalidate_range_start() requires that KVM drop all
references before returning, and so the zapping occurs before change_pte_range()
or change_huge_pmd() have done anything.

> Duplicating the checks here seems very frail to me.

Yes, this approach gets a hard NAK from me.  IIUC, folio_maybe_dma_pinned()
can yield different results purely based on refcounts, i.e. KVM could skip pages
that the primary MMU does not, and thus violate the mmu_notifier contract.  And
in general, I am steadfastly against adding any kind of heuristic to KVM's
zapping logic.

This really needs to be fixed in the primary MMU and not require any direct
involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs
to be skipped.
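For context, the ordering Sean describes can be pictured with the sketch below.  It is
illustrative only, loosely modeled on change_pmd_range()/change_pte_range() in
mm/mprotect.c around v6.5; example_change_ptes_to_prot_none() is a made-up placeholder,
not a kernel function, and signatures are approximate.

#include <linux/mm_types.h>
#include <linux/mmu_notifier.h>

/* Made-up stand-in for the PTE walk done by change_pte_range() /
 * change_huge_pmd(), which is where the "skip pinned pages" decision
 * added by the previous patch is actually made. */
void example_change_ptes_to_prot_none(struct mm_struct *mm,
                                      unsigned long start, unsigned long end);

static void example_numa_protect_range(struct mm_struct *mm,
                                       unsigned long start, unsigned long end)
{
        struct mmu_notifier_range range;

        mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
                                mm, start, end);

        /*
         * Secondary MMUs (KVM) are notified here and must drop their
         * mappings before this call returns, i.e. before the primary MMU
         * has inspected a single PTE, so KVM cannot know which pages will
         * actually be made PROT_NONE and which will be skipped as pinned.
         */
        mmu_notifier_invalidate_range_start(&range);

        example_change_ptes_to_prot_none(mm, start, end);

        mmu_notifier_invalidate_range_end(&range);
}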
On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote:
> > On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > > >                      !is_last_spte(iter.old_spte, iter.level))
> > > >                          continue;
> > > > 
> > > > +                if (skip_pinned) {
> > > > +                        kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > > > +                        struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > > > +                        struct folio *folio;
> > > > +
> > > > +                        if (!page)
> > > > +                                continue;
> > > > +
> > > > +                        folio = page_folio(page);
> > > > +
> > > > +                        if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > > > +                            folio_maybe_dma_pinned(folio))
> > > > +                                continue;
> > > > +                }
> > > > +
> > > 
> > > I don't get it..
> > > 
> > > The last patch made it so that the NUMA balancing code doesn't change
> > > page_maybe_dma_pinned() pages to PROT_NONE
> > > 
> > > So why doesn't KVM just check if the current and new SPTE are the same
> > > and refrain from invalidating if nothing changed?
> > 
> > Because KVM doesn't have visibility into the current and new PTEs when the zapping
> > occurs.  The contract for invalidate_range_start() requires that KVM drop all
> > references before returning, and so the zapping occurs before change_pte_range()
> > or change_huge_pmd() have done anything.
> > 
> > > Duplicating the checks here seems very frail to me.
> > 
> > Yes, this approach gets a hard NAK from me.  IIUC, folio_maybe_dma_pinned()
> > can yield different results purely based on refcounts, i.e. KVM could skip pages
> > that the primary MMU does not, and thus violate the mmu_notifier contract.  And
> > in general, I am steadfastly against adding any kind of heuristic to KVM's
> > zapping logic.
> > 
> > This really needs to be fixed in the primary MMU and not require any direct
> > involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs
> > to be skipped.
> 
> This likely has the same issue you just described, we don't know if it
> can be skipped until we iterate over the PTEs and by then it is too
> late to invoke the notifier. Maybe some kind of abort and restart
> scheme could work?

Or maybe treat this as a userspace config problem?  Pinning DMA pages in a VM,
having a fair amount of remote memory, *and* expecting NUMA balancing to do anything
useful for that VM seems like a userspace problem.

Actually, does NUMA balancing even support this particular scenario?  I see this
in do_numa_page()

        /* TODO: handle PTE-mapped THP */
        if (PageCompound(page))
                goto out_map;

and then for PG_anon_exclusive

         * ... For now, we only expect it to be
         * set on tail pages for PTE-mapped THP.
         */
        PG_anon_exclusive = PG_mappedtodisk,

which IIUC means zapping these pages to do migrate-on-fault will never succeed.

Can we just tell userspace to mbind() the pinned region to explicitly exclude the
VMA(s) from NUMA balancing?
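For reference, Sean's mbind() suggestion would look something like the userspace sketch
below; the address, length, and node are placeholders, and whether an explicit policy is
sufficient to keep NUMA balancing away from the VMA is exactly the open question in this
thread.

/* Minimal userspace sketch of the mbind() idea; build against libnuma's
 * headers for <numaif.h> (link with -lnuma).  addr/len/node are placeholders. */
#include <numaif.h>
#include <stdio.h>

static int bind_pinned_region(void *addr, unsigned long len, int node)
{
        unsigned long nodemask = 1UL << node;

        /* Give the to-be-pinned region an explicit, static policy so the
         * kernel has no reason to migrate its pages for NUMA balancing. */
        if (mbind(addr, len, MPOL_BIND, &nodemask,
                  8 * sizeof(nodemask) + 1, 0) != 0) {
                perror("mbind");
                return -1;
        }
        return 0;
}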
On Tue, Aug 08, 2023 at 04:56:11PM -0700, Sean Christopherson wrote:
> On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote:
> > > On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > > > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > > > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > > > >                      !is_last_spte(iter.old_spte, iter.level))
> > > > >                          continue;
> > > > > 
> > > > > +                if (skip_pinned) {
> > > > > +                        kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > > > > +                        struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > > > > +                        struct folio *folio;
> > > > > +
> > > > > +                        if (!page)
> > > > > +                                continue;
> > > > > +
> > > > > +                        folio = page_folio(page);
> > > > > +
> > > > > +                        if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > > > > +                            folio_maybe_dma_pinned(folio))
> > > > > +                                continue;
> > > > > +                }
> > > > > +
> > > > 
> > > > I don't get it..
> > > > 
> > > > The last patch made it so that the NUMA balancing code doesn't change
> > > > page_maybe_dma_pinned() pages to PROT_NONE
> > > > 
> > > > So why doesn't KVM just check if the current and new SPTE are the same
> > > > and refrain from invalidating if nothing changed?
> > > 
> > > Because KVM doesn't have visibility into the current and new PTEs when the zapping
> > > occurs.  The contract for invalidate_range_start() requires that KVM drop all
> > > references before returning, and so the zapping occurs before change_pte_range()
> > > or change_huge_pmd() have done anything.
> > > 
> > > > Duplicating the checks here seems very frail to me.
> > > 
> > > Yes, this approach gets a hard NAK from me.  IIUC, folio_maybe_dma_pinned()
> > > can yield different results purely based on refcounts, i.e. KVM could skip pages
> > > that the primary MMU does not, and thus violate the mmu_notifier contract.  And
> > > in general, I am steadfastly against adding any kind of heuristic to KVM's
> > > zapping logic.
> > > 
> > > This really needs to be fixed in the primary MMU and not require any direct
> > > involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs
> > > to be skipped.
> > 
> > This likely has the same issue you just described, we don't know if it
> > can be skipped until we iterate over the PTEs and by then it is too
> > late to invoke the notifier. Maybe some kind of abort and restart
> > scheme could work?
> 
> Or maybe treat this as a userspace config problem?  Pinning DMA pages in a VM,
> having a fair amount of remote memory, *and* expecting NUMA balancing to do anything
> useful for that VM seems like a userspace problem.
> 
> Actually, does NUMA balancing even support this particular scenario?  I see this
> in do_numa_page()
> 
>         /* TODO: handle PTE-mapped THP */
>         if (PageCompound(page))
>                 goto out_map;
Hi Sean,
I think compound pages are handled in do_huge_pmd_numa_page(), and I have
observed NUMA migration of those kinds of pages.

> and then for PG_anon_exclusive
> 
>          * ... For now, we only expect it to be
>          * set on tail pages for PTE-mapped THP.
>          */
>         PG_anon_exclusive = PG_mappedtodisk,
> 
> which IIUC means zapping these pages to do migrate-on-fault will never succeed.
> 
> Can we just tell userspace to mbind() the pinned region to explicitly exclude the
> VMA(s) from NUMA balancing?
For VMs with VFIO mdev mediated devices, the VMAs to be pinned are
dynamic, I think it's hard to mbind() in advance.

Thanks
Yan
On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote:
> On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > >                      !is_last_spte(iter.old_spte, iter.level))
> > >                          continue;
> > > 
> > > +                if (skip_pinned) {
> > > +                        kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > > +                        struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > > +                        struct folio *folio;
> > > +
> > > +                        if (!page)
> > > +                                continue;
> > > +
> > > +                        folio = page_folio(page);
> > > +
> > > +                        if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > > +                            folio_maybe_dma_pinned(folio))
> > > +                                continue;
> > > +                }
> > > +
> > 
> > I don't get it..
> > 
> > The last patch made it so that the NUMA balancing code doesn't change
> > page_maybe_dma_pinned() pages to PROT_NONE
> > 
> > So why doesn't KVM just check if the current and new SPTE are the same
> > and refrain from invalidating if nothing changed?
> 
> Because KVM doesn't have visibility into the current and new PTEs when the zapping
> occurs.  The contract for invalidate_range_start() requires that KVM drop all
> references before returning, and so the zapping occurs before change_pte_range()
> or change_huge_pmd() have done anything.
> 
> > Duplicating the checks here seems very frail to me.
> 
> Yes, this approach gets a hard NAK from me.  IIUC, folio_maybe_dma_pinned()
> can yield different results purely based on refcounts, i.e. KVM could skip pages
Do you mean the different results of folio_maybe_dma_pinned() and
page_maybe_dma_pinned()?

I chose to use folio_maybe_dma_pinned() in KVM on purpose: in KVM's
.invalidate_range_start() handler we may get tail pages of a folio, so it's
better to call the folio version, folio_maybe_dma_pinned().

However, in mm core, i.e. in change_huge_pmd() and change_pte_range(), the
"page" it gets is always the head page of a folio, so although
page_maybe_dma_pinned() is called there, it is equivalent to
folio_maybe_dma_pinned(page_folio(page)).

So, I think the two sides should yield equal results.

On the other hand, if your concern is that the ref count of a page is dynamic,
and that KVM and mm core do not check the ref count of a page atomically, I
think it's still fine.  The .invalidate_range_start() notification with event
type MMU_NOTIFY_PROTECTION_VMA only means the corresponding PTE is protected
in the primary MMU; it does not mean the page is unmapped.

In series [1], we can even see that for users other than KVM, PROT_NONE set in
the primary MMU for NUMA migration purposes is actually ignored and the
underlying PFNs are still accessed.

So, could KVM open a door for maybe-dma-pinned pages, and keep mapping those
pages until (1) an invalidate notification other than MMU_NOTIFY_PROTECTION_VMA
arrives, or (2) an invalidate notification with MMU_NOTIFY_PROTECTION_VMA
arrives again with a reduced page ref count?

[1]: https://lore.kernel.org/all/20230803143208.383663-1-david@redhat.com/

Thanks
Yan

> that the primary MMU does not, and thus violate the mmu_notifier contract.  And
> in general, I am steadfastly against adding any kind of heuristic to KVM's
> zapping logic.
> 
> This really needs to be fixed in the primary MMU and not require any direct
> involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs
> to be skipped.
> 
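For context, folio_maybe_dma_pinned() boils down to roughly the check below (a paraphrase
of include/linux/mm.h circa v6.5, not the literal source; field names may differ between
kernel versions), which is where the refcount-based fuzziness Sean objects to comes from.

#include <linux/mm.h>

/* Rough paraphrase of folio_maybe_dma_pinned(); GUP_PIN_COUNTING_BIAS is 1024. */
static inline bool folio_maybe_dma_pinned_sketch(struct folio *folio)
{
        /* Large folios track FOLL_PIN references in a dedicated counter. */
        if (folio_test_large(folio))
                return atomic_read(&folio->_pincount) > 0;

        /*
         * Small folios only have the refcount to go on: pin_user_pages()
         * adds GUP_PIN_COUNTING_BIAS per pin, so on the order of a thousand
         * transient references can make an unpinned page look "maybe
         * pinned", and the answer can change from one call to the next.
         */
        return folio_ref_count(folio) >= GUP_PIN_COUNTING_BIAS;
}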
On Tue, Aug 08, 2023 at 11:32:37AM -0300, Jason Gunthorpe wrote:
....
> > This really needs to be fixed in the primary MMU and not require any direct
> > involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs
> > to be skipped.
> 
> This likely has the same issue you just described, we don't know if it
> can be skipped until we iterate over the PTEs and by then it is too
> late to invoke the notifier. Maybe some kind of abort and restart
The problem is that KVM currently performs the zap in its
.invalidate_range_start() handler, so by the time mm could abort, KVM would
already have done the zap in the secondary MMU.

Or, could we move the zap on the KVM side to the .invalidate_range_end()
handler, only for MMU_NOTIFY_PROTECTION_VMA with MMU_NOTIFIER_RANGE_NUMA?
Then, on the mm side, we could do the abort and update the range passed to
.invalidate_range_end() so that it contains only the successfully changed
subrange.  Is that acceptable?

> scheme could work?
> 
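A hypothetical sketch of what that deferral could look like on the KVM side is below;
MMU_NOTIFIER_RANGE_NUMA is a flag proposed by this series rather than an upstream
symbol, and example_zap_range() stands in for KVM's existing zap path.

#include <linux/mmu_notifier.h>

/* Stand-in for KVM's existing "zap everything in [start, end)" path. */
void example_zap_range(struct mmu_notifier *mn,
                       unsigned long start, unsigned long end);

static int example_invalidate_range_start(struct mmu_notifier *mn,
                                          const struct mmu_notifier_range *range)
{
        /*
         * Proposed: for NUMA-balancing PROT_NONE conversion, do nothing here
         * and let .invalidate_range_end() -- with the range trimmed by mm to
         * the PTEs that were actually changed -- perform the unmap instead.
         */
        if (range->event == MMU_NOTIFY_PROTECTION_VMA &&
            (range->flags & MMU_NOTIFIER_RANGE_NUMA))   /* proposed flag */
                return 0;

        example_zap_range(mn, range->start, range->end);
        return 0;
}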
On Tue, Aug 08, 2023 at 04:56:11PM -0700, Sean Christopherson wrote:
> and then for PG_anon_exclusive
> 
>          * ... For now, we only expect it to be
>          * set on tail pages for PTE-mapped THP.
>          */
>         PG_anon_exclusive = PG_mappedtodisk,
> 

        /*
         * Depending on the way an anonymous folio can be mapped into a page
         * table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
         * THP), PG_anon_exclusive may be set only for the head page or for
         * tail pages of an anonymous folio. For now, we only expect it to be
         * set on tail pages for PTE-mapped THP.
         */
        PG_anon_exclusive = PG_mappedtodisk,

Not sure why the comment says PG_anon_exclusive is only expected to be set on
tail pages for PTE-mapped THP; what I observed is that only the head page of a
compound page has PG_anon_exclusive set.  And the code path is here:

__handle_mm_fault
 |->create_huge_pmd
    |->do_huge_pmd_anonymous_page  // if (vma_is_anonymous(vmf->vma))
       |->folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, vma, haddr, true);
       |->__do_huge_pmd_anonymous_page(vmf, &folio->page, gfp);
          |->folio_add_new_anon_rmap
             |->__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
                |->SetPageAnonExclusive(page)

And this code path has been present since 6c287605fd56 ("mm: remember
exclusively mapped anonymous pages with PG_anon_exclusive").
On Wed, Aug 09, 2023 at 08:11:17AM +0800, Yan Zhao wrote:
> > Can we just tell userspace to mbind() the pinned region to explicitly exclude the
> > VMA(s) from NUMA balancing?
> For VMs with VFIO mdev mediated devices, the VMAs to be pinned are
> dynamic, I think it's hard to mbind() in advance.

It is hard to view the mediated devices path as a performance path
that deserves this kind of intervention :\

Jason
On Wed, Aug 09, 2023 at 08:59:16AM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 09, 2023 at 08:11:17AM +0800, Yan Zhao wrote:
> 
> > > Can we just tell userspace to mbind() the pinned region to explicitly exclude the
> > > VMA(s) from NUMA balancing?
> 
> > For VMs with VFIO mdev mediated devices, the VMAs to be pinned are
> > dynamic, I think it's hard to mbind() in advance.
> 
> It is hard to view the mediated devices path as a performance path
> that deserves this kind of intervention :\

Though you are right, maybe we can still make it better?

What about introducing a new callback which will be called when a page is
ensured to be PROT_NONE protected for NUMA balancing?

Then, rather than duplicating mm logic in KVM, KVM can depend on this callback
and do the page unmap in the secondary MMU only for pages that are indeed
PROT_NONE protected for NUMA balancing, excluding pages that are obviously
non-NUMA-migratable.

I sent an RFC v2 (commit messages and comments are not well polished) to show
this idea:
https://lore.kernel.org/all/20230810085636.25914-1-yan.y.zhao@intel.com/

Do you think we can continue the work?

Thanks a lot for your review!
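To make the idea concrete, the new hook might take a shape like the sketch below; this
is purely hypothetical, and the RFC linked above may use different names and semantics.

#include <linux/mmu_notifier.h>

/*
 * Hypothetical extra mmu_notifier callback, invoked only after the primary
 * MMU has actually made a PTE range PROT_NONE for NUMA balancing.  A
 * secondary MMU such as KVM could then unmap just that range, instead of
 * zapping everything covered by .invalidate_range_start().
 */
struct example_numa_notifier_ops {
        void (*numa_protect)(struct mmu_notifier *mn,
                             struct mm_struct *mm,
                             unsigned long start,
                             unsigned long end);
};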