[RFC,0/3] Reduce NUMA balance caused TLB-shootdowns in a VM

Message ID 20230808071329.19995-1-yan.y.zhao@intel.com (mailing list archive)
Series Reduce NUMA balance caused TLB-shootdowns in a VM

Message

Yan Zhao Aug. 8, 2023, 7:13 a.m. UTC
This is an RFC series that tries to fix the unnecessary NUMA protection and
TLB shootdowns observed in VMs with assigned devices or VFIO mediated
devices during NUMA balancing.

For VMs with assigned devices or VFIO mediated devices, all or part of
guest memory is pinned long-term.

Auto NUMA balancing periodically selects VMAs of a process and changes
their protection to PROT_NONE, even though some or all pages in the selected
ranges are long-term pinned for DMA, as is the case for VMs with assigned
devices or VFIO mediated devices.

Though this does not cause a correctness problem (NUMA migration ultimately
rejects migration of such pages and restores the PROT_NONE PTEs), it causes
KVM's secondary MMU mappings to be zapped periodically, with identical SPTEs
faulted back in later, wasting CPU cycles and generating unnecessary
TLB shootdowns.

This series first introduces a new flag, MMU_NOTIFIER_RANGE_NUMA, in patch 1
to accompany the mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that
a subscriber of the mmu notifier (e.g. KVM) can tell that an invalidation
event is sent specifically for NUMA migration.
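
Roughly, the idea of patch 1 looks like this (a simplified sketch rather
than the actual hunks; the mmu_notifier_range_init() call is abbreviated and
may not match the exact signature of the base kernel):

  /* include/linux/mmu_notifier.h */
  #define MMU_NOTIFIER_RANGE_BLOCKABLE	(1 << 0)	/* existing */
  #define MMU_NOTIFIER_RANGE_NUMA	(1 << 1)	/* new: NUMA balancing */

  /* mm/mprotect.c, change_pmd_range(): tag NUMA-balancing invalidations */
  mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA,
			  (cp_flags & MM_CP_PROT_NUMA) ?
			  MMU_NOTIFIER_RANGE_NUMA : 0,
			  vma->vm_mm, addr, end);
  mmu_notifier_invalidate_range_start(&range);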

Then, with patch 3, when zapping its secondary MMU, KVM can check for
long-term pinned pages and keep its mappings to them even though they are
PROT_NONE-mapped in the primary MMU.

Patch 2 skips setting PROT_NONE on long-term pinned pages in the primary
MMU, to avoid the page faults introduced by NUMA protection and the
restoration of the old huge PMDs/PTEs in the primary MMU. Because
change_pmd_range() sends .invalidate_range_start() before walking down and
checking which pages to skip, patches 1 and 3 are still required for KVM.
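
The core of patch 2 is a check like the following in the prot_numa path of
change_pte_range() (an illustrative sketch only; the surrounding checks are
reproduced from memory, and the real hunks also cover the huge PMD path in
mm/huge_memory.c):

  /* mm/mprotect.c, change_pte_range(), prot_numa path -- sketch */
  if (prot_numa) {
	struct page *page = vm_normal_page(vma, addr, oldpte);

	/* existing checks: pages NUMA balancing cannot migrate anyway */
	if (!page || PageKsm(page))
		continue;

	/*
	 * New check: a maybe-dma-pinned page cannot be migrated, so
	 * making it PROT_NONE only produces hinting faults and TLB
	 * shootdowns for nothing.
	 */
	if (page_maybe_dma_pinned(page))
		continue;

	/* ...otherwise fall through and make the PTE PROT_NONE... */
  }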

In my test environment, with this series, during boot-up of a VM with
assigned devices:
the TLB shootdown count in KVM caused by .invalidate_range_start() sent for
NUMA balancing in change_pmd_range() is reduced from 9000+ on average to 0.

Yan Zhao (3):
  mm/mmu_notifier: introduce a new mmu notifier flag
    MMU_NOTIFIER_RANGE_NUMA
  mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate
    purpose
  KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration

 arch/x86/kvm/mmu/mmu.c       |  4 ++--
 arch/x86/kvm/mmu/tdp_mmu.c   | 26 ++++++++++++++++++++++----
 arch/x86/kvm/mmu/tdp_mmu.h   |  4 ++--
 include/linux/kvm_host.h     |  1 +
 include/linux/mmu_notifier.h |  1 +
 mm/huge_memory.c             |  5 +++++
 mm/mprotect.c                |  9 ++++++++-
 virt/kvm/kvm_main.c          |  5 +++++
 8 files changed, 46 insertions(+), 9 deletions(-)

base-commit: fdf0eaf11452d72945af31804e2a1048ee1b574c

Comments

Sean Christopherson Aug. 8, 2023, 2:26 p.m. UTC | #1
On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> >  		    !is_last_spte(iter.old_spte, iter.level))
> >  			continue;
> >  
> > +		if (skip_pinned) {
> > +			kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > +			struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > +			struct folio *folio;
> > +
> > +			if (!page)
> > +				continue;
> > +
> > +			folio = page_folio(page);
> > +
> > +			if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > +			    folio_maybe_dma_pinned(folio))
> > +				continue;
> > +		}
> > +
> 
> I don't get it..
> 
> The last patch made it so that the NUMA balancing code doesn't change
> page_maybe_dma_pinned() pages to PROT_NONE
> 
> So why doesn't KVM just check if the current and new SPTE are the same
> and refrain from invalidating if nothing changed?

Because KVM doesn't have visibility into the current and new PTEs when the zapping
occurs.  The contract for invalidate_range_start() requires that KVM drop all
references before returning, and so the zapping occurs before change_pte_range()
or change_huge_pmd() have done anything.
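
To illustrate the ordering (simplified, from memory):

  change_pmd_range()
    mmu_notifier_invalidate_range_start(&range)  <-- KVM must zap here, before
                                                     any PTE/PMD is examined
    change_huge_pmd() / change_pte_range()       <-- only here does the primary
                                                     MMU decide which entries
                                                     actually become PROT_NONE
    mmu_notifier_invalidate_range_end(&range)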

> Duplicating the checks here seems very frail to me.

Yes, this approach gets a hard NAK from me.  IIUC, folio_maybe_dma_pinned()
can yield different results purely based on refcounts, i.e. KVM could skip pages
that the primary MMU does not, and thus violate the mmu_notifier contract.  And
in general, I am steadfastly against adding any kind of heuristic to KVM's
zapping logic.

This really needs to be fixed in the primary MMU and not require any direct
involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs
to be skipped.
Sean Christopherson Aug. 8, 2023, 11:56 p.m. UTC | #2
On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote:
> > On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > > >  		    !is_last_spte(iter.old_spte, iter.level))
> > > >  			continue;
> > > >  
> > > > +		if (skip_pinned) {
> > > > +			kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > > > +			struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > > > +			struct folio *folio;
> > > > +
> > > > +			if (!page)
> > > > +				continue;
> > > > +
> > > > +			folio = page_folio(page);
> > > > +
> > > > +			if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > > > +			    folio_maybe_dma_pinned(folio))
> > > > +				continue;
> > > > +		}
> > > > +
> > > 
> > > I don't get it..
> > > 
> > > The last patch made it so that the NUMA balancing code doesn't change
> > > page_maybe_dma_pinned() pages to PROT_NONE
> > > 
> > > So why doesn't KVM just check if the current and new SPTE are the same
> > > and refrain from invalidating if nothing changed?
> > 
> > Because KVM doesn't have visibility into the current and new PTEs when the zapping
> > occurs.  The contract for invalidate_range_start() requires that KVM drop all
> > references before returning, and so the zapping occurs before change_pte_range()
> > or change_huge_pmd() have done anything.
> > 
> > > Duplicating the checks here seems very frail to me.
> > 
> > Yes, this approach gets a hard NAK from me.  IIUC, folio_maybe_dma_pinned()
> > can yield different results purely based on refcounts, i.e. KVM could skip pages
> > that the primary MMU does not, and thus violate the mmu_notifier contract.  And
> > in general, I am steadfastly against adding any kind of heuristic to KVM's
> > zapping logic.
> > 
> > This really needs to be fixed in the primary MMU and not require any direct
> > involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs
> > to be skipped.
> 
> This likely has the same issue you just described, we don't know if it
> can be skipped until we iterate over the PTEs and by then it is too
> late to invoke the notifier. Maybe some kind of abort and restart
> scheme could work?

Or maybe treat this as a userspace config problem?  Pinning DMA pages in a VM,
having a fair amount of remote memory, *and* expecting NUMA balancing to do anything
useful for that VM seems like a userspace problem.

Actually, does NUMA balancing even support this particular scenario?  I see this
in do_numa_page()

	/* TODO: handle PTE-mapped THP */
	if (PageCompound(page))
		goto out_map;

and then for PG_anon_exclusive

	 * ... For now, we only expect it to be
	 * set on tail pages for PTE-mapped THP.
	 */
	PG_anon_exclusive = PG_mappedtodisk,

which IIUC means zapping these pages to do migrate-on-fault will never succeed.

Can we just tell userspace to mbind() the pinned region to explicitly exclude the
VMA(s) from NUMA balancing?
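
E.g. something like this from the VMM before pinning (untested sketch;
assumes the backing range and its home node are known up front, and that an
explicit policy is enough to make NUMA balancing skip the VMA):

  #include <numaif.h>
  #include <stdio.h>

  /* hypothetical: addr/len cover the guest memory that will be pinned */
  unsigned long nodemask = 1UL << home_node;

  if (mbind(addr, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0))
	perror("mbind");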
Yan Zhao Aug. 9, 2023, 12:11 a.m. UTC | #3
On Tue, Aug 08, 2023 at 04:56:11PM -0700, Sean Christopherson wrote:
> On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote:
> > > On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > > > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > > > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > > > >  		    !is_last_spte(iter.old_spte, iter.level))
> > > > >  			continue;
> > > > >  
> > > > > +		if (skip_pinned) {
> > > > > +			kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > > > > +			struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > > > > +			struct folio *folio;
> > > > > +
> > > > > +			if (!page)
> > > > > +				continue;
> > > > > +
> > > > > +			folio = page_folio(page);
> > > > > +
> > > > > +			if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > > > > +			    folio_maybe_dma_pinned(folio))
> > > > > +				continue;
> > > > > +		}
> > > > > +
> > > > 
> > > > I don't get it..
> > > > 
> > > > The last patch made it so that the NUMA balancing code doesn't change
> > > > page_maybe_dma_pinned() pages to PROT_NONE
> > > > 
> > > > So why doesn't KVM just check if the current and new SPTE are the same
> > > > and refrain from invalidating if nothing changed?
> > > 
> > > Because KVM doesn't have visibility into the current and new PTEs when the zapping
> > > occurs.  The contract for invalidate_range_start() requires that KVM drop all
> > > references before returning, and so the zapping occurs before change_pte_range()
> > > or change_huge_pmd() have done anything.
> > > 
> > > > Duplicating the checks here seems very frail to me.
> > > 
> > > Yes, this approach gets a hard NAK from me.  IIUC, folio_maybe_dma_pinned()
> > > can yield different results purely based on refcounts, i.e. KVM could skip pages
> > > that the primary MMU does not, and thus violate the mmu_notifier contract.  And
> > > in general, I am steadfastly against adding any kind of heuristic to KVM's
> > > zapping logic.
> > > 
> > > This really needs to be fixed in the primary MMU and not require any direct
> > > involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs
> > > to be skipped.
> > 
> > This likely has the same issue you just described, we don't know if it
> > can be skipped until we iterate over the PTEs and by then it is too
> > late to invoke the notifier. Maybe some kind of abort and restart
> > scheme could work?
> 
> Or maybe treat this as a userspace config problem?  Pinning DMA pages in a VM,
> having a fair amount of remote memory, *and* expecting NUMA balancing to do anything
> useful for that VM seems like a userspace problem.
> 
> Actually, does NUMA balancing even support this particular scenario?  I see this
> in do_numa_page()
> 
> 	/* TODO: handle PTE-mapped THP */
> 	if (PageCompound(page))
> 		goto out_map;
Hi Sean,
I think compound pages are handled in do_huge_pmd_numa_page(), and I did
observe NUMA migration of those kinds of pages.


> and then for PG_anon_exclusive
> 
> 	 * ... For now, we only expect it to be
> 	 * set on tail pages for PTE-mapped THP.
> 	 */
> 	PG_anon_exclusive = PG_mappedtodisk,
> 
> which IIUC means zapping these pages to do migrate-on-fault will never succeed.
> 
> Can we just tell userspace to mbind() the pinned region to explicitly exclude the
> VMA(s) from NUMA balancing?
For VMs with VFIO mdev mediated devices, the VMAs to be pinned are
dynamic, so I think it's hard to mbind() in advance.

Thanks
Yan
Yan Zhao Aug. 9, 2023, 12:29 a.m. UTC | #4
On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote:
> On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > >  		    !is_last_spte(iter.old_spte, iter.level))
> > >  			continue;
> > >  
> > > +		if (skip_pinned) {
> > > +			kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > > +			struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > > +			struct folio *folio;
> > > +
> > > +			if (!page)
> > > +				continue;
> > > +
> > > +			folio = page_folio(page);
> > > +
> > > +			if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > > +			    folio_maybe_dma_pinned(folio))
> > > +				continue;
> > > +		}
> > > +
> > 
> > I don't get it..
> > 
> > The last patch made it so that the NUMA balancing code doesn't change
> > page_maybe_dma_pinned() pages to PROT_NONE
> > 
> > So why doesn't KVM just check if the current and new SPTE are the same
> > and refrain from invalidating if nothing changed?
> 
> Because KVM doesn't have visibility into the current and new PTEs when the zapping
> occurs.  The contract for invalidate_range_start() requires that KVM drop all
> references before returning, and so the zapping occurs before change_pte_range()
> or change_huge_pmd() have done anything.
> 
> > Duplicating the checks here seems very frail to me.
> 
> Yes, this approach gets a hard NAK from me.  IIUC, folio_maybe_dma_pinned()
> can yield different results purely based on refcounts, i.e. KVM could skip pages
Do you mean the different results of folio_maybe_dma_pinned() and
page_maybe_dma_pinned()?

I chose to use folio_maybe_dma_pinned() in KVM on purpose because in KVM's
.invalidate_range_start() handler we may get tail pages of a folio, so it's
better to call the folio version, folio_maybe_dma_pinned().

However, in the mm core, i.e. in change_huge_pmd() and change_pte_range(),
the "page" obtained is always the head page of a folio, so although
page_maybe_dma_pinned() is called there, it is actually equivalent to
folio_maybe_dma_pinned(page_folio(page)).

So, I think the two sides should yield equal results.
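
(FWIW, as I read current code, page_maybe_dma_pinned() is just a wrapper, so
the two checks should literally be the same.)

  /* include/linux/mm.h, quoted from memory */
  static inline bool page_maybe_dma_pinned(struct page *page)
  {
	return folio_maybe_dma_pinned(page_folio(page));
  }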

On the other hand, if your concern is that the ref count of a page is
dynamic, and that KVM and the mm core do not check a page's ref count
atomically, I think it's still fine.
That is because a .invalidate_range_start() notification with event type
MMU_NOTIFY_PROTECTION_VMA only means the corresponding PTE is protected
in the primary MMU; it does not mean the page is unmapped.

In series [1], we can even see that for processes other than KVM, the
PROT_NONE set in the primary MMU for NUMA migration purposes is actually
ignored and the underlying PFNs are still accessed.

So, could KVM open a door for maybe-dma-pinned pages and keep mapping those
pages until
(1) an invalidate notification other than MMU_NOTIFY_PROTECTION_VMA arrives, or
(2) an invalidate notification with MMU_NOTIFY_PROTECTION_VMA arrives again
    with a reduced page ref count?

[1]: https://lore.kernel.org/all/20230803143208.383663-1-david@redhat.com/

Thanks
Yan

> that the primary MMU does not, and thus violate the mmu_notifier contract.  And
> in general, I am steadfastly against adding any kind of heuristic to KVM's
> zapping logic.
> 
> This really needs to be fixed in the primary MMU and not require any direct
> involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs
> to be skipped.
>
Yan Zhao Aug. 9, 2023, 2:58 a.m. UTC | #5
On Tue, Aug 08, 2023 at 11:32:37AM -0300, Jason Gunthorpe wrote:
.... 
> > This really needs to be fixed in the primary MMU and not require any direct
> > involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself needs
> > to be skipped.
> 
> This likely has the same issue you just described, we don't know if it
> can be skipped until we iterate over the PTEs and by then it is too
> late to invoke the notifier. Maybe some kind of abort and restart
The problem is that KVM currently performs the zap in its
.invalidate_range_start() handler, so by the time mm could abort, KVM has
already done the zap in the secondary MMU.

Or, could we move the zap on the KVM side to the .invalidate_range_end()
handler, only for MMU_NOTIFY_PROTECTION_VMA with MMU_NOTIFIER_RANGE_NUMA?

Then, on the mm side, we could do the abort and update the range passed to
.invalidate_range_end() to contain only the successfully changed subrange.

Is that acceptable?

> scheme could work?
>
Yan Zhao Aug. 9, 2023, 5:06 a.m. UTC | #6
On Tue, Aug 08, 2023 at 04:56:11PM -0700, Sean Christopherson wrote:
 
> and then for PG_anon_exclusive
> 
> 	 * ... For now, we only expect it to be
> 	 * set on tail pages for PTE-mapped THP.
> 	 */
> 	PG_anon_exclusive = PG_mappedtodisk,
> 

	/*
         * Depending on the way an anonymous folio can be mapped into a page
         * table (e.g., single PMD/PUD/CONT of the head page vs. PTE-mapped
         * THP), PG_anon_exclusive may be set only for the head page or for
         * tail pages of an anonymous folio. For now, we only expect it to be
         * set on tail pages for PTE-mapped THP.
         */
        PG_anon_exclusive = PG_mappedtodisk,

Not sure why the comment says PG_anon_exclusive is only expected to be set
on tail pages for PTE-mapped THP; what I observed is that only the head page
of a compound page has anon_exclusive set.

And the code path is here:
__handle_mm_fault
  |->create_huge_pmd
     |->do_huge_pmd_anonymous_page //if (vma_is_anonymous(vmf->vma)
     	|->folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, vma, haddr, true);
        |->__do_huge_pmd_anonymous_page(vmf, &folio->page, gfp);
           |->folio_add_new_anon_rmap
              |->__page_set_anon_rmap(folio, &folio->page, vma, address, 1);
	         |->SetPageAnonExclusive(page)

And this code path has been present since
6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
Jason Gunthorpe Aug. 9, 2023, 11:59 a.m. UTC | #7
On Wed, Aug 09, 2023 at 08:11:17AM +0800, Yan Zhao wrote:

> > Can we just tell userspace to mbind() the pinned region to explicitly exclude the
> > VMA(s) from NUMA balancing?

> For VMs with VFIO mdev mediated devices, the VMAs to be pinned are
> dynamic, so I think it's hard to mbind() in advance.

It is hard to view the mediated devices path as a performance path
that deserves this kind of intervention :\

Jason
Yan Zhao Aug. 10, 2023, 9:08 a.m. UTC | #8
On Wed, Aug 09, 2023 at 08:59:16AM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 09, 2023 at 08:11:17AM +0800, Yan Zhao wrote:
> 
> > > Can we just tell userspace to mbind() the pinned region to explicitly exclude the
> > > VMA(s) from NUMA balancing?
> 
> > For VMs with VFIO mdev mediated devices, the VMAs to be pinned are
> > dynamic, so I think it's hard to mbind() in advance.
> 
> It is hard to view the mediated devices path as a performance path
> that deserves this kind of intervention :\

Though you are right, maybe we can still make it better?

What about introducing a new callback that is invoked once a page has
actually been PROT_NONE-protected for NUMA balancing?

Then, rather than duplicating mm logic in KVM, KVM could rely on this
callback and unmap a page in the secondary MMU only when it is indeed
PROT_NONE-protected for NUMA balancing, excluding pages that are obviously
not NUMA-migratable.
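
Roughly the shape I have in mind (the callback name and arguments below are
placeholders only, not necessarily what ended up in the v2 patch):

  /* include/linux/mmu_notifier.h -- sketch */
  struct mmu_notifier_ops {
	...
	/*
	 * Called after the primary MMU has actually made a PTE/PMD
	 * PROT_NONE for NUMA balancing, i.e. after the maybe-dma-pinned
	 * and other skip checks, so a secondary MMU only unmaps what
	 * really changed.
	 */
	void (*numa_protect)(struct mmu_notifier *subscription,
			     struct mm_struct *mm,
			     unsigned long start,
			     unsigned long end);
  };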

I sent an RFC v2 (commit messages and comments are not well polished) to
show this idea,
https://lore.kernel.org/all/20230810085636.25914-1-yan.y.zhao@intel.com/ 

Do you think we can continue the work?

Thanks a lot for your review!