diff mbox series

[RFC,v2,5/5] KVM: Unmap pages only when it's indeed protected for NUMA migration

Message ID 20230810090218.26244-1-yan.y.zhao@intel.com (mailing list archive)
State New, archived
Headers show
Series Reduce NUMA balance caused TLB-shootdowns in a VM | expand

Commit Message

Yan Zhao Aug. 10, 2023, 9:02 a.m. UTC
Register a .numa_protect() callback in the mmu notifier so that KVM can get
accurate information about when a page is made PROT_NONE in the primary
MMU and unmap it in the secondary MMU accordingly.

In KVM's .invalidate_range_start() handler, if the event notifies that
the range may be set to PROT_NONE for NUMA migration purposes,
don't do the unmapping in the secondary MMU. Hold off until .numa_protect()
comes.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 virt/kvm/kvm_main.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

Comments

bibo mao Aug. 10, 2023, 1:16 p.m. UTC | #1
在 2023/8/10 17:02, Yan Zhao 写道:
> Register to .numa_protect() callback in mmu notifier so that KVM can get
> acurate information about when a page is PROT_NONE protected in primary
> MMU and unmap it in secondary MMU accordingly.
> 
> In KVM's .invalidate_range_start() handler, if the event is to notify that
> the range may be protected to PROT_NONE for NUMA migration purpose,
> don't do the unmapping in secondary MMU. Hold on until.numa_protect()
> comes.
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  virt/kvm/kvm_main.c | 25 ++++++++++++++++++++++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index dfbaafbe3a00..907444a1761b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -711,6 +711,20 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_change_spte_gfn);
>  }
>  
> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> +					  struct mm_struct *mm,
> +					  unsigned long start,
> +					  unsigned long end)
> +{
> +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> +
> +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> +	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> +		return;
> +
> +	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> +}
NUMA balancing will scan a wide memory range, and previously there was a
single IPI notification via kvm_flush_remote_tlbs(). With page-level
notification, it may generate lots of remote TLB flush IPI notifications.

Also, for NUMA balancing notifications, the VM's PMD tables may not need to
be freed in kvm_unmap_gfn_range().

Regards
Bibo Mao
> +
>  void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
>  			      unsigned long end)
>  {
> @@ -744,14 +758,18 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  					const struct mmu_notifier_range *range)
>  {
>  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> +	bool is_numa = (range->event == MMU_NOTIFY_PROTECTION_VMA) &&
> +		       (range->flags & MMU_NOTIFIER_RANGE_NUMA);
>  	const struct kvm_hva_range hva_range = {
>  		.start		= range->start,
>  		.end		= range->end,
>  		.pte		= __pte(0),
> -		.handler	= kvm_unmap_gfn_range,
> +		.handler	= !is_numa ? kvm_unmap_gfn_range :
> +				  (void *)kvm_null_fn,
>  		.on_lock	= kvm_mmu_invalidate_begin,
> -		.on_unlock	= kvm_arch_guest_memory_reclaimed,
> -		.flush_on_ret	= true,
> +		.on_unlock	= !is_numa ? kvm_arch_guest_memory_reclaimed :
> +				  (void *)kvm_null_fn,
> +		.flush_on_ret	= !is_numa ? true : false,
>  		.may_block	= mmu_notifier_range_blockable(range),
>  	};
>  
> @@ -899,6 +917,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
>  	.clear_young		= kvm_mmu_notifier_clear_young,
>  	.test_young		= kvm_mmu_notifier_test_young,
>  	.change_pte		= kvm_mmu_notifier_change_pte,
> +	.numa_protect		= kvm_mmu_notifier_numa_protect,
>  	.release		= kvm_mmu_notifier_release,
>  };
>
Yan Zhao Aug. 11, 2023, 3:45 a.m. UTC | #2
> > +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> > +					  struct mm_struct *mm,
> > +					  unsigned long start,
> > +					  unsigned long end)
> > +{
> > +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> > +
> > +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> > +	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> > +		return;
> > +
> > +	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> > +}
> numa balance will scan wide memory range, and there will be one time
Though the scanned memory range is wide, .invalidate_range_start() is sent
for each 2M range.

> ipi notification with kvm_flush_remote_tlbs. With page level notification,
> it may bring out lots of flush remote tlb ipi notification.

Hmm, for VMs with assigned devices, apparently, the flush remote tlb IPIs
will be reduced to 0 with this series.

For VMs without assigned devices or mdev devices, I was previously also
worried that there might be more IPIs.
But with the current test data, there are no more remote TLB IPIs on average.

The reason is below:

Before this series, kvm_unmap_gfn_range() is called once for a 2M range.
After this series, kvm_unmap_gfn_range() is called once if the 2M range is
mapped to a huge page in the primary MMU, and called at most 512 times
if mapped to 4K pages in the primary MMU.


Though kvm_unmap_gfn_range() is only called once before this series,
as the range is blockable, when there is contention, remote TLB IPIs
can be sent page by page at 4K granularity (in tdp_mmu_iter_cond_resched())
if the pages are mapped as 4K in the secondary MMU.

With this series, on the other hand, .numa_protect() sets the range to be
unblockable. So there could be fewer remote TLB IPIs when a 2M range is
mapped into small PTEs in the secondary MMU.
Besides, .numa_protect() is not sent for all pages in a given 2M range.

Below is my testing data on a VM without assigned devices:
The data is an average of 10 times guest boot-up.
                   
    data           | numa balancing caused  | numa balancing caused    
  on average       | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs() 
-------------------|------------------------|--------------------------
before this series |         35             |     8625                 
after  this series |      10037             |     4610                 

For a single guest bootup,
                   | numa balancing caused  | numa balancing caused    
    best  data     | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs() 
-------------------|------------------------|--------------------------
before this series |         28             |       13                  
after  this series |        406             |      195                  

                   | numa balancing caused  | numa balancing caused    
   worst  data     | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs() 
-------------------|------------------------|--------------------------
before this series |         44             |    43920               
after  this series |      17352             |     8668                 


> 
> however numa balance notification, pmd table of vm maybe needs not be freed
> in kvm_unmap_gfn_range.
>
bibo mao Aug. 11, 2023, 7:40 a.m. UTC | #3
在 2023/8/11 11:45, Yan Zhao 写道:
>>> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
>>> +					  struct mm_struct *mm,
>>> +					  unsigned long start,
>>> +					  unsigned long end)
>>> +{
>>> +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>>> +
>>> +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
>>> +	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
>>> +		return;
>>> +
>>> +	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
>>> +}
>> numa balance will scan wide memory range, and there will be one time
> Though scanning memory range is wide, .invalidate_range_start() is sent
> for each 2M range.
Yes, the range is huge-page sized when changing NUMA protection during NUMA scanning.

> 
>> ipi notification with kvm_flush_remote_tlbs. With page level notification,
>> it may bring out lots of flush remote tlb ipi notification.
> 
> Hmm, for VMs with assigned devices, apparently, the flush remote tlb IPIs
> will be reduced to 0 with this series.
> 
> For VMs without assigned devices or mdev devices, I was previously also
> worried about that there might be more IPIs.
> But with current test data, there's no more remote tlb IPIs on average.
> 
> The reason is below:
> 
> Before this series, kvm_unmap_gfn_range() is called for once for a 2M
> range.
> After this series, kvm_unmap_gfn_range() is called for once if the 2M is
> mapped to a huge page in primary MMU, and called for at most 512 times
> if mapped to 4K pages in primary MMU.
> 
> 
> Though kvm_unmap_gfn_range() is only called once before this series,
> as the range is blockable, when there're contentions, remote tlb IPIs
> can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched())
I do not know much about x86. Does this always happen, or only when the code
needs to reschedule, so that there can be many TLB IPIs within a single call
to kvm_unmap_gfn_range()?

> if the pages are mapped in 4K in secondary MMU.
> 
> With this series, on the other hand, .numa_protect() sets range to be
> unblockable. So there could be less remote tlb IPIs when a 2M range is
> mapped into small PTEs in secondary MMU.
> Besides, .numa_protect() is not sent for all pages in a given 2M range.
No, .numa_protect() is not sent for all pages. It depends on the workload:
whether a page is accessed by CPU threads across nodes.

> 
> Below is my testing data on a VM without assigned devices:
> The data is an average of 10 times guest boot-up.
>                    
>     data           | numa balancing caused  | numa balancing caused    
>   on average       | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs() 
> -------------------|------------------------|--------------------------
> before this series |         35             |     8625                 
> after  this series |      10037             |     4610   
Just to be cautious: before the series there are 8625/35 = 246 TLB flush IPI
operations during a single kvm_unmap_gfn_range(). Is that x86-specific or generic?

By the way, are the primary MMU and secondary MMU both using 4K small pages "on average"?

Regards
Bibo Mao
              
> 
> For a single guest bootup,
>                    | numa balancing caused  | numa balancing caused    
>     best  data     | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs() 
> -------------------|------------------------|--------------------------
> before this series |         28             |       13                  
> after  this series |        406             |      195                  
> 
>                    | numa balancing caused  | numa balancing caused    
>    worst  data     | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs() 
> -------------------|------------------------|--------------------------
> before this series |         44             |    43920               
> after  this series |      17352             |     8668                 

> 
> 
>>
>> however numa balance notification, pmd table of vm maybe needs not be freed
>> in kvm_unmap_gfn_range.
>>
>
Yan Zhao Aug. 11, 2023, 8:01 a.m. UTC | #4
On Fri, Aug 11, 2023 at 03:40:44PM +0800, bibo mao wrote:
> 
> 
> 在 2023/8/11 11:45, Yan Zhao 写道:
> >>> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> >>> +					  struct mm_struct *mm,
> >>> +					  unsigned long start,
> >>> +					  unsigned long end)
> >>> +{
> >>> +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> >>> +
> >>> +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> >>> +	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> >>> +		return;
> >>> +
> >>> +	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> >>> +}
> >> numa balance will scan wide memory range, and there will be one time
> > Though scanning memory range is wide, .invalidate_range_start() is sent
> > for each 2M range.
> yes, range is huge page size when changing numa protection during numa scanning.
> 
> > 
> >> ipi notification with kvm_flush_remote_tlbs. With page level notification,
> >> it may bring out lots of flush remote tlb ipi notification.
> > 
> > Hmm, for VMs with assigned devices, apparently, the flush remote tlb IPIs
> > will be reduced to 0 with this series.
> > 
> > For VMs without assigned devices or mdev devices, I was previously also
> > worried about that there might be more IPIs.
> > But with current test data, there's no more remote tlb IPIs on average.
> > 
> > The reason is below:
> > 
> > Before this series, kvm_unmap_gfn_range() is called for once for a 2M
> > range.
> > After this series, kvm_unmap_gfn_range() is called for once if the 2M is
> > mapped to a huge page in primary MMU, and called for at most 512 times
> > if mapped to 4K pages in primary MMU.
> > 
> > 
> > Though kvm_unmap_gfn_range() is only called once before this series,
> > as the range is blockable, when there're contentions, remote tlb IPIs
> > can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched())
> I do not know much about x86, does this happen always or only need reschedule
Ah, sorry, I missed platforms other than x86.
Maybe there will be a big difference on other platforms.

> from code?  so that there will be many times of tlb IPIs in only once function
Only when the MMU lock is contended. But that is not rare, because of
contention with TDP page faults.

> call about kvm_unmap_gfn_range.
> 
> > if the pages are mapped in 4K in secondary MMU.
> > 
> > With this series, on the other hand, .numa_protect() sets range to be
> > unblockable. So there could be less remote tlb IPIs when a 2M range is
> > mapped into small PTEs in secondary MMU.
> > Besides, .numa_protect() is not sent for all pages in a given 2M range.
> No, .numa_protect() is not sent for all pages. It depends on the workload,
> whether the page is accessed for different cpu threads cross-nodes.
The .numa_protect() is called in patch 4 only when PROT_NONE is set on
the page.

> 
> > 
> > Below is my testing data on a VM without assigned devices:
> > The data is an average of 10 times guest boot-up.
> >                    
> >     data           | numa balancing caused  | numa balancing caused    
> >   on average       | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs() 
> > -------------------|------------------------|--------------------------
> > before this series |         35             |     8625                 
> > after  this series |      10037             |     4610   
> just be cautious, before the series there are  8625/35 = 246 IPI tlb flush ops
> during one time kvm_unmap_gfn_range, is that x86 specific or generic? 
Only on x86. Didn't test on other platforms.

> 
> By the way are primary mmu and secondary mmu both 4K small page size "on average"?
No. 4K and 2M combined in both.
Sean Christopherson Aug. 11, 2023, 5:14 p.m. UTC | #5
On Fri, Aug 11, 2023, Yan Zhao wrote:
> On Fri, Aug 11, 2023 at 03:40:44PM +0800, bibo mao wrote:
> > 
> > 在 2023/8/11 11:45, Yan Zhao 写道:
> > >>> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> > >>> +					  struct mm_struct *mm,
> > >>> +					  unsigned long start,
> > >>> +					  unsigned long end)
> > >>> +{
> > >>> +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> > >>> +
> > >>> +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> > >>> +	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> > >>> +		return;
> > >>> +
> > >>> +	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> > >>> +}
> > >> numa balance will scan wide memory range, and there will be one time
> > > Though scanning memory range is wide, .invalidate_range_start() is sent
> > > for each 2M range.
> > yes, range is huge page size when changing numa protection during numa scanning.
> > 
> > > 
> > >> ipi notification with kvm_flush_remote_tlbs. With page level notification,
> > >> it may bring out lots of flush remote tlb ipi notification.
> > > 
> > > Hmm, for VMs with assigned devices, apparently, the flush remote tlb IPIs
> > > will be reduced to 0 with this series.
> > > 
> > > For VMs without assigned devices or mdev devices, I was previously also
> > > worried about that there might be more IPIs.
> > > But with current test data, there's no more remote tlb IPIs on average.
> > > 
> > > The reason is below:
> > > 
> > > Before this series, kvm_unmap_gfn_range() is called for once for a 2M
> > > range.

No, it's potentially called once per 1GiB range.  change_pmd_range() invokes the
mmu_notifier with addr+end, where "end" is the end of the range covered by the
PUD, not the end of the current PMD.  So the worst case scenario would be a
256k increase.  Of course, if you have to migrate an entire 1GiB chunk of memory
then you likely have bigger problems, but still.

> > > After this series, kvm_unmap_gfn_range() is called for once if the 2M is
> > > mapped to a huge page in primary MMU, and called for at most 512 times
> > > if mapped to 4K pages in primary MMU.
> > > 
> > > 
> > > Though kvm_unmap_gfn_range() is only called once before this series,
> > > as the range is blockable, when there're contentions, remote tlb IPIs
> > > can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched())
> > I do not know much about x86, does this happen always or only need reschedule
> Ah, sorry, I missed platforms other than x86.
> Maybe there will be a big difference in other platforms.
> 
> > from code?  so that there will be many times of tlb IPIs in only once function
> Only when MMU lock is contended. But it's not seldom because of the contention in
> TDP page fault.

No?  I don't see how mmu_lock contention would affect the number of calls to 
kvm_flush_remote_tlbs().  If vCPUs are running and not generating faults, i.e.
not trying to access the range in question, then every zap will generate a remote
TLB flush and thus send IPIs to all running vCPUs.

> > call about kvm_unmap_gfn_range.
> > 
> > > if the pages are mapped in 4K in secondary MMU.
> > > 
> > > With this series, on the other hand, .numa_protect() sets range to be
> > > unblockable. So there could be less remote tlb IPIs when a 2M range is
> > > mapped into small PTEs in secondary MMU.
> > > Besides, .numa_protect() is not sent for all pages in a given 2M range.
> > No, .numa_protect() is not sent for all pages. It depends on the workload,
> > whether the page is accessed for different cpu threads cross-nodes.
> The .numa_protect() is called in patch 4 only when PROT_NONE is set to
> the page.

I'm strongly opposed to adding MMU_NOTIFIER_RANGE_NUMA.  It's too much of a one-off,
and losing the batching of invalidations makes me nervous.  As Bibo points out,
the behavior will vary based on the workload, VM configuration, etc.

There's also a *very* subtle change, in that the notification will be sent while
holding the PMD/PTE lock.  Taking KVM's mmu_lock under that is *probably* ok, but
I'm not exactly 100% confident on that.  And the only reason there isn't a more
obvious bug is because kvm_handle_hva_range() sets may_block to false, e.g. KVM
won't yield if there's mmu_lock contention.

However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
I think we can achieve what you want without losing batching, and without changing
secondary MMUs.

Rather than muck with the notification types and add a one-off for NUMA, just
defer the notification until a present PMD/PTE is actually going to be modified.
It's not the prettiest, but other than the locking, I don't think has any chance
of regressing other workloads/configurations.

Note, I'm assuming secondary MMUs aren't allowed to map swap entries...

Compile tested only.

From: Sean Christopherson <seanjc@google.com>
Date: Fri, 11 Aug 2023 10:03:36 -0700
Subject: [PATCH] tmp

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/huge_mm.h |  4 +++-
 include/linux/mm.h      |  3 +++
 mm/huge_memory.c        |  5 ++++-
 mm/mprotect.c           | 47 +++++++++++++++++++++++++++++------------
 4 files changed, 44 insertions(+), 15 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 20284387b841..b88316adaaad 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -7,6 +7,8 @@
 
 #include <linux/fs.h> /* only for vma_is_dax() */
 
+struct mmu_notifier_range;
+
 vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
@@ -38,7 +40,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 		   unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd);
 int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
-		    unsigned long cp_flags);
+		    unsigned long cp_flags, struct mmu_notifier_range *range);
 
 vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
 vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2dd73e4f3d8e..284f61cf9c37 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2478,6 +2478,9 @@ static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma
 	return !!(vma->vm_flags & VM_WRITE);
 
 }
+
+void change_pmd_range_notify_secondary_mmus(unsigned long addr,
+					    struct mmu_notifier_range *range);
 bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte);
 extern long change_protection(struct mmu_gather *tlb,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a71cf686e3b2..47c7712b163e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1808,7 +1808,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
  */
 int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
-		    unsigned long cp_flags)
+		    unsigned long cp_flags, struct mmu_notifier_range *range)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
@@ -1893,6 +1893,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		    !toptier)
 			xchg_page_access_time(page, jiffies_to_msecs(jiffies));
 	}
+
+	change_pmd_range_notify_secondary_mmus(addr, range);
+
 	/*
 	 * In case prot_numa, we are under mmap_read_lock(mm). It's critical
 	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
diff --git a/mm/mprotect.c b/mm/mprotect.c
index d1a809167f49..f5844adbe0cb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -82,7 +82,8 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 
 static long change_pte_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
-		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
+		unsigned long end, pgprot_t newprot, unsigned long cp_flags,
+		struct mmu_notifier_range *range)
 {
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
@@ -164,8 +165,12 @@ static long change_pte_range(struct mmu_gather *tlb,
 				    !toptier)
 					xchg_page_access_time(page,
 						jiffies_to_msecs(jiffies));
+
+
 			}
 
+			change_pmd_range_notify_secondary_mmus(addr, range);
+
 			oldpte = ptep_modify_prot_start(vma, addr, pte);
 			ptent = pte_modify(oldpte, newprot);
 
@@ -355,6 +360,17 @@ pgtable_populate_needed(struct vm_area_struct *vma, unsigned long cp_flags)
 		err;							\
 	})
 
+void change_pmd_range_notify_secondary_mmus(unsigned long addr,
+					    struct mmu_notifier_range *range)
+{
+	if (range->start)
+		return;
+
+	VM_WARN_ON(addr >= range->end);
+	range->start = addr;
+	mmu_notifier_invalidate_range_start_nonblock(range);
+}
+
 static inline long change_pmd_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pud_t *pud, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
@@ -365,7 +381,14 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 	unsigned long nr_huge_updates = 0;
 	struct mmu_notifier_range range;
 
-	range.start = 0;
+	/*
+	 * Defer invalidation of secondary MMUs until a PMD/PTE change is
+	 * imminent, e.g. NUMA balancing in particular can "fail" for certain
+	 * types of mappings.  Initialize range.start to '0' and use it to
+	 * track whether or not the invalidation notification has been set.
+	 */
+	mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
+				vma->vm_mm, 0, end);
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -383,18 +406,16 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		if (pmd_none(*pmd))
 			goto next;
 
-		/* invoke the mmu notifier if the pmd is populated */
-		if (!range.start) {
-			mmu_notifier_range_init(&range,
-				MMU_NOTIFY_PROTECTION_VMA, 0,
-				vma->vm_mm, addr, end);
-			mmu_notifier_invalidate_range_start(&range);
-		}
-
 		_pmd = pmdp_get_lockless(pmd);
 		if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
 			if ((next - addr != HPAGE_PMD_SIZE) ||
 			    pgtable_split_needed(vma, cp_flags)) {
+				/*
+				 * FIXME: __split_huge_pmd() performs its own
+				 * mmu_notifier invalidation prior to clearing
+				 * the PMD, ideally all invalidations for the
+				 * range would be batched.
+				 */
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 				/*
 				 * For file-backed, the pmd could have been
@@ -407,8 +428,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 					break;
 				}
 			} else {
-				ret = change_huge_pmd(tlb, vma, pmd,
-						addr, newprot, cp_flags);
+				ret = change_huge_pmd(tlb, vma, pmd, addr,
+						      newprot, cp_flags, &range);
 				if (ret) {
 					if (ret == HPAGE_PMD_NR) {
 						pages += HPAGE_PMD_NR;
@@ -423,7 +444,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		}
 
 		ret = change_pte_range(tlb, vma, pmd, addr, next, newprot,
-				       cp_flags);
+				       cp_flags, &range);
 		if (ret < 0)
 			goto again;
 		pages += ret;

base-commit: 1f40f634009556c119974cafce4c7b2f9b8c58ad
--
Jason Gunthorpe Aug. 11, 2023, 5:18 p.m. UTC | #6
> However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
> I think we can achieve what you want without losing batching, and without changing
> secondary MMUs.

No, this is not legal.

> +void change_pmd_range_notify_secondary_mmus(unsigned long addr,
> +					    struct mmu_notifier_range *range)
> +{
> +	if (range->start)
> +		return;
> +
> +	VM_WARN_ON(addr >= range->end);
> +	range->start = addr;
> +	mmu_notifier_invalidate_range_start_nonblock(range);
> +}

The 'nonblock' entry point is advisory; if it fails, the caller must
not change the mm.

Jason
Yan Zhao Aug. 14, 2023, 6:52 a.m. UTC | #7
On Fri, Aug 11, 2023 at 10:14:45AM -0700, Sean Christopherson wrote:
> On Fri, Aug 11, 2023, Yan Zhao wrote:
> > On Fri, Aug 11, 2023 at 03:40:44PM +0800, bibo mao wrote:
> > > 
> > > 在 2023/8/11 11:45, Yan Zhao 写道:
> > > >>> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> > > >>> +					  struct mm_struct *mm,
> > > >>> +					  unsigned long start,
> > > >>> +					  unsigned long end)
> > > >>> +{
> > > >>> +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> > > >>> +
> > > >>> +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> > > >>> +	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> > > >>> +		return;
> > > >>> +
> > > >>> +	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> > > >>> +}
> > > >> numa balance will scan wide memory range, and there will be one time
> > > > Though scanning memory range is wide, .invalidate_range_start() is sent
> > > > for each 2M range.
> > > yes, range is huge page size when changing numa protection during numa scanning.
> > > 
> > > > 
> > > >> ipi notification with kvm_flush_remote_tlbs. With page level notification,
> > > >> it may bring out lots of flush remote tlb ipi notification.
> > > > 
> > > > Hmm, for VMs with assigned devices, apparently, the flush remote tlb IPIs
> > > > will be reduced to 0 with this series.
> > > > 
> > > > For VMs without assigned devices or mdev devices, I was previously also
> > > > worried about that there might be more IPIs.
> > > > But with current test data, there's no more remote tlb IPIs on average.
> > > > 
> > > > The reason is below:
> > > > 
> > > > Before this series, kvm_unmap_gfn_range() is called for once for a 2M
> > > > range.
> 
> No, it's potentially called once per 1GiB range.  change_pmd_range() invokes the
> mmu_notifier with addr+end, where "end" is the end of the range covered by the
> PUD, not the end of the current PMD.  So the worst case scenario would be a
> 256k increase.  Of course, if you have to migrate an entire 1GiB chunk of memory
> then you likely have bigger problems, but still.
Yes, thanks for pointing it out.
I realized it after re-reading the code and re-checking the log message.
This wider range also explains the collected worst-case data:
44 kvm_unmap_gfn_range() calls caused 43920 kvm_flush_remote_tlbs() requests, i.e.
998 remote TLB flush requests per kvm_unmap_gfn_range().

> 
> > > > After this series, kvm_unmap_gfn_range() is called for once if the 2M is
> > > > mapped to a huge page in primary MMU, and called for at most 512 times
> > > > if mapped to 4K pages in primary MMU.
> > > > 
> > > > 
> > > > Though kvm_unmap_gfn_range() is only called once before this series,
> > > > as the range is blockable, when there're contentions, remote tlb IPIs
> > > > can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched())
> > > I do not know much about x86, does this happen always or only need reschedule
> > Ah, sorry, I missed platforms other than x86.
> > Maybe there will be a big difference in other platforms.
> > 
> > > from code?  so that there will be many times of tlb IPIs in only once function
> > Only when MMU lock is contended. But it's not seldom because of the contention in
> > TDP page fault.
> 
> No?  I don't see how mmu_lock contention would affect the number of calls to 
> kvm_flush_remote_tlbs().  If vCPUs are running and not generating faults, i.e.
> not trying to access the range in question, then every zap will generate a remote
> TLB flush and thus send IPIs to all running vCPUs.
In tdp_mmu_zap_leafs() for kvm_unmap_gfn_range(), the flow is like this:

1. Check necessity of mmu_lock reschedule
1.a -- if yes,
       1.a.1 do kvm_flush_remote_tlbs() if flush is true.
       1.a.2 reschedule of mmu_lock
       1.a.3 goto 2.
1.b -- if no, goto 2
2. Zap present leaf entry, and set flush to be true
3. Get next gfn and go to 1

With a wide range passed to .invalidate_range_start()/end(), it's easy for
rwlock_needbreak(&kvm->mmu_lock) to be true (going to 1.a and 1.a.1).
And in tdp_mmu_zap_leafs(), before rescheduling the mmu_lock, a TLB flush
request is performed (1.a.1).


Take a real case for example:
NUMA balancing requests KVM to zap the GFN range from 0xb4000 to 0xb9800.
Then when KVM zaps GFN 0xb807b, it finds the mmu_lock needs a break
because a vCPU is faulting on GFN 0xb804c.
And vCPUs will constantly retry the fault of 0xb804c 3298 times until
.invalidate_range_end().
Then for this zap of the GFN range 0xb4000 - 0xb9800,
vCPUs retry faults of
GFN 0xb804c 3298 times,
GFN 0xb8074 3161 times, and
GFN 0xb81ce 15190 times,
and the accesses of these 3 GFNs cause 3209 calls to kvm_flush_remote_tlbs()
for one kvm_unmap_gfn_range() because flush requests are not batched.
(This range is mapped in both 2M and 4K in the secondary MMU.)

Without contention on the mmu_lock, tdp_mmu_zap_leafs() just combines
all flush requests and leaves a single kvm_flush_remote_tlbs() to be called
after it returns.


In my test scenario, which is VM boot-up, this kind of contention is
frequent.
Here's the data from 10 VM boot-ups, collected previously.
       |      count of       |       count of          |
       | kvm_unmap_gfn_range | kvm_flush_remote_tlbs() |
-------|---------------------|-------------------------|
   1   |         38          |           14            |
   2   |         28          |         5191            |
   3   |         36          |        13398            |
   4   |         44          |        43920            |
   5   |         28          |           14            |
   6   |         36          |        10803            |
   7   |         38          |          892            |
   8   |         32          |         5905            |
   9   |         32          |           13            |
  10   |         38          |         6096            |
-------|---------------------|-------------------------|
average|         35          |         8625            |


I wonder if we could loosen the frequency of checking for rescheduling in
tdp_mmu_iter_cond_resched() if the zap range is wide, e.g.

if (iter->next_last_level_gfn ==
    iter->yielded_gfn + KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
	return false; 

> 
> > > call about kvm_unmap_gfn_range.
> > > 
> > > > if the pages are mapped in 4K in secondary MMU.
> > > > 
> > > > With this series, on the other hand, .numa_protect() sets range to be
> > > > unblockable. So there could be less remote tlb IPIs when a 2M range is
> > > > mapped into small PTEs in secondary MMU.
> > > > Besides, .numa_protect() is not sent for all pages in a given 2M range.
> > > No, .numa_protect() is not sent for all pages. It depends on the workload,
> > > whether the page is accessed for different cpu threads cross-nodes.
> > The .numa_protect() is called in patch 4 only when PROT_NONE is set to
> > the page.
> 
> I'm strongly opposed to adding MMU_NOTIFIER_RANGE_NUMA.  It's too much of a one-off,
> and losing the batching of invalidations makes me nervous.  As Bibo points out,
> the behavior will vary based on the workload, VM configuration, etc.
> 
> There's also a *very* subtle change, in that the notification will be sent while
> holding the PMD/PTE lock.  Taking KVM's mmu_lock under that is *probably* ok, but
> I'm not exactly 100% confident on that.  And the only reason there isn't a more
The MMU lock is an rwlock, which is a variant of spinlock.
So, I think taking it within the PMD/PTE lock is ok.
Actually we have already done that during the .change_pte() notification, where

kvm_mmu_notifier_change_pte() takes KVM mmu_lock for write,
while the PTE lock is held when the set_pte_at_notify() --> .change_pte() handlers are invoked.


> obvious bug is because kvm_handle_hva_range() sets may_block to false, e.g. KVM
> won't yield if there's mmu_lock contention.
Yes, KVM won't yield and reschedule KVM's mmu_lock, so it's fine.

> However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
> I think we can achieve what you want without losing batching, and without changing
> secondary MMUs.
> 
> Rather than muck with the notification types and add a one-off for NUMA, just
> defer the notification until a present PMD/PTE is actually going to be modified.
> It's not the prettiest, but other than the locking, I don't think has any chance
> of regressing other workloads/configurations.
> 
> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
> 
> Compile tested only.

I can't find a matching end for each
mmu_notifier_invalidate_range_start_nonblock().

> 
> From: Sean Christopherson <seanjc@google.com>
> Date: Fri, 11 Aug 2023 10:03:36 -0700
> Subject: [PATCH] tmp
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  include/linux/huge_mm.h |  4 +++-
>  include/linux/mm.h      |  3 +++
>  mm/huge_memory.c        |  5 ++++-
>  mm/mprotect.c           | 47 +++++++++++++++++++++++++++++------------
>  4 files changed, 44 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 20284387b841..b88316adaaad 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -7,6 +7,8 @@
>  
>  #include <linux/fs.h> /* only for vma_is_dax() */
>  
> +struct mmu_notifier_range;
> +
>  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
>  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> @@ -38,7 +40,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>  		   unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd);
>  int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
> -		    unsigned long cp_flags);
> +		    unsigned long cp_flags, struct mmu_notifier_range *range);
>  
>  vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
>  vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2dd73e4f3d8e..284f61cf9c37 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2478,6 +2478,9 @@ static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma
>  	return !!(vma->vm_flags & VM_WRITE);
>  
>  }
> +
> +void change_pmd_range_notify_secondary_mmus(unsigned long addr,
> +					    struct mmu_notifier_range *range);
>  bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  			     pte_t pte);
>  extern long change_protection(struct mmu_gather *tlb,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a71cf686e3b2..47c7712b163e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1808,7 +1808,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>   */
>  int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
> -		    unsigned long cp_flags)
> +		    unsigned long cp_flags, struct mmu_notifier_range *range)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	spinlock_t *ptl;
> @@ -1893,6 +1893,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		    !toptier)
>  			xchg_page_access_time(page, jiffies_to_msecs(jiffies));
>  	}
> +
> +	change_pmd_range_notify_secondary_mmus(addr, range);
> +
>  	/*
>  	 * In case prot_numa, we are under mmap_read_lock(mm). It's critical
>  	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index d1a809167f49..f5844adbe0cb 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -82,7 +82,8 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
>  
>  static long change_pte_range(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
> -		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> +		unsigned long end, pgprot_t newprot, unsigned long cp_flags,
> +		struct mmu_notifier_range *range)
>  {
>  	pte_t *pte, oldpte;
>  	spinlock_t *ptl;
> @@ -164,8 +165,12 @@ static long change_pte_range(struct mmu_gather *tlb,
>  				    !toptier)
>  					xchg_page_access_time(page,
>  						jiffies_to_msecs(jiffies));
> +
> +
>  			}
>  
> +			change_pmd_range_notify_secondary_mmus(addr, range);
> +
>  			oldpte = ptep_modify_prot_start(vma, addr, pte);
>  			ptent = pte_modify(oldpte, newprot);
>  
> @@ -355,6 +360,17 @@ pgtable_populate_needed(struct vm_area_struct *vma, unsigned long cp_flags)
>  		err;							\
>  	})
>  
> +void change_pmd_range_notify_secondary_mmus(unsigned long addr,
> +					    struct mmu_notifier_range *range)
> +{
> +	if (range->start)
> +		return;
> +
> +	VM_WARN_ON(addr >= range->end);
> +	range->start = addr;
> +	mmu_notifier_invalidate_range_start_nonblock(range);
This will cause the range from addr to end to be invalidated, which may
include pinned ranges.

> +}
> +
>  static inline long change_pmd_range(struct mmu_gather *tlb,
>  		struct vm_area_struct *vma, pud_t *pud, unsigned long addr,
>  		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
> @@ -365,7 +381,14 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
>  	unsigned long nr_huge_updates = 0;
>  	struct mmu_notifier_range range;
>  
> -	range.start = 0;
> +	/*
> +	 * Defer invalidation of secondary MMUs until a PMD/PTE change is
> +	 * imminent, e.g. NUMA balancing in particular can "fail" for certain
> +	 * types of mappings.  Initialize range.start to '0' and use it to
> +	 * track whether or not the invalidation notification has been set.
> +	 */
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
> +				vma->vm_mm, 0, end);
>  
>  	pmd = pmd_offset(pud, addr);
>  	do {
> @@ -383,18 +406,16 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
>  		if (pmd_none(*pmd))
>  			goto next;
>  
> -		/* invoke the mmu notifier if the pmd is populated */
> -		if (!range.start) {
> -			mmu_notifier_range_init(&range,
> -				MMU_NOTIFY_PROTECTION_VMA, 0,
> -				vma->vm_mm, addr, end);
> -			mmu_notifier_invalidate_range_start(&range);
> -		}
> -
>  		_pmd = pmdp_get_lockless(pmd);
>  		if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
>  			if ((next - addr != HPAGE_PMD_SIZE) ||
>  			    pgtable_split_needed(vma, cp_flags)) {
> +				/*
> +				 * FIXME: __split_huge_pmd() performs its own
> +				 * mmu_notifier invalidation prior to clearing
> +				 * the PMD, ideally all invalidations for the
> +				 * range would be batched.
> +				 */
>  				__split_huge_pmd(vma, pmd, addr, false, NULL);
>  				/*
>  				 * For file-backed, the pmd could have been
> @@ -407,8 +428,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
>  					break;
>  				}
>  			} else {
> -				ret = change_huge_pmd(tlb, vma, pmd,
> -						addr, newprot, cp_flags);
> +				ret = change_huge_pmd(tlb, vma, pmd, addr,
> +						      newprot, cp_flags, &range);
>  				if (ret) {
>  					if (ret == HPAGE_PMD_NR) {
>  						pages += HPAGE_PMD_NR;
> @@ -423,7 +444,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
>  		}
>  
>  		ret = change_pte_range(tlb, vma, pmd, addr, next, newprot,
> -				       cp_flags);
> +				       cp_flags, &range);
>  		if (ret < 0)
>  			goto again;
>  		pages += ret;
> 
> base-commit: 1f40f634009556c119974cafce4c7b2f9b8c58ad
> -- 
> 
> 
>
Yan Zhao Aug. 14, 2023, 7:44 a.m. UTC | #8
On Mon, Aug 14, 2023 at 02:52:07PM +0800, Yan Zhao wrote:
> I wonder if we could loosen the frequency of checking for rescheduling in
> tdp_mmu_iter_cond_resched() if the zap range is wide, e.g.
> 
> if (iter->next_last_level_gfn ==
>     iter->yielded_gfn + KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
> 	return false;
Correct:

@@ -712,7 +713,8 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
        WARN_ON(iter->yielded);

        /* Ensure forward progress has been made before yielding. */
-       if (iter->next_last_level_gfn == iter->yielded_gfn)
+       if (iter->next_last_level_gfn >= iter->yielded_gfn &&
+          iter->next_last_level_gfn < iter->yielded_gfn + KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
                return false;

        if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {


This can greatly reduce kvm_flush_remote_tlbs() calls within one kvm_unmap_gfn_range() in the KVM x86 TDP MMU.
Sean Christopherson Aug. 14, 2023, 4:40 p.m. UTC | #9
On Mon, Aug 14, 2023, Yan Zhao wrote:
> On Fri, Aug 11, 2023 at 10:14:45AM -0700, Sean Christopherson wrote:
> > > > > After this series, kvm_unmap_gfn_range() is called for once if the 2M is
> > > > > mapped to a huge page in primary MMU, and called for at most 512 times
> > > > > if mapped to 4K pages in primary MMU.
> > > > > 
> > > > > 
> > > > > Though kvm_unmap_gfn_range() is only called once before this series,
> > > > > as the range is blockable, when there're contentions, remote tlb IPIs
> > > > > can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched())
> > > > I do not know much about x86, does this happen always or only need reschedule
> > > Ah, sorry, I missed platforms other than x86.
> > > Maybe there will be a big difference in other platforms.
> > > 
> > > > from code?  so that there will be many times of tlb IPIs in only once function
> > > Only when MMU lock is contended. But it's not seldom because of the contention in
> > > TDP page fault.
> > 
> > No?  I don't see how mmu_lock contention would affect the number of calls to 
> > kvm_flush_remote_tlbs().  If vCPUs are running and not generating faults, i.e.
> > not trying to access the range in question, then every zap will generate a remote
> > TLB flush and thus send IPIs to all running vCPUs.
> In tdp_mmu_zap_leafs() for kvm_unmap_gfn_range(), the flow is like this:
> 
> 1. Check necessity of mmu_lock reschedule

Ah, you're running a preemptible kernel.

> 1.a -- if yes,
>        1.a.1 do kvm_flush_remote_tlbs() if flush is true.
>        1.a.2 reschedule of mmu_lock
>        1.a.3 goto 2.
> 1.b -- if no, goto 2
> 2. Zap present leaf entry, and set flush to be true
> 3. Get next gfn and go to 1
> 
> With a wide range to .invalidate_range_start()/end(), it's easy to find
> rwlock_needbreak(&kvm->mmu_lock) to be true (goes to 1.a and 1.a.1)
> And in tdp_mmu_zap_leafs(), before rescheduling of mmu_lock, tlb flush
> request is performed. (1.a.1)
> 
> 
> Take a real case for example,
> NUMA balancing requests KVM to zap GFN range from 0xb4000 to 0xb9800.
> Then when KVM zaps GFN 0xb807b, it will find that mmu_lock needs a break
> because a vCPU is faulting on GFN 0xb804c.
> And vCPUs will constantly retry faulting 0xb804c, 3298 times, until
> .invalidate_range_end().
> Then for this zap of GFN range from 0xb4000 - 0xb9800,
> vCPUs retry fault of
> GFN 0xb804c for 3298 times,
> GFN 0xb8074 for 3161 times,
> GFN 0xb81ce for 15190 times,
> and the accesses of the 3 GFNs cause 3209 times of kvm_flush_remote_tlbs()
> for one kvm_unmap_gfn_range() because flush requests are not batched.
> (this range is mapped in both 2M and 4K in secondary MMU).
> 
> Without the contending of mmu_lock, tdp_mmu_zap_leafs() just combines
> all flush requests and leaves 1 kvm_flush_remote_tlbs() to be called
> after it returns.
> 
> 
> In my test scenario, which is VM boot-up, this kind of contention is
> frequent.

Hmm, I think your test scenario is a bit of an outlier.  Running a preemptible
kernel on a multi-socket system (presumably your system is multi-socket if you
have NUMA) is a bit odd.

Not that that invalidates the test result, but I would be quite surprised if there
are production use cases for NUMA+KVM+preemptible.  Though maybe I'm underestimating
how much KVM is used on workstations?

> Here's the 10 times data for a VM boot-up collected previously.
>        |      count of       |       count of          |
>        | kvm_unmap_gfn_range | kvm_flush_remote_tlbs() |
> -------|---------------------|-------------------------|
>    1   |         38          |           14            |
>    2   |         28          |         5191            |
>    3   |         36          |        13398            |
>    4   |         44          |        43920            |
>    5   |         28          |           14            |
>    6   |         36          |        10803            |
>    7   |         38          |          892            |
>    8   |         32          |         5905            |
>    9   |         32          |           13            |
>   10   |         38          |         6096            |
> -------|---------------------|-------------------------|
> average|         35          |         8625            |
> 
> 
> I wonder if we could loosen the frequency of checking for rescheduling in
> tdp_mmu_iter_cond_resched() if the zap range is wide, e.g.
> 
> if (iter->next_last_level_gfn ==
>     iter->yielded_gfn + KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
> 	return false;

Hrm, no.  We'd want to repeat that logic for the shadow MMU, and it could regress
other scenarios, e.g. if a vCPU is trying to fault-in an address that isn't part
of the invalidation, then yielding asap is desirable.

Unless I'm missing something, the easiest solution is to check for an invalidation
*before* acquiring mmu_lock.  The check would be prone to false negatives and false
positives, so KVM would need to recheck after acquiring mmu_lock, but the check
itself is super cheap relative to the overall cost of the page fault.

Compile tested patch at the bottom.  I'll test and post formally (probably with
similar treatment for other architectures).

> > > > call about kvm_unmap_gfn_range.
> > > > 
> > > > > if the pages are mapped in 4K in secondary MMU.
> > > > > 
> > > > > With this series, on the other hand, .numa_protect() sets range to be
> > > > > unblockable. So there could be less remote tlb IPIs when a 2M range is
> > > > > mapped into small PTEs in secondary MMU.
> > > > > Besides, .numa_protect() is not sent for all pages in a given 2M range.
> > > > No, .numa_protect() is not sent for all pages. It depends on the workload,
> > > > whether the page is accessed for different cpu threads cross-nodes.
> > > The .numa_protect() is called in patch 4 only when PROT_NONE is set to
> > > the page.
> > 
> > I'm strongly opposed to adding MMU_NOTIFIER_RANGE_NUMA.  It's too much of a one-off,
> > and losing the batching of invalidations makes me nervous.  As Bibo points out,
> > the behavior will vary based on the workload, VM configuration, etc.
> > 
> > There's also a *very* subtle change, in that the notification will be sent while
> > holding the PMD/PTE lock.  Taking KVM's mmu_lock under that is *probably* ok, but
> > I'm not exactly 100% confident on that.  And the only reason there isn't a more
> MMU lock is a rwlock, which is a variant of spinlock.
> So, I think taking it within PMD/PTE lock is ok.
> Actually we have already done that during the .change_pte() notification, where
> 
> kvm_mmu_notifier_change_pte() takes KVM mmu_lock for write,
> while PTE lock is held while sending set_pte_at_notify() --> .change_pte() handlers

.change_pte() gets away with running under PMD/PTE lock because (a) it's not allowed
to fail and (b) KVM is the only secondary MMU that hooks .change_pte() and KVM
doesn't use a sleepable lock.

As Jason pointed out, mmu_notifier_invalidate_range_start_nonblock() can fail
because some secondary MMUs use mutexes or r/w semaphores to handle mmu_notifier
events.  For NUMA balancing, canceling the protection change might be ok, but for
everything else, failure is not an option.  So unfortunately, my idea won't work
as-is.

We might get away with deferring just the change_prot_numa() case, but I don't
think that's worth the mess/complexity.

I would much rather tell userspace to disable migrate-on-fault for KVM guest
mappings (mbind() should work?) for these types of setups, or to disable NUMA
balancing entirely.  If the user really cares about performance of their VM(s),
vCPUs should be affined to a single NUMA node (or core, or pinned 1:1), and if
the VM spans multiple nodes, a virtual NUMA topology should be given to the guest.
At that point, NUMA balancing is likely to do more harm than good.

> > obvious bug is because kvm_handle_hva_range() sets may_block to false, e.g. KVM
> > won't yield if there's mmu_lock contention.
> Yes, KVM won't yield and reschedule of KVM mmu_lock, so it's fine.
> 
> > However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
> > I think we can achieve what you want without losing batching, and without changing
> > secondary MMUs.
> > 
> > Rather than muck with the notification types and add a one-off for NUMA, just
> > defer the notification until a present PMD/PTE is actually going to be modified.
> > It's not the prettiest, but other than the locking, I don't think has any chance
> > of regressing other workloads/configurations.
> > 
> > Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
> > 
> > Compile tested only.
> 
> I don't find a matching end to each
> mmu_notifier_invalidate_range_start_nonblock().

It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():

	if (range.start)
		mmu_notifier_invalidate_range_end(&range);

--
From: Sean Christopherson <seanjc@google.com>
Date: Mon, 14 Aug 2023 08:59:12 -0700
Subject: [PATCH] KVM: x86/mmu: Retry fault before acquiring mmu_lock if
 mapping is changing

Retry page faults without acquiring mmu_lock if the resolved hva is covered
by an active invalidation.  Contending for mmu_lock is especially
problematic on preemptible kernels as the invalidation task will yield the
lock (see rwlock_needbreak()), delay the in-progress invalidation, and
ultimately increase the latency of resolving the page fault.  And in the
worst case scenario, yielding will be accompanied by a remote TLB flush,
e.g. if the invalidation covers a large range of memory and vCPUs are
accessing addresses that were already zapped.

Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
iterators to perform more work before yielding, but that wouldn't solve
the lock contention and would negatively affect scenarios where a vCPU is
trying to fault in an address that is NOT covered by the in-progress
invalidation.

Reported-by: Yan Zhao <yan.y.zhao@intel.com>
Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c   | 3 +++
 include/linux/kvm_host.h | 8 +++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9e4cd8b4a202..f29718a16211 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4345,6 +4345,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	if (unlikely(!fault->slot))
 		return kvm_handle_noslot_fault(vcpu, fault, access);
 
+	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
+		return RET_PF_RETRY;
+
 	return RET_PF_CONTINUE;
 }
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index cb86108c624d..f41d4fe61a06 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1960,7 +1960,6 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
 					   unsigned long mmu_seq,
 					   unsigned long hva)
 {
-	lockdep_assert_held(&kvm->mmu_lock);
 	/*
 	 * If mmu_invalidate_in_progress is non-zero, then the range maintained
 	 * by kvm_mmu_notifier_invalidate_range_start contains all addresses
@@ -1971,6 +1970,13 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
 	    hva >= kvm->mmu_invalidate_range_start &&
 	    hva < kvm->mmu_invalidate_range_end)
 		return 1;
+
+	/*
+	 * Note the lack of a memory barrier!  The caller *must* hold mmu_lock
+	 * to avoid false negatives and/or false positives (less concerning).
+	 * Holding mmu_lock is not mandatory though, e.g. to allow pre-checking
+	 * for an in-progress invalidation to avoid contending mmu_lock.
+	 */
 	if (kvm->mmu_invalidate_seq != mmu_seq)
 		return 1;
 	return 0;

base-commit: 5bc7f472423f95a3f5c73b0b56c47e57d8833efc
--
Yan Zhao Aug. 15, 2023, 1:54 a.m. UTC | #10
On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
> > > I'm strongly opposed to adding MMU_NOTIFIER_RANGE_NUMA.  It's too much of a one-off,
> > > and losing the batching of invalidations makes me nervous.  As Bibo points out,
> > > the behavior will vary based on the workload, VM configuration, etc.
> > > 
> > > There's also a *very* subtle change, in that the notification will be sent while
> > > holding the PMD/PTE lock.  Taking KVM's mmu_lock under that is *probably* ok, but
> > > I'm not exactly 100% confident on that.  And the only reason there isn't a more
> > MMU lock is a rwlock, which is a variant of spinlock.
> > So, I think taking it within PMD/PTE lock is ok.
> > Actually we have already done that during the .change_pte() notification, where
> > 
> > kvm_mmu_notifier_change_pte() takes KVM mmu_lock for write,
> > while PTE lock is held while sending set_pte_at_notify() --> .change_pte() handlers
> 
> .change_pte() gets away with running under PMD/PTE lock because (a) it's not allowed
> to fail and (b) KVM is the only secondary MMU that hooks .change_pte() and KVM
> doesn't use a sleepable lock.
.numa_protect() in patch 4 is also sent when it's not allowed to
fail and there's no sleepable lock in KVM's handler :)


> As Jason pointed out, mmu_notifier_invalidate_range_start_nonblock() can fail
> because some secondary MMUs use mutexes or r/w semaphores to handle mmu_notifier
> events.  For NUMA balancing, canceling the protection change might be ok, but for
> everything else, failure is not an option.  So unfortunately, my idea won't work
> as-is.
> 
> We might get away with deferring just the change_prot_numa() case, but I don't
> think that's worth the mess/complexity.
Yes, I also think deferring mmu_notifier_invalidate_range_start() won't work.
One possible approach is to send successful .numa_protect() in batch.
But I admit it will introduce complexity.

> 
> I would much rather tell userspace to disable migrate-on-fault for KVM guest
> mappings (mbind() should work?) for these types of setups, or to disable NUMA
> balancing entirely.  If the user really cares about performance of their VM(s),
> vCPUs should be affined to a single NUMA node (or core, or pinned 1:1), and if
> the VM spans multiple nodes, a virtual NUMA topology should be given to the guest.
> At that point, NUMA balancing is likely to do more harm than good.
> 
> > > obvious bug is because kvm_handle_hva_range() sets may_block to false, e.g. KVM
> > > won't yield if there's mmu_lock contention.
> > Yes, KVM won't yield and reschedule of KVM mmu_lock, so it's fine.
> > 
> > > However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
> > > I think we can achieve what you want without losing batching, and without changing
> > > secondary MMUs.
> > > 
> > > Rather than muck with the notification types and add a one-off for NUMA, just
> > > defer the notification until a present PMD/PTE is actually going to be modified.
> > > It's not the prettiest, but other than the locking, I don't think has any chance
> > > of regressing other workloads/configurations.
> > > 
> > > Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
> > > 
> > > Compile tested only.
> > 
> > I don't find a matching end to each
> > mmu_notifier_invalidate_range_start_nonblock().
> 
> It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
> 
> 	if (range.start)
> 		mmu_notifier_invalidate_range_end(&range);
No, it doesn't work for mmu_notifier_invalidate_range_start() sent in change_pte_range(),
if we only want the range to include pages successfully set to PROT_NONE.

> --
> From: Sean Christopherson <seanjc@google.com>
> Date: Mon, 14 Aug 2023 08:59:12 -0700
> Subject: [PATCH] KVM: x86/mmu: Retry fault before acquiring mmu_lock if
>  mapping is changing
> 
> Retry page faults without acquiring mmu_lock if the resolved hva is covered
> by an active invalidation.  Contending for mmu_lock is especially
> problematic on preemptible kernels as the invalidation task will yield the
> lock (see rwlock_needbreak()), delay the in-progress invalidation, and
> ultimately increase the latency of resolving the page fault.  And in the
> worst case scenario, yielding will be accompanied by a remote TLB flush,
> e.g. if the invalidation covers a large range of memory and vCPUs are
> accessing addresses that were already zapped.
> 
> Alternatively, the yielding issue could be mitigated by teaching KVM's MMU
> iterators to perform more work before yielding, but that wouldn't solve
> the lock contention and would negatively affect scenarios where a vCPU is
> trying to fault in an address that is NOT covered by the in-progress
> invalidation.
> 
> Reported-by: Yan Zhao <yan.y.zhao@intel.com>
> Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c   | 3 +++
>  include/linux/kvm_host.h | 8 +++++++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 9e4cd8b4a202..f29718a16211 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4345,6 +4345,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  	if (unlikely(!fault->slot))
>  		return kvm_handle_noslot_fault(vcpu, fault, access);
>  
> +	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
> +		return RET_PF_RETRY;
> +
This can effectively reduce the remote flush IPIs a lot!
One nit: maybe rmb() or READ_ONCE() is required for kvm->mmu_invalidate_range_start
and kvm->mmu_invalidate_range_end.
Otherwise, I'm somewhat worried about constant false positives and retries.


>  	return RET_PF_CONTINUE;
>  }
>  
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index cb86108c624d..f41d4fe61a06 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1960,7 +1960,6 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>  					   unsigned long mmu_seq,
>  					   unsigned long hva)
>  {
> -	lockdep_assert_held(&kvm->mmu_lock);
>  	/*
>  	 * If mmu_invalidate_in_progress is non-zero, then the range maintained
>  	 * by kvm_mmu_notifier_invalidate_range_start contains all addresses
> @@ -1971,6 +1970,13 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
>  	    hva >= kvm->mmu_invalidate_range_start &&
>  	    hva < kvm->mmu_invalidate_range_end)
>  		return 1;
> +
> +	/*
> +	 * Note the lack of a memory barrier!  The caller *must* hold mmu_lock
> +	 * to avoid false negatives and/or false positives (less concerning).
> +	 * Holding mmu_lock is not mandatory though, e.g. to allow pre-checking
> +	 * for an in-progress invalidation to avoid contending mmu_lock.
> +	 */
>  	if (kvm->mmu_invalidate_seq != mmu_seq)
>  		return 1;
>  	return 0;
> 
> base-commit: 5bc7f472423f95a3f5c73b0b56c47e57d8833efc
> -- 
>
Sean Christopherson Aug. 15, 2023, 2:50 p.m. UTC | #11
On Tue, Aug 15, 2023, Yan Zhao wrote:
> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
> > > > Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
> > > > 
> > > > Compile tested only.
> > > 
> > > I don't find a matching end to each
> > > mmu_notifier_invalidate_range_start_nonblock().
> > 
> > It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
> > 
> > 	if (range.start)
> > 		mmu_notifier_invalidate_range_end(&range);
> No, It doesn't work for mmu_notifier_invalidate_range_start() sent in change_pte_range(),
> if we only want the range to include pages successfully set to PROT_NONE.

Precise invalidation was a non-goal for my hack-a-patch.  The intent was purely
to defer invalidation until it was actually needed, but still perform only a
single notification so as to batch the TLB flushes, e.g. the start() call still
used the original @end.

The idea was to play nice with the scenario where nothing in a VMA could be migrated.
It was completely untested though, so it may not have actually done anything to reduce
the number of pointless invalidations.

> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 9e4cd8b4a202..f29718a16211 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4345,6 +4345,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> >  	if (unlikely(!fault->slot))
> >  		return kvm_handle_noslot_fault(vcpu, fault, access);
> >  
> > +	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
> > +		return RET_PF_RETRY;
> > +
> This can effectively reduce the remote flush IPIs a lot!
> One Nit is that, maybe rmb() or READ_ONCE() is required for kvm->mmu_invalidate_range_start
> and kvm->mmu_invalidate_range_end.
> Otherwise, I'm somewhat worried about constant false positive and retry.

If anything, this needs a READ_ONCE() on mmu_invalidate_in_progress.  The ranges
aren't touched when mmu_invalidate_in_progress goes to zero, so ensuring they
are reloaded wouldn't do anything.  The key to making forward progress is seeing
that there is no in-progress invalidation.

I did consider adding said READ_ONCE(), but practically speaking, constant false
positives are impossible.  KVM will re-enter the guest when retrying, and there
is zero chance of the compiler avoiding reloads across VM-Enter+VM-Exit.

I suppose in theory we might someday differentiate between "retry because a different
vCPU may have fixed the fault" and "retry because there's an in-progress invalidation",
and not bother re-entering the guest for the latter, e.g. have it try to yield
instead.  

All that said, READ_ONCE() on mmu_invalidate_in_progress should effectively be a
nop, so it wouldn't hurt to be paranoid in this case.

Hmm, at that point, it probably makes sense to add a READ_ONCE() for mmu_invalidate_seq
too, e.g. so that a sufficiently clever compiler doesn't completely optimize away
the check.  Losing the check wouldn't be problematic (false negatives are fine,
especially on that particular check), but the generated code would *look* buggy.
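The invalidation bookkeeping being discussed can be illustrated with a small userspace model (an illustration only: the field and function names echo KVM's mmu_invalidate_seq / mmu_invalidate_in_progress, but none of this is the actual kernel code):

```c
#include <assert.h>

/*
 * Userspace model of KVM's invalidation bookkeeping:
 *  - invalidate_begin() marks an invalidation in progress and records
 *    the hva range being invalidated;
 *  - invalidate_end() bumps the generation count;
 *  - retry_hva() tells a page-fault handler whether its earlier snapshot
 *    of the generation count is stale, or its hva is in an in-flight range.
 */
struct kvm_model {
	unsigned long mmu_invalidate_seq;
	long mmu_invalidate_in_progress;
	unsigned long range_start, range_end;
};

static void invalidate_begin(struct kvm_model *kvm,
			     unsigned long start, unsigned long end)
{
	kvm->mmu_invalidate_in_progress++;
	kvm->range_start = start;
	kvm->range_end = end;
}

static void invalidate_end(struct kvm_model *kvm)
{
	/* Bump the generation so faults that raced with us will retry. */
	kvm->mmu_invalidate_seq++;
	kvm->mmu_invalidate_in_progress--;
}

/* Mirrors the shape of mmu_invalidate_retry_hva(): 1 = retry the fault. */
static int retry_hva(struct kvm_model *kvm, unsigned long mmu_seq,
		     unsigned long hva)
{
	if (kvm->mmu_invalidate_in_progress &&
	    hva >= kvm->range_start && hva < kvm->range_end)
		return 1;
	if (kvm->mmu_invalidate_seq != mmu_seq)
		return 1;
	return 0;
}
```

A fault handler snapshots the sequence count up front and retries either because its hva overlaps an in-flight invalidation or because an invalidation completed (sequence bumped) in between, which is why false positives are transient rather than a constant livelock.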
bibo mao Aug. 16, 2023, 2:43 a.m. UTC | #12
在 2023/8/15 22:50, Sean Christopherson 写道:
> On Tue, Aug 15, 2023, Yan Zhao wrote:
>> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
>>>>> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
>>>>>
>>>>> Compile tested only.
>>>>
>>>> I don't find a matching end to each
>>>> mmu_notifier_invalidate_range_start_nonblock().
>>>
>>> It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
>>>
>>> 	if (range.start)
>>> 		mmu_notifier_invalidate_range_end(&range);
>> No, It doesn't work for mmu_notifier_invalidate_range_start() sent in change_pte_range(),
>> if we only want the range to include pages successfully set to PROT_NONE.
> 
> Precise invalidation was a non-goal for my hack-a-patch.  The intent was purely
> to defer invalidation until it was actually needed, but still perform only a
> single notification so as to batch the TLB flushes, e.g. the start() call still
> used the original @end.
> 
> The idea was to play nice with the scenario where nothing in a VMA could be migrated.
> It was complete untested though, so it may not have actually done anything to reduce
> the number of pointless invalidations.
For the numa-balance scenario, can the original page still be used by the application even if the pte
is changed to PROT_NONE?  If it can be used, maybe we can zap the shadow mmu and flush the tlb
in the mmu_notifier_invalidate_range_end notification with a precise range; that range can
be the intersection of the mmu_gather range and the mmu_notifier_range.

Regards
Bibo Mao
> 
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index 9e4cd8b4a202..f29718a16211 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -4345,6 +4345,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>>>  	if (unlikely(!fault->slot))
>>>  		return kvm_handle_noslot_fault(vcpu, fault, access);
>>>  
>>> +	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
>>> +		return RET_PF_RETRY;
>>> +
>> This can effectively reduce the remote flush IPIs a lot!
>> One Nit is that, maybe rmb() or READ_ONCE() is required for kvm->mmu_invalidate_range_start
>> and kvm->mmu_invalidate_range_end.
>> Otherwise, I'm somewhat worried about constant false positive and retry.
> 
> If anything, this needs a READ_ONCE() on mmu_invalidate_in_progress.  The ranges
> aren't touched when when mmu_invalidate_in_progress goes to zero, so ensuring they
> are reloaded wouldn't do anything.  The key to making forward progress is seeing
> that there is no in-progress invalidation.
> 
> I did consider adding said READ_ONCE(), but practically speaking, constant false
> positives are impossible.  KVM will re-enter the guest when retrying, and there
> is zero chance of the compiler avoiding reloads across VM-Enter+VM-Exit.
> 
> I suppose in theory we might someday differentiate between "retry because a different
> vCPU may have fixed the fault" and "retry because there's an in-progress invalidation",
> and not bother re-entering the guest for the latter, e.g. have it try to yield
> instead.  
> 
> All that said, READ_ONCE() on mmu_invalidate_in_progress should effectively be a
> nop, so it wouldn't hurt to be paranoid in this case.
> 
> Hmm, at that point, it probably makes sense to add a READ_ONCE() for mmu_invalidate_seq
> too, e.g. so that a sufficiently clever compiler doesn't completely optimize away
> the check.  Losing the check wouldn't be problematic (false negatives are fine,
> especially on that particular check), but the generated code would *look* buggy.
bibo mao Aug. 16, 2023, 3:44 a.m. UTC | #13
在 2023/8/16 10:43, bibo mao 写道:
> 
> 
> 在 2023/8/15 22:50, Sean Christopherson 写道:
>> On Tue, Aug 15, 2023, Yan Zhao wrote:
>>> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
>>>>>> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
>>>>>>
>>>>>> Compile tested only.
>>>>>
>>>>> I don't find a matching end to each
>>>>> mmu_notifier_invalidate_range_start_nonblock().
>>>>
>>>> It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
>>>>
>>>> 	if (range.start)
>>>> 		mmu_notifier_invalidate_range_end(&range);
>>> No, It doesn't work for mmu_notifier_invalidate_range_start() sent in change_pte_range(),
>>> if we only want the range to include pages successfully set to PROT_NONE.
>>
>> Precise invalidation was a non-goal for my hack-a-patch.  The intent was purely
>> to defer invalidation until it was actually needed, but still perform only a
>> single notification so as to batch the TLB flushes, e.g. the start() call still
>> used the original @end.
>>
>> The idea was to play nice with the scenario where nothing in a VMA could be migrated.
>> It was complete untested though, so it may not have actually done anything to reduce
>> the number of pointless invalidations.
> For numa-balance scenery, can original page still be used by application even if pte
> is changed with PROT_NONE?  If it can be used, maybe we can zap shadow mmu and flush tlb
Since there is a kvm_mmu_notifier_change_pte notification when a numa page is replaced with a
new page, what I meant is: can the original page still be used by the application after the pte
is changed to PROT_NONE but before it is replaced with the new page?

And for the primary mmu, the tlb is flushed after the pte is changed to PROT_NONE and
after the mmu_notifier_invalidate_range_end notification for the secondary mmu.

Regards
Bibo Mao
> in notification mmu_notifier_invalidate_range_end with precised range, the range can
> be cross-range between range mmu_gather and mmu_notifier_range.
> 
> Regards
> Bibo Mao
>>
>>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>>> index 9e4cd8b4a202..f29718a16211 100644
>>>> --- a/arch/x86/kvm/mmu/mmu.c
>>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>>> @@ -4345,6 +4345,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>>>>  	if (unlikely(!fault->slot))
>>>>  		return kvm_handle_noslot_fault(vcpu, fault, access);
>>>>  
>>>> +	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
>>>> +		return RET_PF_RETRY;
>>>> +
>>> This can effectively reduce the remote flush IPIs a lot!
>>> One Nit is that, maybe rmb() or READ_ONCE() is required for kvm->mmu_invalidate_range_start
>>> and kvm->mmu_invalidate_range_end.
>>> Otherwise, I'm somewhat worried about constant false positive and retry.
>>
>> If anything, this needs a READ_ONCE() on mmu_invalidate_in_progress.  The ranges
>> aren't touched when when mmu_invalidate_in_progress goes to zero, so ensuring they
>> are reloaded wouldn't do anything.  The key to making forward progress is seeing
>> that there is no in-progress invalidation.
>>
>> I did consider adding said READ_ONCE(), but practically speaking, constant false
>> positives are impossible.  KVM will re-enter the guest when retrying, and there
>> is zero chance of the compiler avoiding reloads across VM-Enter+VM-Exit.
>>
>> I suppose in theory we might someday differentiate between "retry because a different
>> vCPU may have fixed the fault" and "retry because there's an in-progress invalidation",
>> and not bother re-entering the guest for the latter, e.g. have it try to yield
>> instead.  
>>
>> All that said, READ_ONCE() on mmu_invalidate_in_progress should effectively be a
>> nop, so it wouldn't hurt to be paranoid in this case.
>>
>> Hmm, at that point, it probably makes sense to add a READ_ONCE() for mmu_invalidate_seq
>> too, e.g. so that a sufficiently clever compiler doesn't completely optimize away
>> the check.  Losing the check wouldn't be problematic (false negatives are fine,
>> especially on that particular check), but the generated code would *look* buggy.
Yan Zhao Aug. 16, 2023, 5:14 a.m. UTC | #14
On Wed, Aug 16, 2023 at 11:44:29AM +0800, bibo mao wrote:
> 
> 
> 在 2023/8/16 10:43, bibo mao 写道:
> > 
> > 
> > 在 2023/8/15 22:50, Sean Christopherson 写道:
> >> On Tue, Aug 15, 2023, Yan Zhao wrote:
> >>> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
> >>>>>> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
> >>>>>>
> >>>>>> Compile tested only.
> >>>>>
> >>>>> I don't find a matching end to each
> >>>>> mmu_notifier_invalidate_range_start_nonblock().
> >>>>
> >>>> It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
> >>>>
> >>>> 	if (range.start)
> >>>> 		mmu_notifier_invalidate_range_end(&range);
> >>> No, It doesn't work for mmu_notifier_invalidate_range_start() sent in change_pte_range(),
> >>> if we only want the range to include pages successfully set to PROT_NONE.
> >>
> >> Precise invalidation was a non-goal for my hack-a-patch.  The intent was purely
> >> to defer invalidation until it was actually needed, but still perform only a
> >> single notification so as to batch the TLB flushes, e.g. the start() call still
> >> used the original @end.
> >>
> >> The idea was to play nice with the scenario where nothing in a VMA could be migrated.
> >> It was complete untested though, so it may not have actually done anything to reduce
> >> the number of pointless invalidations.
> > For numa-balance scenery, can original page still be used by application even if pte
> > is changed with PROT_NONE?  If it can be used, maybe we can zap shadow mmu and flush tlb
For GUPs that do not honor FOLL_HONOR_NUMA_FAULT, yes.

See https://lore.kernel.org/all/20230803143208.383663-1-david@redhat.com/

> Since there is kvm_mmu_notifier_change_pte notification when numa page is replaced with
> new page, my meaning that can original page still be used by application even if pte
> is changed with PROT_NONE and before replaced with new page?
It's not a .change_pte() notification, which is sent on CoW.
The do_numa_page()/do_huge_pmd_numa_page() will try to unmap old page
protected with PROT_NONE, and if every check passes, a separate
.invalidate_range_start()/end() with event type MMU_NOTIFY_CLEAR will be
sent.

So, I think KVM (though it honors FOLL_HONOR_NUMA_FAULT), can safely
keep mapping maybe-dma pages until MMU_NOTIFY_CLEAR is sent.
(this approach is implemented in RFC v1
https://lore.kernel.org/all/20230810085636.25914-1-yan.y.zhao@intel.com/)

> 
> And for primary mmu, tlb is flushed after pte is changed with PROT_NONE and 
> after mmu_notifier_invalidate_range_end notification for secondary mmu.
> Regards
> Bibo Mao

> >> in notification mmu_notifier_invalidate_range_end with precised range, the range can
But I don't think flushing the tlb only in .invalidate_range_end() for the
secondary MMU is a good idea.
The flush must be done before kvm->mmu_lock is unlocked; otherwise,
confusion will be caused when multiple threads try to update the
secondary MMU.

> >> be cross-range between range mmu_gather and mmu_notifier_range.
Yan Zhao Aug. 16, 2023, 7:18 a.m. UTC | #15
On Wed, Aug 16, 2023 at 03:29:22PM +0800, bibo mao wrote:
> > Flush must be done before kvm->mmu_lock is unlocked, otherwise,
> > confusion will be caused when multiple threads trying to update the
> > secondary MMU.
> Since the tlb flush is delayed until after all pte entries are cleared, and currently
> there is no ranged tlb flush supported for secondary mmu, I don't know why there
> is confusion before or after kvm->mmu_lock.

Oh, do you mean only do kvm_unmap_gfn_range() in .invalidate_range_end()?
Then check if PROT_NONE is set in primary MMU before unmap?
Looks like a good idea, I need to check if it's feasible.
Thanks!
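The deferred, precise unmap being considered here, zapping from the secondary MMU only the pages whose primary pte actually became PROT_NONE, can be sketched in userspace as follows (a model only; the arrays stand in for the primary and secondary page tables and are invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8

/* true = the primary MMU set this page to PROT_NONE for NUMA balancing. */
static bool primary_prot_none[NPAGES];
/* true = the secondary MMU (e.g. EPT) still maps this page. */
static bool secondary_mapped[NPAGES];

/*
 * Deferred, precise unmap: called from the end-side notification, it zaps
 * only pages the primary MMU actually protected, leaving the rest mapped
 * and avoiding pointless shootdowns for pages that failed the
 * NUMA-migration checks in change_pte_range().
 */
static int numa_precise_unmap(size_t start, size_t end)
{
	int zapped = 0;

	for (size_t i = start; i < end && i < NPAGES; i++) {
		if (primary_prot_none[i] && secondary_mapped[i]) {
			secondary_mapped[i] = false;
			zapped++;
		}
	}
	/* A real implementation would flush remote TLBs iff zapped != 0. */
	return zapped;
}
```

If the primary MMU skipped some pages (already PROT_NONE, wrong node, pinned, etc.), those stay mapped in the secondary MMU and no shootdown is issued on their behalf.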
bibo mao Aug. 16, 2023, 7:29 a.m. UTC | #16
在 2023/8/16 13:14, Yan Zhao 写道:
> On Wed, Aug 16, 2023 at 11:44:29AM +0800, bibo mao wrote:
>>
>>
>> 在 2023/8/16 10:43, bibo mao 写道:
>>>
>>>
>>> 在 2023/8/15 22:50, Sean Christopherson 写道:
>>>> On Tue, Aug 15, 2023, Yan Zhao wrote:
>>>>> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
>>>>>>>> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
>>>>>>>>
>>>>>>>> Compile tested only.
>>>>>>>
>>>>>>> I don't find a matching end to each
>>>>>>> mmu_notifier_invalidate_range_start_nonblock().
>>>>>>
>>>>>> It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
>>>>>>
>>>>>> 	if (range.start)
>>>>>> 		mmu_notifier_invalidate_range_end(&range);
>>>>> No, It doesn't work for mmu_notifier_invalidate_range_start() sent in change_pte_range(),
>>>>> if we only want the range to include pages successfully set to PROT_NONE.
>>>>
>>>> Precise invalidation was a non-goal for my hack-a-patch.  The intent was purely
>>>> to defer invalidation until it was actually needed, but still perform only a
>>>> single notification so as to batch the TLB flushes, e.g. the start() call still
>>>> used the original @end.
>>>>
>>>> The idea was to play nice with the scenario where nothing in a VMA could be migrated.
>>>> It was complete untested though, so it may not have actually done anything to reduce
>>>> the number of pointless invalidations.
>>> For numa-balance scenery, can original page still be used by application even if pte
>>> is changed with PROT_NONE?  If it can be used, maybe we can zap shadow mmu and flush tlb
> For GUPs that does not honor FOLL_HONOR_NUMA_FAULT, yes,
> 
> See https://lore.kernel.org/all/20230803143208.383663-1-david@redhat.com/
> 
>> Since there is kvm_mmu_notifier_change_pte notification when numa page is replaced with
>> new page, my meaning that can original page still be used by application even if pte
>> is changed with PROT_NONE and before replaced with new page?
> It's not .change_pte() notification, which is sent when COW.
> The do_numa_page()/do_huge_pmd_numa_page() will try to unmap old page
> protected with PROT_NONE, and if every check passes, a separate
> .invalidate_range_start()/end() with event type MMU_NOTIFY_CLEAR will be
> sent.
Yes, you are right. The change_pte() notification will be called from
migrate_vma_pages(); I confused it with numa page migration. However,
invalidate_range_start()/end() with event type MMU_NOTIFY_CLEAR will also be
sent when the new page is replaced.
> 
> So, I think KVM (though it honors FOLL_HONOR_NUMA_FAULT), can safely
> keep mapping maybe-dma pages until MMU_NOTIFY_CLEAR is sent.
> (this approach is implemented in RFC v1
> https://lore.kernel.org/all/20230810085636.25914-1-yan.y.zhao@intel.com/)
> 
>>
>> And for primary mmu, tlb is flushed after pte is changed with PROT_NONE and 
>> after mmu_notifier_invalidate_range_end notification for secondary mmu.
>> Regards
>> Bibo Mao
> 
>>>> in notification mmu_notifier_invalidate_range_end with precised range, the range can
> But I don't think flush tlb only in the .invalidate_range_end() in
> secondary MMU is a good idea.
I have no good idea, and it is beyond my ability to modify the kvm framework now :(

> Flush must be done before kvm->mmu_lock is unlocked, otherwise,
> confusion will be caused when multiple threads trying to update the
> secondary MMU.
Since the tlb flush is delayed until after all pte entries are cleared, and currently
there is no ranged tlb flush supported for secondary mmu, I don't know why there
is confusion before or after kvm->mmu_lock.

Regards
Bibo Mao
> 
>>>> be cross-range between range mmu_gather and mmu_notifier_range.
> 
>
bibo mao Aug. 16, 2023, 7:53 a.m. UTC | #17
在 2023/8/16 15:18, Yan Zhao 写道:
> On Wed, Aug 16, 2023 at 03:29:22PM +0800, bibo mao wrote:
>>> Flush must be done before kvm->mmu_lock is unlocked, otherwise,
>>> confusion will be caused when multiple threads trying to update the
>>> secondary MMU.
>> Since tlb flush is delayed after all pte entries are cleared, and currently
>> there is no tlb flush range supported for secondary mmu. I do know why there
>> is confusion before or after kvm->mmu_lock.
> 
> Oh, do you mean only do kvm_unmap_gfn_range() in .invalidate_range_end()?
Yes, it is just a sketchy thought for the numa balance scenario:
do kvm_unmap_gfn_range() in invalidate_range_end rather than
invalidate_range_start.
 
> Then check if PROT_NONE is set in primary MMU before unmap?
> Looks like a good idea, I need to check if it's feasible.
> Thanks!
> 
>
Sean Christopherson Aug. 16, 2023, 1:39 p.m. UTC | #18
On Wed, Aug 16, 2023, bibo mao wrote:
> 
> 
> 在 2023/8/16 15:18, Yan Zhao 写道:
> > On Wed, Aug 16, 2023 at 03:29:22PM +0800, bibo mao wrote:
> >>> Flush must be done before kvm->mmu_lock is unlocked, otherwise,
> >>> confusion will be caused when multiple threads trying to update the
> >>> secondary MMU.
> >> Since tlb flush is delayed after all pte entries are cleared, and currently
> >> there is no tlb flush range supported for secondary mmu. I do know why there
> >> is confusion before or after kvm->mmu_lock.
> > 
> > Oh, do you mean only do kvm_unmap_gfn_range() in .invalidate_range_end()?
> yes, it is just sketchy thought for numa balance scenery, 
> do kvm_unmap_gfn_range() in invalidate_range_end rather than
> invalidate_range_start.

That is not an option; it's a direct violation of the mmu_notifier contract.
Secondary MMUs must drop all references before returning from invalidate_range_start().
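That contract can be modeled as follows (again only an illustration, not kernel code): by the time invalidate_range_start() returns, no secondary-MMU mapping may survive in the range, and the fault path must refuse to install new mappings until invalidate_range_end():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 4

struct secondary_mmu {
	bool mapped[NPAGES];
	int invalidate_in_progress;
};

static void invalidate_range_start(struct secondary_mmu *mmu,
				   size_t start, size_t end)
{
	mmu->invalidate_in_progress++;
	/* Contract: all mappings in [start, end) are gone on return. */
	for (size_t i = start; i < end && i < NPAGES; i++)
		mmu->mapped[i] = false;
}

static void invalidate_range_end(struct secondary_mmu *mmu)
{
	mmu->invalidate_in_progress--;
}

/* Fault path: refuses to install a mapping while invalidation is live. */
static bool try_map(struct secondary_mmu *mmu, size_t page)
{
	if (mmu->invalidate_in_progress)
		return false;	/* caller must retry later */
	mmu->mapped[page] = true;
	return true;
}
```

Deferring the unmap to the end() side, as proposed, would break the invariant that no stale secondary mapping exists once start() has returned.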
diff mbox series

Patch

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dfbaafbe3a00..907444a1761b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -711,6 +711,20 @@  static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_change_spte_gfn);
 }
 
+static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
+					  struct mm_struct *mm,
+					  unsigned long start,
+					  unsigned long end)
+{
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+
+	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
+	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
+		return;
+
+	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
+}
+
 void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
 			      unsigned long end)
 {
@@ -744,14 +758,18 @@  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	bool is_numa = (range->event == MMU_NOTIFY_PROTECTION_VMA) &&
+		       (range->flags & MMU_NOTIFIER_RANGE_NUMA);
 	const struct kvm_hva_range hva_range = {
 		.start		= range->start,
 		.end		= range->end,
 		.pte		= __pte(0),
-		.handler	= kvm_unmap_gfn_range,
+		.handler	= !is_numa ? kvm_unmap_gfn_range :
+				  (void *)kvm_null_fn,
 		.on_lock	= kvm_mmu_invalidate_begin,
-		.on_unlock	= kvm_arch_guest_memory_reclaimed,
-		.flush_on_ret	= true,
+		.on_unlock	= !is_numa ? kvm_arch_guest_memory_reclaimed :
+				  (void *)kvm_null_fn,
+		.flush_on_ret	= !is_numa ? true : false,
 		.may_block	= mmu_notifier_range_blockable(range),
 	};
 
@@ -899,6 +917,7 @@  static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
 	.clear_young		= kvm_mmu_notifier_clear_young,
 	.test_young		= kvm_mmu_notifier_test_young,
 	.change_pte		= kvm_mmu_notifier_change_pte,
+	.numa_protect		= kvm_mmu_notifier_numa_protect,
 	.release		= kvm_mmu_notifier_release,
 };
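For reference, the control flow the patch implements, skipping the eager unmap for NUMA-protection events in invalidate_range_start() and deferring it to .numa_protect(), which only acts while an invalidation is outstanding, can be modeled in userspace like this (a sketch with invented types, not the kernel code above):

```c
#include <assert.h>
#include <stdbool.h>

enum mmu_event { MMU_NOTIFY_CLEAR, MMU_NOTIFY_PROTECTION_VMA };
#define MMU_RANGE_NUMA 0x1

struct kvm_model {
	int unmap_calls;		/* kvm_unmap_gfn_range() invocations */
	int flushes;			/* remote TLB flushes issued */
	int invalidate_in_progress;	/* start()s without a matching end() */
};

/* Mirrors the patch: NUMA-protection events are not unmapped eagerly. */
static void range_start(struct kvm_model *kvm, enum mmu_event event,
			unsigned int flags)
{
	bool is_numa = event == MMU_NOTIFY_PROTECTION_VMA &&
		       (flags & MMU_RANGE_NUMA);

	kvm->invalidate_in_progress++;
	if (!is_numa) {
		kvm->unmap_calls++;
		kvm->flushes++;
	}
}

static void range_end(struct kvm_model *kvm)
{
	kvm->invalidate_in_progress--;
}

/*
 * Mirrors kvm_mmu_notifier_numa_protect(): unmap only for pages that were
 * really made PROT_NONE, i.e. while the start()/end() window is open.
 */
static void numa_protect(struct kvm_model *kvm)
{
	if (!kvm->invalidate_in_progress)
		return;
	kvm->unmap_calls++;
	kvm->flushes++;
}
```

A non-NUMA event unmaps and flushes from range_start() as before; a NUMA-protection event causes no work until (and unless) the .numa_protect() callback arrives inside the invalidation window.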