Message ID | 20230810090218.26244-1-yan.y.zhao@intel.com (mailing list archive)
---|---
State | New, archived
Series | Reduce NUMA balance caused TLB-shootdowns in a VM
On 2023/8/10 17:02, Yan Zhao wrote:
> Register to the .numa_protect() callback in mmu notifier so that KVM can get
> accurate information about when a page is PROT_NONE protected in the primary
> MMU and unmap it in the secondary MMU accordingly.
> 
> In KVM's .invalidate_range_start() handler, if the event is to notify that
> the range may be protected to PROT_NONE for NUMA migration purpose,
> don't do the unmapping in the secondary MMU. Hold on until .numa_protect()
> comes.
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  virt/kvm/kvm_main.c | 25 ++++++++++++++++++++++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index dfbaafbe3a00..907444a1761b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -711,6 +711,20 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_change_spte_gfn);
>  }
>  
> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> +					  struct mm_struct *mm,
> +					  unsigned long start,
> +					  unsigned long end)
> +{
> +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> +
> +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> +	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> +		return;
> +
> +	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> +}
numa balance will scan wide memory range, and there will be one time
ipi notification with kvm_flush_remote_tlbs. With page level notification,
it may bring out lots of flush remote tlb ipi notification.

however numa balance notification, pmd table of vm maybe needs not be freed
in kvm_unmap_gfn_range.
Regards
Bibo Mao

> +
>  void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
>  			      unsigned long end)
>  {
> @@ -744,14 +758,18 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  					const struct mmu_notifier_range *range)
>  {
>  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> +	bool is_numa = (range->event == MMU_NOTIFY_PROTECTION_VMA) &&
> +		       (range->flags & MMU_NOTIFIER_RANGE_NUMA);
>  	const struct kvm_hva_range hva_range = {
>  		.start		= range->start,
>  		.end		= range->end,
>  		.pte		= __pte(0),
> -		.handler	= kvm_unmap_gfn_range,
> +		.handler	= !is_numa ? kvm_unmap_gfn_range :
> +				  (void *)kvm_null_fn,
>  		.on_lock	= kvm_mmu_invalidate_begin,
> -		.on_unlock	= kvm_arch_guest_memory_reclaimed,
> -		.flush_on_ret	= true,
> +		.on_unlock	= !is_numa ? kvm_arch_guest_memory_reclaimed :
> +				  (void *)kvm_null_fn,
> +		.flush_on_ret	= !is_numa ? true : false,
>  		.may_block	= mmu_notifier_range_blockable(range),
>  	};
> 
> @@ -899,6 +917,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
>  	.clear_young		= kvm_mmu_notifier_clear_young,
>  	.test_young		= kvm_mmu_notifier_test_young,
>  	.change_pte		= kvm_mmu_notifier_change_pte,
> +	.numa_protect		= kvm_mmu_notifier_numa_protect,
>  	.release		= kvm_mmu_notifier_release,
>  };
> 
> > +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> > +					     struct mm_struct *mm,
> > +					     unsigned long start,
> > +					     unsigned long end)
> > +{
> > +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> > +
> > +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> > +	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> > +		return;
> > +
> > +	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> > +}
> numa balance will scan wide memory range, and there will be one time
Though scanning memory range is wide, .invalidate_range_start() is sent
for each 2M range.

> ipi notification with kvm_flush_remote_tlbs. With page level notification,
> it may bring out lots of flush remote tlb ipi notification.

Hmm, for VMs with assigned devices, apparently, the flush remote tlb IPIs
will be reduced to 0 with this series.

For VMs without assigned devices or mdev devices, I was previously also
worried about that there might be more IPIs.
But with current test data, there's no more remote tlb IPIs on average.

The reason is below:

Before this series, kvm_unmap_gfn_range() is called for once for a 2M
range.
After this series, kvm_unmap_gfn_range() is called for once if the 2M is
mapped to a huge page in primary MMU, and called for at most 512 times
if mapped to 4K pages in primary MMU.


Though kvm_unmap_gfn_range() is only called once before this series,
as the range is blockable, when there're contentions, remote tlb IPIs
can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched())
if the pages are mapped in 4K in secondary MMU.

With this series, on the other hand, .numa_protect() sets range to be
unblockable. So there could be less remote tlb IPIs when a 2M range is
mapped into small PTEs in secondary MMU.
Besides, .numa_protect() is not sent for all pages in a given 2M range.

Below is my testing data on a VM without assigned devices:
The data is an average of 10 times guest boot-up.
data               | numa balancing caused  | numa balancing caused
on average         | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs()
-------------------|------------------------|--------------------------
before this series | 35                     | 8625
after this series  | 10037                  | 4610

For a single guest bootup,

                   | numa balancing caused  | numa balancing caused
best data          | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs()
-------------------|------------------------|--------------------------
before this series | 28                     | 13
after this series  | 406                    | 195

                   | numa balancing caused  | numa balancing caused
worst data         | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs()
-------------------|------------------------|--------------------------
before this series | 44                     | 43920
after this series  | 17352                  | 8668

> 
> however numa balance notification, pmd table of vm maybe needs not be freed
> in kvm_unmap_gfn_range.
> 
On 2023/8/11 11:45, Yan Zhao wrote:
>>> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
>>> +					     struct mm_struct *mm,
>>> +					     unsigned long start,
>>> +					     unsigned long end)
>>> +{
>>> +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>>> +
>>> +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
>>> +	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
>>> +		return;
>>> +
>>> +	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
>>> +}
>> numa balance will scan wide memory range, and there will be one time
> Though scanning memory range is wide, .invalidate_range_start() is sent
> for each 2M range.
yes, range is huge page size when changing numa protection during numa scanning.

> 
>> ipi notification with kvm_flush_remote_tlbs. With page level notification,
>> it may bring out lots of flush remote tlb ipi notification.
> 
> Hmm, for VMs with assigned devices, apparently, the flush remote tlb IPIs
> will be reduced to 0 with this series.
> 
> For VMs without assigned devices or mdev devices, I was previously also
> worried about that there might be more IPIs.
> But with current test data, there's no more remote tlb IPIs on average.
> 
> The reason is below:
> 
> Before this series, kvm_unmap_gfn_range() is called for once for a 2M
> range.
> After this series, kvm_unmap_gfn_range() is called for once if the 2M is
> mapped to a huge page in primary MMU, and called for at most 512 times
> if mapped to 4K pages in primary MMU.
> 
> 
> Though kvm_unmap_gfn_range() is only called once before this series,
> as the range is blockable, when there're contentions, remote tlb IPIs
> can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched())
I do not know much about x86, does this happen always or only need reschedule
from code? so that there will be many times of tlb IPIs in only once function
call about kvm_unmap_gfn_range.
> if the pages are mapped in 4K in secondary MMU.
> 
> With this series, on the other hand, .numa_protect() sets range to be
> unblockable. So there could be less remote tlb IPIs when a 2M range is
> mapped into small PTEs in secondary MMU.
> Besides, .numa_protect() is not sent for all pages in a given 2M range.
No, .numa_protect() is not sent for all pages. It depends on the workload,
whether the page is accessed for different cpu threads cross-nodes.

> 
> Below is my testing data on a VM without assigned devices:
> The data is an average of 10 times guest boot-up.
> 
> data               | numa balancing caused  | numa balancing caused
> on average         | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs()
> -------------------|------------------------|--------------------------
> before this series | 35                     | 8625
> after this series  | 10037                  | 4610
just be cautious, before the series there are 8625/35 = 246 IPI tlb flush ops
during one time kvm_unmap_gfn_range, is that x86 specific or generic?

By the way are primary mmu and secondary mmu both 4K small page size "on average"?

Regards
Bibo Mao

> 
> For a single guest bootup,
> 
>                    | numa balancing caused  | numa balancing caused
> best data          | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs()
> -------------------|------------------------|--------------------------
> before this series | 28                     | 13
> after this series  | 406                    | 195
> 
>                    | numa balancing caused  | numa balancing caused
> worst data         | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs()
> -------------------|------------------------|--------------------------
> before this series | 44                     | 43920
> after this series  | 17352                  | 8668
> 
> 
> > 
> > however numa balance notification, pmd table of vm maybe needs not be freed
> > in kvm_unmap_gfn_range.
> > 
> 
On Fri, Aug 11, 2023 at 03:40:44PM +0800, bibo mao wrote:
> 
> On 2023/8/11 11:45, Yan Zhao wrote:
> >>> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> >>> +					       struct mm_struct *mm,
> >>> +					       unsigned long start,
> >>> +					       unsigned long end)
> >>> +{
> >>> +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> >>> +
> >>> +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> >>> +	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> >>> +		return;
> >>> +
> >>> +	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> >>> +}
> >> numa balance will scan wide memory range, and there will be one time
> > Though scanning memory range is wide, .invalidate_range_start() is sent
> > for each 2M range.
> yes, range is huge page size when changing numa protection during numa scanning.
> 
> > 
> >> ipi notification with kvm_flush_remote_tlbs. With page level notification,
> >> it may bring out lots of flush remote tlb ipi notification.
> > 
> > Hmm, for VMs with assigned devices, apparently, the flush remote tlb IPIs
> > will be reduced to 0 with this series.
> > 
> > For VMs without assigned devices or mdev devices, I was previously also
> > worried about that there might be more IPIs.
> > But with current test data, there's no more remote tlb IPIs on average.
> > 
> > The reason is below:
> > 
> > Before this series, kvm_unmap_gfn_range() is called for once for a 2M
> > range.
> > After this series, kvm_unmap_gfn_range() is called for once if the 2M is
> > mapped to a huge page in primary MMU, and called for at most 512 times
> > if mapped to 4K pages in primary MMU.
> > 
> > 
> > Though kvm_unmap_gfn_range() is only called once before this series,
> > as the range is blockable, when there're contentions, remote tlb IPIs
> > can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched())
> I do not know much about x86, does this happen always or only need reschedule
Ah, sorry, I missed platforms other than x86.
Maybe there will be a big difference on other platforms.

> from code? so that there will be many times of tlb IPIs in only once function
Only when the MMU lock is contended. But it's not rare, because of the
contention in TDP page faults.

> call about kvm_unmap_gfn_range.
> 
> > if the pages are mapped in 4K in secondary MMU.
> > 
> > With this series, on the other hand, .numa_protect() sets range to be
> > unblockable. So there could be less remote tlb IPIs when a 2M range is
> > mapped into small PTEs in secondary MMU.
> > Besides, .numa_protect() is not sent for all pages in a given 2M range.
> No, .numa_protect() is not sent for all pages. It depends on the workload,
> whether the page is accessed for different cpu threads cross-nodes.
The .numa_protect() is called in patch 4 only when PROT_NONE is set on
the page.

> > 
> > Below is my testing data on a VM without assigned devices:
> > The data is an average of 10 times guest boot-up.
> > 
> > data               | numa balancing caused  | numa balancing caused
> > on average         | #kvm_unmap_gfn_range() | #kvm_flush_remote_tlbs()
> > -------------------|------------------------|--------------------------
> > before this series | 35                     | 8625
> > after this series  | 10037                  | 4610
> just be cautious, before the series there are 8625/35 = 246 IPI tlb flush ops
> during one time kvm_unmap_gfn_range, is that x86 specific or generic?
Only on x86. I didn't test on other platforms.

> 
> By the way are primary mmu and secondary mmu both 4K small page size "on average"?
No. 4K and 2M combined in both.
On Fri, Aug 11, 2023, Yan Zhao wrote:
> On Fri, Aug 11, 2023 at 03:40:44PM +0800, bibo mao wrote:
> > 
> > On 2023/8/11 11:45, Yan Zhao wrote:
> > >>> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> > >>> +					         struct mm_struct *mm,
> > >>> +					         unsigned long start,
> > >>> +					         unsigned long end)
> > >>> +{
> > >>> +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> > >>> +
> > >>> +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> > >>> +	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> > >>> +		return;
> > >>> +
> > >>> +	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> > >>> +}
> > >> numa balance will scan wide memory range, and there will be one time
> > > Though scanning memory range is wide, .invalidate_range_start() is sent
> > > for each 2M range.
> > yes, range is huge page size when changing numa protection during numa scanning.
> > 
> > > 
> > >> ipi notification with kvm_flush_remote_tlbs. With page level notification,
> > >> it may bring out lots of flush remote tlb ipi notification.
> > > 
> > > Hmm, for VMs with assigned devices, apparently, the flush remote tlb IPIs
> > > will be reduced to 0 with this series.
> > > 
> > > For VMs without assigned devices or mdev devices, I was previously also
> > > worried about that there might be more IPIs.
> > > But with current test data, there's no more remote tlb IPIs on average.
> > > 
> > > The reason is below:
> > > 
> > > Before this series, kvm_unmap_gfn_range() is called for once for a 2M
> > > range.

No, it's potentially called once per 1GiB range. change_pmd_range() invokes the
mmu_notifier with addr+end, where "end" is the end of the range covered by the
PUD, not the end of the current PMD. So the worst case scenario would be a
256k increase. Of course, if you have to migrate an entire 1GiB chunk of memory
then you likely have bigger problems, but still.
> > > After this series, kvm_unmap_gfn_range() is called for once if the 2M is
> > > mapped to a huge page in primary MMU, and called for at most 512 times
> > > if mapped to 4K pages in primary MMU.
> > > 
> > > 
> > > Though kvm_unmap_gfn_range() is only called once before this series,
> > > as the range is blockable, when there're contentions, remote tlb IPIs
> > > can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched())
> > I do not know much about x86, does this happen always or only need reschedule
> Ah, sorry, I missed platforms other than x86.
> Maybe there will be a big difference on other platforms.
> 
> > from code? so that there will be many times of tlb IPIs in only once function
> Only when the MMU lock is contended. But it's not rare, because of the
> contention in TDP page faults.

No? I don't see how mmu_lock contention would affect the number of calls to
kvm_flush_remote_tlbs(). If vCPUs are running and not generating faults, i.e.
not trying to access the range in question, then every zap will generate a remote
TLB flush and thus send IPIs to all running vCPUs.

> > call about kvm_unmap_gfn_range.
> > 
> > > if the pages are mapped in 4K in secondary MMU.
> > > 
> > > With this series, on the other hand, .numa_protect() sets range to be
> > > unblockable. So there could be less remote tlb IPIs when a 2M range is
> > > mapped into small PTEs in secondary MMU.
> > > Besides, .numa_protect() is not sent for all pages in a given 2M range.
> > No, .numa_protect() is not sent for all pages. It depends on the workload,
> > whether the page is accessed for different cpu threads cross-nodes.
> The .numa_protect() is called in patch 4 only when PROT_NONE is set on
> the page.

I'm strongly opposed to adding MMU_NOTIFIER_RANGE_NUMA. It's too much of a one-off,
and losing the batching of invalidations makes me nervous. As Bibo points out,
the behavior will vary based on the workload, VM configuration, etc.
There's also a *very* subtle change, in that the notification will be sent while
holding the PMD/PTE lock. Taking KVM's mmu_lock under that is *probably* ok, but
I'm not exactly 100% confident on that. And the only reason there isn't a more
obvious bug is because kvm_handle_hva_range() sets may_block to false, e.g. KVM
won't yield if there's mmu_lock contention.

However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
I think we can achieve what you want without losing batching, and without changing
secondary MMUs.

Rather than muck with the notification types and add a one-off for NUMA, just
defer the notification until a present PMD/PTE is actually going to be modified.
It's not the prettiest, but other than the locking, I don't think it has any chance
of regressing other workloads/configurations.

Note, I'm assuming secondary MMUs aren't allowed to map swap entries...

Compile tested only.

From: Sean Christopherson <seanjc@google.com>
Date: Fri, 11 Aug 2023 10:03:36 -0700
Subject: [PATCH] tmp

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/huge_mm.h |  4 +++-
 include/linux/mm.h      |  3 +++
 mm/huge_memory.c        |  5 ++++-
 mm/mprotect.c           | 47 +++++++++++++++++++++++++++++------------
 4 files changed, 44 insertions(+), 15 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 20284387b841..b88316adaaad 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -7,6 +7,8 @@
 
 #include <linux/fs.h> /* only for vma_is_dax() */
 
+struct mmu_notifier_range;
+
 vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
@@ -38,7 +40,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 		   unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd);
 int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
-		    unsigned long cp_flags);
+		    unsigned long cp_flags, struct mmu_notifier_range *range);
 
 vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
 vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2dd73e4f3d8e..284f61cf9c37 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2478,6 +2478,9 @@ static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma
 	return !!(vma->vm_flags & VM_WRITE);
 
 }
+
+void change_pmd_range_notify_secondary_mmus(unsigned long addr,
+					    struct mmu_notifier_range *range);
 bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte);
 extern long change_protection(struct mmu_gather *tlb,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a71cf686e3b2..47c7712b163e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1808,7 +1808,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
  */
 int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
-		    unsigned long cp_flags)
+		    unsigned long cp_flags, struct mmu_notifier_range *range)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
@@ -1893,6 +1893,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		    !toptier)
 			xchg_page_access_time(page, jiffies_to_msecs(jiffies));
 	}
+
+	change_pmd_range_notify_secondary_mmus(addr, range);
+
 	/*
 	 * In case prot_numa, we are under mmap_read_lock(mm). It's critical
 	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
diff --git a/mm/mprotect.c b/mm/mprotect.c
index d1a809167f49..f5844adbe0cb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -82,7 +82,8 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 
 static long change_pte_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
-		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
+		unsigned long end, pgprot_t newprot, unsigned long cp_flags,
+		struct mmu_notifier_range *range)
 {
 	pte_t *pte, oldpte;
 	spinlock_t *ptl;
@@ -164,8 +165,12 @@ static long change_pte_range(struct mmu_gather *tlb,
 				    !toptier)
 					xchg_page_access_time(page,
 						jiffies_to_msecs(jiffies));
+
+
 			}
 
+			change_pmd_range_notify_secondary_mmus(addr, range);
+
 			oldpte = ptep_modify_prot_start(vma, addr, pte);
 			ptent = pte_modify(oldpte, newprot);
 
@@ -355,6 +360,17 @@ pgtable_populate_needed(struct vm_area_struct *vma, unsigned long cp_flags)
 		err;							\
 	})
 
+void change_pmd_range_notify_secondary_mmus(unsigned long addr,
+					    struct mmu_notifier_range *range)
+{
+	if (range->start)
+		return;
+
+	VM_WARN_ON(addr >= range->end);
+	range->start = addr;
+	mmu_notifier_invalidate_range_start_nonblock(range);
+}
+
 static inline long change_pmd_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pud_t *pud, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
@@ -365,7 +381,14 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 	unsigned long nr_huge_updates = 0;
 	struct mmu_notifier_range range;
 
-	range.start = 0;
+	/*
+	 * Defer invalidation of secondary MMUs until a PMD/PTE change is
+	 * imminent, e.g. NUMA balancing in particular can "fail" for certain
+	 * types of mappings.  Initialize range.start to '0' and use it to
+	 * track whether or not the invalidation notification has been sent.
+	 */
+	mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
+				vma->vm_mm, 0, end);
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -383,18 +406,16 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		if (pmd_none(*pmd))
 			goto next;
 
-		/* invoke the mmu notifier if the pmd is populated */
-		if (!range.start) {
-			mmu_notifier_range_init(&range,
-						MMU_NOTIFY_PROTECTION_VMA, 0,
-						vma->vm_mm, addr, end);
-			mmu_notifier_invalidate_range_start(&range);
-		}
-
 		_pmd = pmdp_get_lockless(pmd);
 		if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
 			if ((next - addr != HPAGE_PMD_SIZE) ||
 			    pgtable_split_needed(vma, cp_flags)) {
+				/*
+				 * FIXME: __split_huge_pmd() performs its own
+				 * mmu_notifier invalidation prior to clearing
+				 * the PMD, ideally all invalidations for the
+				 * range would be batched.
+				 */
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 				/*
 				 * For file-backed, the pmd could have been
@@ -407,8 +428,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 					break;
 				}
 			} else {
-				ret = change_huge_pmd(tlb, vma, pmd,
-						      addr, newprot, cp_flags);
+				ret = change_huge_pmd(tlb, vma, pmd, addr,
+						      newprot, cp_flags, &range);
 				if (ret) {
 					if (ret == HPAGE_PMD_NR) {
 						pages += HPAGE_PMD_NR;
@@ -423,7 +444,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		}
 
 		ret = change_pte_range(tlb, vma, pmd, addr, next, newprot,
-				       cp_flags);
+				       cp_flags, &range);
 		if (ret < 0)
 			goto again;
 		pages += ret;

base-commit: 1f40f634009556c119974cafce4c7b2f9b8c58ad
--
> However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
> I think we can achieve what you want without losing batching, and without changing
> secondary MMUs.

No, this is not legal.

> +void change_pmd_range_notify_secondary_mmus(unsigned long addr,
> +					     struct mmu_notifier_range *range)
> +{
> +	if (range->start)
> +		return;
> +
> +	VM_WARN_ON(addr >= range->end);
> +	range->start = addr;
> +	mmu_notifier_invalidate_range_start_nonblock(range);
> +}

The 'nonblock' entry point is advisory; if it fails, the caller must not
change the mm.

Jason
On Fri, Aug 11, 2023 at 10:14:45AM -0700, Sean Christopherson wrote:
> On Fri, Aug 11, 2023, Yan Zhao wrote:
> > On Fri, Aug 11, 2023 at 03:40:44PM +0800, bibo mao wrote:
> > > 
> > > On 2023/8/11 11:45, Yan Zhao wrote:
> > > >>> +static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
> > > >>> +					       struct mm_struct *mm,
> > > >>> +					       unsigned long start,
> > > >>> +					       unsigned long end)
> > > >>> +{
> > > >>> +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> > > >>> +
> > > >>> +	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
> > > >>> +	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
> > > >>> +		return;
> > > >>> +
> > > >>> +	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
> > > >>> +}
> > > >> numa balance will scan wide memory range, and there will be one time
> > > > Though scanning memory range is wide, .invalidate_range_start() is sent
> > > > for each 2M range.
> > > yes, range is huge page size when changing numa protection during numa scanning.
> > > 
> > > > 
> > > >> ipi notification with kvm_flush_remote_tlbs. With page level notification,
> > > >> it may bring out lots of flush remote tlb ipi notification.
> > > > 
> > > > Hmm, for VMs with assigned devices, apparently, the flush remote tlb IPIs
> > > > will be reduced to 0 with this series.
> > > > 
> > > > For VMs without assigned devices or mdev devices, I was previously also
> > > > worried about that there might be more IPIs.
> > > > But with current test data, there's no more remote tlb IPIs on average.
> > > > 
> > > > The reason is below:
> > > > 
> > > > Before this series, kvm_unmap_gfn_range() is called for once for a 2M
> > > > range.
> 
> No, it's potentially called once per 1GiB range. change_pmd_range() invokes the
> mmu_notifier with addr+end, where "end" is the end of the range covered by the
> PUD, not the end of the current PMD. So the worst case scenario would be a
> 256k increase. Of course, if you have to migrate an entire 1GiB chunk of memory
> then you likely have bigger problems, but still.

Yes, thanks for pointing it out. I realized it after re-reading the code and
re-checking the log message.
This wider range also explains the collected worst-case data:
44 kvm_unmap_gfn_range() calls caused 43920 kvm_flush_remote_tlbs() requests,
i.e. 998 remote tlb flush requests per kvm_unmap_gfn_range().

> > > > > After this series, kvm_unmap_gfn_range() is called for once if the 2M is
> > > > > mapped to a huge page in primary MMU, and called for at most 512 times
> > > > > if mapped to 4K pages in primary MMU.
> > > > > 
> > > > > 
> > > > > Though kvm_unmap_gfn_range() is only called once before this series,
> > > > > as the range is blockable, when there're contentions, remote tlb IPIs
> > > > > can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched())
> > > > I do not know much about x86, does this happen always or only need reschedule
> > > Ah, sorry, I missed platforms other than x86.
> > > Maybe there will be a big difference on other platforms.
> > > 
> > > > from code? so that there will be many times of tlb IPIs in only once function
> > > Only when the MMU lock is contended. But it's not rare, because of the
> > > contention in TDP page faults.
> 
> No? I don't see how mmu_lock contention would affect the number of calls to
> kvm_flush_remote_tlbs(). If vCPUs are running and not generating faults, i.e.
> not trying to access the range in question, then every zap will generate a remote
> TLB flush and thus send IPIs to all running vCPUs.

In tdp_mmu_zap_leafs() for kvm_unmap_gfn_range(), the flow is like this:

1. Check necessity of mmu_lock reschedule
   1.a -- if yes,
          1.a.1 do kvm_flush_remote_tlbs() if flush is true.
          1.a.2 reschedule of mmu_lock
          1.a.3 goto 2.
   1.b -- if no, goto 2.
2. Zap present leaf entry, and set flush to be true
3. Get next gfn and go to 1

With a wide range passed to .invalidate_range_start()/end(), it's easy for
rwlock_needbreak(&kvm->mmu_lock) to be true (goes to 1.a and 1.a.1), and in
tdp_mmu_zap_leafs(), before rescheduling of mmu_lock, a tlb flush request
is performed (1.a.1).

Take a real case for example:
NUMA balancing requests KVM to zap the GFN range from 0xb4000 to 0xb9800.
When KVM zaps GFN 0xb807b, it finds that mmu_lock needs a break because a
vCPU is faulting on GFN 0xb804c, and vCPUs will constantly retry the fault
on 0xb804c until .invalidate_range_end().
For this zap of GFN range 0xb4000 - 0xb9800, vCPUs retry the fault on GFN
0xb804c 3298 times, on GFN 0xb8074 3161 times, and on GFN 0xb81ce 15190
times. The accesses to these 3 GFNs cause 3209 kvm_flush_remote_tlbs()
calls for one kvm_unmap_gfn_range(), because flush requests are not
batched. (This range is mapped in both 2M and 4K in the secondary MMU.)

Without the contention on mmu_lock, tdp_mmu_zap_leafs() just combines all
flush requests and leaves a single kvm_flush_remote_tlbs() to be called
after it returns.

In my test scenario, which is VM boot-up, this kind of contention is
frequent. Here's the data for 10 VM boot-ups collected previously.

       | count of            | count of                |
       | kvm_unmap_gfn_range | kvm_flush_remote_tlbs() |
-------|---------------------|-------------------------|
1      | 38                  | 14                      |
2      | 28                  | 5191                    |
3      | 36                  | 13398                   |
4      | 44                  | 43920                   |
5      | 28                  | 14                      |
6      | 36                  | 10803                   |
7      | 38                  | 892                     |
8      | 32                  | 5905                    |
9      | 32                  | 13                      |
10     | 38                  | 6096                    |
-------|---------------------|-------------------------|
average| 35                  | 8625                    |

I wonder if we could loosen the frequency of checking for reschedule in
tdp_mmu_iter_cond_resched() if the zap range is wide, e.g.

	if (iter->next_last_level_gfn ==
	    iter->yielded_gfn + KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
		return false;

> 
> > > call about kvm_unmap_gfn_range.
> > > 
> > > > if the pages are mapped in 4K in secondary MMU.
> > > > 
> > > > With this series, on the other hand, .numa_protect() sets range to be
> > > > unblockable. So there could be less remote tlb IPIs when a 2M range is
> > > > mapped into small PTEs in secondary MMU.
> > > > Besides, .numa_protect() is not sent for all pages in a given 2M range.
> > > No, .numa_protect() is not sent for all pages. It depends on the workload,
> > > whether the page is accessed for different cpu threads cross-nodes.
> > The .numa_protect() is called in patch 4 only when PROT_NONE is set on
> > the page.
> 
> I'm strongly opposed to adding MMU_NOTIFIER_RANGE_NUMA. It's too much of a one-off,
> and losing the batching of invalidations makes me nervous. As Bibo points out,
> the behavior will vary based on the workload, VM configuration, etc.
> 
> There's also a *very* subtle change, in that the notification will be sent while
> holding the PMD/PTE lock. Taking KVM's mmu_lock under that is *probably* ok, but
> I'm not exactly 100% confident on that. And the only reason there isn't a more
MMU lock is a rwlock, which is a variant of spinlock. So, I think taking it
within the PMD/PTE lock is ok.
Actually, we have already done that during the .change_pte() notification:
kvm_mmu_notifier_change_pte() takes the KVM mmu_lock for write while the PTE
lock is held when sending set_pte_at_notify() --> .change_pte() handlers.

> obvious bug is because kvm_handle_hva_range() sets may_block to false, e.g. KVM
> won't yield if there's mmu_lock contention.
Yes, KVM won't yield and reschedule the KVM mmu_lock, so it's fine.

> However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then
> I think we can achieve what you want without losing batching, and without changing
> secondary MMUs.
> 
> Rather than muck with the notification types and add a one-off for NUMA, just
> defer the notification until a present PMD/PTE is actually going to be modified.
> It's not the prettiest, but other than the locking, I don't think has any chance > of regressing other workloads/configurations. > > Note, I'm assuming secondary MMUs aren't allowed to map swap entries... > > Compile tested only. I don't find a matching end to each mmu_notifier_invalidate_range_start_nonblock(). > > From: Sean Christopherson <seanjc@google.com> > Date: Fri, 11 Aug 2023 10:03:36 -0700 > Subject: [PATCH] tmp > > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > include/linux/huge_mm.h | 4 +++- > include/linux/mm.h | 3 +++ > mm/huge_memory.c | 5 ++++- > mm/mprotect.c | 47 +++++++++++++++++++++++++++++------------ > 4 files changed, 44 insertions(+), 15 deletions(-) > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index 20284387b841..b88316adaaad 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -7,6 +7,8 @@ > > #include <linux/fs.h> /* only for vma_is_dax() */ > > +struct mmu_notifier_range; > + > vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf); > int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, > pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, > @@ -38,7 +40,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr, > unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd); > int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, > pmd_t *pmd, unsigned long addr, pgprot_t newprot, > - unsigned long cp_flags); > + unsigned long cp_flags, struct mmu_notifier_range *range); > > vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write); > vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write); > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 2dd73e4f3d8e..284f61cf9c37 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2478,6 +2478,9 @@ static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma > return !!(vma->vm_flags & VM_WRITE); > > } 
> + > +void change_pmd_range_notify_secondary_mmus(unsigned long addr, > + struct mmu_notifier_range *range); > bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, > pte_t pte); > extern long change_protection(struct mmu_gather *tlb, > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index a71cf686e3b2..47c7712b163e 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1808,7 +1808,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr, > */ > int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, > pmd_t *pmd, unsigned long addr, pgprot_t newprot, > - unsigned long cp_flags) > + unsigned long cp_flags, struct mmu_notifier_range *range) > { > struct mm_struct *mm = vma->vm_mm; > spinlock_t *ptl; > @@ -1893,6 +1893,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, > !toptier) > xchg_page_access_time(page, jiffies_to_msecs(jiffies)); > } > + > + change_pmd_range_notify_secondary_mmus(addr, range); > + > /* > * In case prot_numa, we are under mmap_read_lock(mm). 
It's critical > * to not clear pmd intermittently to avoid race with MADV_DONTNEED > diff --git a/mm/mprotect.c b/mm/mprotect.c > index d1a809167f49..f5844adbe0cb 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -82,7 +82,8 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr, > > static long change_pte_range(struct mmu_gather *tlb, > struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, > - unsigned long end, pgprot_t newprot, unsigned long cp_flags) > + unsigned long end, pgprot_t newprot, unsigned long cp_flags, > + struct mmu_notifier_range *range) > { > pte_t *pte, oldpte; > spinlock_t *ptl; > @@ -164,8 +165,12 @@ static long change_pte_range(struct mmu_gather *tlb, > !toptier) > xchg_page_access_time(page, > jiffies_to_msecs(jiffies)); > + > + > } > > + change_pmd_range_notify_secondary_mmus(addr, range); > + > oldpte = ptep_modify_prot_start(vma, addr, pte); > ptent = pte_modify(oldpte, newprot); > > @@ -355,6 +360,17 @@ pgtable_populate_needed(struct vm_area_struct *vma, unsigned long cp_flags) > err; \ > }) > > +void change_pmd_range_notify_secondary_mmus(unsigned long addr, > + struct mmu_notifier_range *range) > +{ > + if (range->start) > + return; > + > + VM_WARN_ON(addr >= range->end); > + range->start = addr; > + mmu_notifier_invalidate_range_start_nonblock(range); This will cause range from addr to end to be invalidated, which may include pinned ranges. > +} > + > static inline long change_pmd_range(struct mmu_gather *tlb, > struct vm_area_struct *vma, pud_t *pud, unsigned long addr, > unsigned long end, pgprot_t newprot, unsigned long cp_flags) > @@ -365,7 +381,14 @@ static inline long change_pmd_range(struct mmu_gather *tlb, > unsigned long nr_huge_updates = 0; > struct mmu_notifier_range range; > > - range.start = 0; > + /* > + * Defer invalidation of secondary MMUs until a PMD/PTE change is > + * imminent, e.g. NUMA balancing in particular can "fail" for certain > + * types of mappings. 
Initialize range.start to '0' and use it to > + * track whether or not the invalidation notification has been set. > + */ > + mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0, > + vma->vm_mm, 0, end); > > pmd = pmd_offset(pud, addr); > do { > @@ -383,18 +406,16 @@ static inline long change_pmd_range(struct mmu_gather *tlb, > if (pmd_none(*pmd)) > goto next; > > - /* invoke the mmu notifier if the pmd is populated */ > - if (!range.start) { > - mmu_notifier_range_init(&range, > - MMU_NOTIFY_PROTECTION_VMA, 0, > - vma->vm_mm, addr, end); > - mmu_notifier_invalidate_range_start(&range); > - } > - > _pmd = pmdp_get_lockless(pmd); > if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) { > if ((next - addr != HPAGE_PMD_SIZE) || > pgtable_split_needed(vma, cp_flags)) { > + /* > + * FIXME: __split_huge_pmd() performs its own > + * mmu_notifier invalidation prior to clearing > + * the PMD, ideally all invalidations for the > + * range would be batched. > + */ > __split_huge_pmd(vma, pmd, addr, false, NULL); > /* > * For file-backed, the pmd could have been > @@ -407,8 +428,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb, > break; > } > } else { > - ret = change_huge_pmd(tlb, vma, pmd, > - addr, newprot, cp_flags); > + ret = change_huge_pmd(tlb, vma, pmd, addr, > + newprot, cp_flags, &range); > if (ret) { > if (ret == HPAGE_PMD_NR) { > pages += HPAGE_PMD_NR; > @@ -423,7 +444,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb, > } > > ret = change_pte_range(tlb, vma, pmd, addr, next, newprot, > - cp_flags); > + cp_flags, &range); > if (ret < 0) > goto again; > pages += ret; > > base-commit: 1f40f634009556c119974cafce4c7b2f9b8c58ad > -- > > >
On Mon, Aug 14, 2023 at 02:52:07PM +0800, Yan Zhao wrote:
> I wonder if we could loose the frequency to check for rescheduling in
> tdp_mmu_iter_cond_resched() if the zap range is wide, e.g.
>
> if (iter->next_last_level_gfn ==
>     iter->yielded_gfn + KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
>     return false;

Correction:

@@ -712,7 +713,8 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
 	WARN_ON(iter->yielded);
 
 	/* Ensure forward progress has been made before yielding. */
-	if (iter->next_last_level_gfn == iter->yielded_gfn)
+	if (iter->next_last_level_gfn >= iter->yielded_gfn &&
+	    iter->next_last_level_gfn < iter->yielded_gfn + KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
 		return false;
 
 	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {

This can reduce kvm_flush_remote_tlbs() calls a lot within one kvm_unmap_gfn_range() in the KVM x86 TDP MMU.
On Mon, Aug 14, 2023, Yan Zhao wrote: > On Fri, Aug 11, 2023 at 10:14:45AM -0700, Sean Christopherson wrote: > > > > > After this series, kvm_unmap_gfn_range() is called for once if the 2M is > > > > > mapped to a huge page in primary MMU, and called for at most 512 times > > > > > if mapped to 4K pages in primary MMU. > > > > > > > > > > > > > > > Though kvm_unmap_gfn_range() is only called once before this series, > > > > > as the range is blockable, when there're contentions, remote tlb IPIs > > > > > can be sent page by page in 4K granularity (in tdp_mmu_iter_cond_resched()) > > > > I do not know much about x86, does this happen always or only need reschedule > > > Ah, sorry, I missed platforms other than x86. > > > Maybe there will be a big difference in other platforms. > > > > > > > from code? so that there will be many times of tlb IPIs in only once function > > > Only when MMU lock is contended. But it's not seldom because of the contention in > > > TDP page fault. > > > > No? I don't see how mmu_lock contention would affect the number of calls to > > kvm_flush_remote_tlbs(). If vCPUs are running and not generating faults, i.e. > > not trying to access the range in question, then ever zap will generate a remote > > TLB flush and thus send IPIs to all running vCPUs. > In tdp_mmu_zap_leafs() for kvm_unmap_gfn_range(), the flow is like this: > > 1. Check necessity of mmu_lock reschedule Ah, you're running a preemptible kernel. > 1.a -- if yes, > 1.a.1 do kvm_flush_remote_tlbs() if flush is true. > 1.a.2 reschedule of mmu_lock > 1.a.3 goto 2. > 1.b -- if no, goto 2 > 2. Zap present leaf entry, and set flush to be true > 3. Get next gfn and go to to 1 > > With a wide range to .invalidate_range_start()/end(), it's easy to find > rwlock_needbreak(&kvm->mmu_lock) to be true (goes to 1.a and 1.a.1) > And in tdp_mmu_zap_leafs(), before rescheduling of mmu_lock, tlb flush > request is performed. 
(1.a.1) > > > Take a real case for example, > NUMA balancing requests KVM to zap GFN range from 0xb4000 to 0xb9800. > Then when KVM zaps GFN 0xb807b, it will finds mmu_lock needs break > because vCPU is faulting GFN 0xb804c. > And vCPUs fill constantly retry faulting 0xb804c for 3298 times until > .invalidate_range_end(). > Then for this zap of GFN range from 0xb4000 - 0xb9800, > vCPUs retry fault of > GFN 0xb804c for 3298 times, > GFN 0xb8074 for 3161 times, > GFN 0xb81ce for 15190 times, > and the accesses of the 3 GFNs cause 3209 times of kvm_flush_remote_tlbs() > for one kvm_unmap_gfn_range() because flush requests are not batched. > (this range is mapped in both 2M and 4K in secondary MMU). > > Without the contending of mmu_lock, tdp_mmu_zap_leafs() just combines > all flush requests and leaves 1 kvm_flush_remote_tlbs() to be called > after it returns. > > > In my test scenario, which is VM boot-up, this kind of contention is > frequent. Hmm, I think your test scenario is a bit of an outlier. Running a preemptible kernel on a multi-socket system (presumably your system is multi-socket if you have NUMA) is a bit odd. Not that that invalidates the test result, but I would be quite surprised if there are production use cases for NUMA+KVM+preemptible. Though maybe I'm underestimating how much KVM is used on workstations? > Here's the 10 times data for a VM boot-up collected previously. > | count of | count of | > | kvm_unmap_gfn_range | kvm_flush_remote_tlbs() | > -------|---------------------|-------------------------| > 1 | 38 | 14 | > 2 | 28 | 5191 | > 3 | 36 | 13398 | > 4 | 44 | 43920 | > 5 | 28 | 14 | > 6 | 36 | 10803 | > 7 | 38 | 892 | > 8 | 32 | 5905 | > 9 | 32 | 13 | > 10 | 38 | 6096 | > -------|---------------------|-------------------------| > average| 35 | 8625 | > > > I wonder if we could loose the frequency to check for rescheduling in > tdp_mmu_iter_cond_resched() if the zap range is wide, e.g. 
> > if (iter->next_last_level_gfn == > iter->yielded_gfn + KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)) > return false; Hrm, no. We'd want to repeat that logic for the shadow MMU, and it could regress other scenarios, e.g. if a vCPU is trying to fault-in an address that isn't part of the invalidation, then yielding asap is desirable. Unless I'm missing something, the easiest solution is to check for an invalidation *before* acquiring mmu_lock. The check would be prone to false negatives and false positives, so KVM would need to recheck after acquiring mmu_lock, but the check itself is super cheap relative to the overall cost of the page fault. Compile tested patch at the bottom. I'll test and post formally (probably with similar treatment for other architectures). > > > > call about kvm_unmap_gfn_range. > > > > > > > > > if the pages are mapped in 4K in secondary MMU. > > > > > > > > > > With this series, on the other hand, .numa_protect() sets range to be > > > > > unblockable. So there could be less remote tlb IPIs when a 2M range is > > > > > mapped into small PTEs in secondary MMU. > > > > > Besides, .numa_protect() is not sent for all pages in a given 2M range. > > > > No, .numa_protect() is not sent for all pages. It depends on the workload, > > > > whether the page is accessed for different cpu threads cross-nodes. > > > The .numa_protect() is called in patch 4 only when PROT_NONE is set to > > > the page. > > > > I'm strongly opposed to adding MMU_NOTIFIER_RANGE_NUMA. It's too much of a one-off, > > and losing the batching of invalidations makes me nervous. As Bibo points out, > > the behavior will vary based on the workload, VM configuration, etc. > > > > There's also a *very* subtle change, in that the notification will be sent while > > holding the PMD/PTE lock. Taking KVM's mmu_lock under that is *probably* ok, but > > I'm not exactly 100% confident on that. And the only reason there isn't a more > MMU lock is a rwlock, which is a variant of spinlock. 
> So, I think taking it within PMD/PTE lock is ok. > Actually we have already done that during the .change_pte() notification, where > > kvm_mmu_notifier_change_pte() takes KVM mmu_lock for write, > while PTE lock is held while sending set_pte_at_notify() --> .change_pte() handlers .change_pte() gets away with running under PMD/PTE lock because (a) it's not allowed to fail and (b) KVM is the only secondary MMU that hooks .change_pte() and KVM doesn't use a sleepable lock. As Jason pointed out, mmu_notifier_invalidate_range_start_nonblock() can fail because some secondary MMUs use mutexes or r/w semaphores to handle mmu_notifier events. For NUMA balancing, canceling the protection change might be ok, but for everything else, failure is not an option. So unfortunately, my idea won't work as-is. We might get away with deferring just the change_prot_numa() case, but I don't think that's worth the mess/complexity. I would much rather tell userspace to disable migrate-on-fault for KVM guest mappings (mbind() should work?) for these types of setups, or to disable NUMA balancing entirely. If the user really cares about performance of their VM(s), vCPUs should be affined to a single NUMA node (or core, or pinned 1:1), and if the VM spans multiple nodes, a virtual NUMA topology should be given to the guest. At that point, NUMA balancing is likely to do more harm than good. > > obvious bug is because kvm_handle_hva_range() sets may_block to false, e.g. KVM > > won't yield if there's mmu_lock contention. > Yes, KVM won't yield and reschedule of KVM mmu_lock, so it's fine. > > > However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then > > I think we can achieve what you want without losing batching, and without changing > > secondary MMUs. > > > > Rather than muck with the notification types and add a one-off for NUMA, just > > defer the notification until a present PMD/PTE is actually going to be modified. 
> > It's not the prettiest, but other than the locking, I don't think has any chance > > of regressing other workloads/configurations. > > > > Note, I'm assuming secondary MMUs aren't allowed to map swap entries... > > > > Compile tested only. > > I don't find a matching end to each > mmu_notifier_invalidate_range_start_nonblock(). It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range(): if (range.start) mmu_notifier_invalidate_range_end(&range); -- From: Sean Christopherson <seanjc@google.com> Date: Mon, 14 Aug 2023 08:59:12 -0700 Subject: [PATCH] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing Retry page faults without acquiring mmu_lock if the resolve hva is covered by an active invalidation. Contending for mmu_lock is especially problematic on preemptible kernels as the invalidation task will yield the lock (see rwlock_needbreak()), delay the in-progress invalidation, and ultimately increase the latency of resolving the page fault. And in the worst case scenario, yielding will be accompanied by a remote TLB flush, e.g. if the invalidation covers a large range of memory and vCPUs are accessing addresses that were already zapped. Alternatively, the yielding issue could be mitigated by teaching KVM's MMU iterators to perform more work before yielding, but that wouldn't solve the lock contention and would negatively affect scenarios where a vCPU is trying to fault in an address that is NOT covered by the in-progress invalidation. 
Reported-by: Yan Zhao <yan.y.zhao@intel.com>
Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c   | 3 +++
 include/linux/kvm_host.h | 8 +++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9e4cd8b4a202..f29718a16211 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4345,6 +4345,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	if (unlikely(!fault->slot))
 		return kvm_handle_noslot_fault(vcpu, fault, access);
 
+	if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva))
+		return RET_PF_RETRY;
+
 	return RET_PF_CONTINUE;
 }
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index cb86108c624d..f41d4fe61a06 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1960,7 +1960,6 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
 					   unsigned long mmu_seq,
 					   unsigned long hva)
 {
-	lockdep_assert_held(&kvm->mmu_lock);
 	/*
 	 * If mmu_invalidate_in_progress is non-zero, then the range maintained
 	 * by kvm_mmu_notifier_invalidate_range_start contains all addresses
@@ -1971,6 +1970,13 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
 	    hva >= kvm->mmu_invalidate_range_start &&
 	    hva < kvm->mmu_invalidate_range_end)
 		return 1;
+
+	/*
+	 * Note the lack of a memory barrier!  The caller *must* hold mmu_lock
+	 * to avoid false negatives and/or false positives (less concerning).
+	 * Holding mmu_lock is not mandatory though, e.g. to allow pre-checking
+	 * for an in-progress invalidation to avoiding contending mmu_lock.
+	 */
 	if (kvm->mmu_invalidate_seq != mmu_seq)
 		return 1;
 	return 0;

base-commit: 5bc7f472423f95a3f5c73b0b56c47e57d8833efc
--
On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote: > > > I'm strongly opposed to adding MMU_NOTIFIER_RANGE_NUMA. It's too much of a one-off, > > > and losing the batching of invalidations makes me nervous. As Bibo points out, > > > the behavior will vary based on the workload, VM configuration, etc. > > > > > > There's also a *very* subtle change, in that the notification will be sent while > > > holding the PMD/PTE lock. Taking KVM's mmu_lock under that is *probably* ok, but > > > I'm not exactly 100% confident on that. And the only reason there isn't a more > > MMU lock is a rwlock, which is a variant of spinlock. > > So, I think taking it within PMD/PTE lock is ok. > > Actually we have already done that during the .change_pte() notification, where > > > > kvm_mmu_notifier_change_pte() takes KVM mmu_lock for write, > > while PTE lock is held while sending set_pte_at_notify() --> .change_pte() handlers > > .change_pte() gets away with running under PMD/PTE lock because (a) it's not allowed > to fail and (b) KVM is the only secondary MMU that hooks .change_pte() and KVM > doesn't use a sleepable lock. .numa_protect() in patch 4 is also sent when it's not allowed to fail and there's no sleepable lock in KVM's handler :) > As Jason pointed out, mmu_notifier_invalidate_range_start_nonblock() can fail > because some secondary MMUs use mutexes or r/w semaphores to handle mmu_notifier > events. For NUMA balancing, canceling the protection change might be ok, but for > everything else, failure is not an option. So unfortunately, my idea won't work > as-is. > > We might get away with deferring just the change_prot_numa() case, but I don't > think that's worth the mess/complexity. Yes, I also think deferring mmu_notifier_invalidate_range_start() is not working. One possible approach is to send successful .numa_protect() in batch. But I admit it will introduce complexity. 
> > I would much rather tell userspace to disable migrate-on-fault for KVM guest > mappings (mbind() should work?) for these types of setups, or to disable NUMA > balancing entirely. If the user really cares about performance of their VM(s), > vCPUs should be affined to a single NUMA node (or core, or pinned 1:1), and if > the VM spans multiple nodes, a virtual NUMA topology should be given to the guest. > At that point, NUMA balancing is likely to do more harm than good. > > > > obvious bug is because kvm_handle_hva_range() sets may_block to false, e.g. KVM > > > won't yield if there's mmu_lock contention. > > Yes, KVM won't yield and reschedule of KVM mmu_lock, so it's fine. > > > > > However, *if* it's ok to invoke MMU notifiers while holding PMD/PTE locks, then > > > I think we can achieve what you want without losing batching, and without changing > > > secondary MMUs. > > > > > > Rather than muck with the notification types and add a one-off for NUMA, just > > > defer the notification until a present PMD/PTE is actually going to be modified. > > > It's not the prettiest, but other than the locking, I don't think has any chance > > > of regressing other workloads/configurations. > > > > > > Note, I'm assuming secondary MMUs aren't allowed to map swap entries... > > > > > > Compile tested only. > > > > I don't find a matching end to each > > mmu_notifier_invalidate_range_start_nonblock(). > > It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range(): > > if (range.start) > mmu_notifier_invalidate_range_end(&range); No, It doesn't work for mmu_notifier_invalidate_range_start() sent in change_pte_range(), if we only want the range to include pages successfully set to PROT_NONE. 
> -- > From: Sean Christopherson <seanjc@google.com> > Date: Mon, 14 Aug 2023 08:59:12 -0700 > Subject: [PATCH] KVM: x86/mmu: Retry fault before acquiring mmu_lock if > mapping is changing > > Retry page faults without acquiring mmu_lock if the resolve hva is covered > by an active invalidation. Contending for mmu_lock is especially > problematic on preemptible kernels as the invalidation task will yield the > lock (see rwlock_needbreak()), delay the in-progress invalidation, and > ultimately increase the latency of resolving the page fault. And in the > worst case scenario, yielding will be accompanied by a remote TLB flush, > e.g. if the invalidation covers a large range of memory and vCPUs are > accessing addresses that were already zapped. > > Alternatively, the yielding issue could be mitigated by teaching KVM's MMU > iterators to perform more work before yielding, but that wouldn't solve > the lock contention and would negatively affect scenarios where a vCPU is > trying to fault in an address that is NOT covered by the in-progress > invalidation. > > Reported-by: Yan Zhao <yan.y.zhao@intel.com> > Closes: https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@yzhao56-desk.sh.intel.com > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > arch/x86/kvm/mmu/mmu.c | 3 +++ > include/linux/kvm_host.h | 8 +++++++- > 2 files changed, 10 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 9e4cd8b4a202..f29718a16211 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4345,6 +4345,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, > if (unlikely(!fault->slot)) > return kvm_handle_noslot_fault(vcpu, fault, access); > > + if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva)) > + return RET_PF_RETRY; > + This can effectively reduce the remote flush IPIs a lot! 
One Nit is that, maybe rmb() or READ_ONCE() is required for kvm->mmu_invalidate_range_start and kvm->mmu_invalidate_range_end. Otherwise, I'm somewhat worried about constant false positive and retry. > return RET_PF_CONTINUE; > } > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index cb86108c624d..f41d4fe61a06 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -1960,7 +1960,6 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm, > unsigned long mmu_seq, > unsigned long hva) > { > - lockdep_assert_held(&kvm->mmu_lock); > /* > * If mmu_invalidate_in_progress is non-zero, then the range maintained > * by kvm_mmu_notifier_invalidate_range_start contains all addresses > @@ -1971,6 +1970,13 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm, > hva >= kvm->mmu_invalidate_range_start && > hva < kvm->mmu_invalidate_range_end) > return 1; > + > + /* > + * Note the lack of a memory barrier! The caller *must* hold mmu_lock > + * to avoid false negatives and/or false positives (less concerning). > + * Holding mmu_lock is not mandatory though, e.g. to allow pre-checking > + * for an in-progress invalidation to avoiding contending mmu_lock. > + */ > if (kvm->mmu_invalidate_seq != mmu_seq) > return 1; > return 0; > > base-commit: 5bc7f472423f95a3f5c73b0b56c47e57d8833efc > -- >
On Tue, Aug 15, 2023, Yan Zhao wrote: > On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote: > > > > Note, I'm assuming secondary MMUs aren't allowed to map swap entries... > > > > > > > > Compile tested only. > > > > > > I don't find a matching end to each > > > mmu_notifier_invalidate_range_start_nonblock(). > > > > It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range(): > > > > if (range.start) > > mmu_notifier_invalidate_range_end(&range); > No, It doesn't work for mmu_notifier_invalidate_range_start() sent in change_pte_range(), > if we only want the range to include pages successfully set to PROT_NONE. Precise invalidation was a non-goal for my hack-a-patch. The intent was purely to defer invalidation until it was actually needed, but still perform only a single notification so as to batch the TLB flushes, e.g. the start() call still used the original @end. The idea was to play nice with the scenario where nothing in a VMA could be migrated. It was complete untested though, so it may not have actually done anything to reduce the number of pointless invalidations. > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > > index 9e4cd8b4a202..f29718a16211 100644 > > --- a/arch/x86/kvm/mmu/mmu.c > > +++ b/arch/x86/kvm/mmu/mmu.c > > @@ -4345,6 +4345,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, > > if (unlikely(!fault->slot)) > > return kvm_handle_noslot_fault(vcpu, fault, access); > > > > + if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva)) > > + return RET_PF_RETRY; > > + > This can effectively reduce the remote flush IPIs a lot! > One Nit is that, maybe rmb() or READ_ONCE() is required for kvm->mmu_invalidate_range_start > and kvm->mmu_invalidate_range_end. > Otherwise, I'm somewhat worried about constant false positive and retry. If anything, this needs a READ_ONCE() on mmu_invalidate_in_progress. 
The ranges aren't touched when mmu_invalidate_in_progress goes to zero, so ensuring they are reloaded wouldn't do anything. The key to making forward progress is seeing that there is no in-progress invalidation. I did consider adding said READ_ONCE(), but practically speaking, constant false positives are impossible. KVM will re-enter the guest when retrying, and there is zero chance of the compiler avoiding reloads across VM-Enter+VM-Exit. I suppose in theory we might someday differentiate between "retry because a different vCPU may have fixed the fault" and "retry because there's an in-progress invalidation", and not bother re-entering the guest for the latter, e.g. have it try to yield instead. All that said, READ_ONCE() on mmu_invalidate_in_progress should effectively be a nop, so it wouldn't hurt to be paranoid in this case. Hmm, at that point, it probably makes sense to add a READ_ONCE() for mmu_invalidate_seq too, e.g. so that a sufficiently clever compiler doesn't completely optimize away the check.
On 2023/8/15 22:50, Sean Christopherson wrote: > On Tue, Aug 15, 2023, Yan Zhao wrote: >> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote: >>>>> Note, I'm assuming secondary MMUs aren't allowed to map swap entries... >>>>> >>>>> Compile tested only. >>>> >>>> I don't find a matching end to each >>>> mmu_notifier_invalidate_range_start_nonblock(). >>> >>> It pairs with existing call to mmu_notifier_invalidate_range_end() in change_pmd_range(): >>> >>> if (range.start) >>> mmu_notifier_invalidate_range_end(&range); >> No, It doesn't work for mmu_notifier_invalidate_range_start() sent in change_pte_range(), >> if we only want the range to include pages successfully set to PROT_NONE. > > Precise invalidation was a non-goal for my hack-a-patch. The intent was purely > to defer invalidation until it was actually needed, but still perform only a > single notification so as to batch the TLB flushes, e.g. the start() call still > used the original @end. > > The idea was to play nice with the scenario where nothing in a VMA could be migrated. > It was complete untested though, so it may not have actually done anything to reduce > the number of pointless invalidations. For the NUMA-balance scenario, can the original page still be used by the application even if its PTE is changed to PROT_NONE? If it can be used, maybe we can zap the shadow MMU and flush the TLB in the mmu_notifier_invalidate_range_end() notification with a precise range; that range can be the intersection of the mmu_gather range and the mmu_notifier_range.
Regards Bibo Mao > >>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c >>> index 9e4cd8b4a202..f29718a16211 100644 >>> --- a/arch/x86/kvm/mmu/mmu.c >>> +++ b/arch/x86/kvm/mmu/mmu.c >>> @@ -4345,6 +4345,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, >>> if (unlikely(!fault->slot)) >>> return kvm_handle_noslot_fault(vcpu, fault, access); >>> >>> + if (mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva)) >>> + return RET_PF_RETRY; >>> + >> This can effectively reduce the remote flush IPIs a lot! >> One Nit is that, maybe rmb() or READ_ONCE() is required for kvm->mmu_invalidate_range_start >> and kvm->mmu_invalidate_range_end. >> Otherwise, I'm somewhat worried about constant false positive and retry. > > If anything, this needs a READ_ONCE() on mmu_invalidate_in_progress. The ranges > aren't touched when when mmu_invalidate_in_progress goes to zero, so ensuring they > are reloaded wouldn't do anything. The key to making forward progress is seeing > that there is no in-progress invalidation. > > I did consider adding said READ_ONCE(), but practically speaking, constant false > positives are impossible. KVM will re-enter the guest when retrying, and there > is zero chance of the compiler avoiding reloads across VM-Enter+VM-Exit. > > I suppose in theory we might someday differentiate between "retry because a different > vCPU may have fixed the fault" and "retry because there's an in-progress invalidation", > and not bother re-entering the guest for the latter, e.g. have it try to yield > instead. > > All that said, READ_ONCE() on mmu_invalidate_in_progress should effectively be a > nop, so it wouldn't hurt to be paranoid in this case. > > Hmm, at that point, it probably makes sense to add a READ_ONCE() for mmu_invalidate_seq > too, e.g. so that a sufficiently clever compiler doesn't completely optimize away > the check. 
Losing the check wouldn't be problematic (false negatives are fine, > especially on that particular check), but the generated code would *look* buggy.
On 2023/8/16 10:43, bibo mao wrote:
>
>
> On 2023/8/15 22:50, Sean Christopherson wrote:
>> On Tue, Aug 15, 2023, Yan Zhao wrote:
>>> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
>>>>>> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
>>>>>>
>>>>>> Compile tested only.
>>>>>
>>>>> I don't find a matching end to each
>>>>> mmu_notifier_invalidate_range_start_nonblock().
>>>>
>>>> It pairs with the existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
>>>>
>>>> 	if (range.start)
>>>> 		mmu_notifier_invalidate_range_end(&range);
>>> No, it doesn't work for the mmu_notifier_invalidate_range_start() sent in change_pte_range(),
>>> if we only want the range to include pages successfully set to PROT_NONE.
>>
>> Precise invalidation was a non-goal for my hack-a-patch.  The intent was purely
>> to defer invalidation until it was actually needed, but still perform only a
>> single notification so as to batch the TLB flushes, e.g. the start() call still
>> used the original @end.
>>
>> The idea was to play nice with the scenario where nothing in a VMA could be migrated.
>> It was completely untested though, so it may not have actually done anything to reduce
>> the number of pointless invalidations.
> For the NUMA-balance scenario, can the original page still be used by the application even
> if its PTE has been changed to PROT_NONE?  If it can, maybe we can zap the shadow MMU and
> flush the TLB
Since there is a kvm_mmu_notifier_change_pte() notification when a NUMA page is replaced
with a new page, what I mean is: can the original page still be used by the application
after its PTE is changed to PROT_NONE and before it is replaced with the new page?

And for the primary MMU, the TLB is flushed after the PTE is changed to PROT_NONE and
after the mmu_notifier_invalidate_range_end() notification for the secondary MMU.

Regards
Bibo Mao
> in the mmu_notifier_invalidate_range_end() notification with a precise range; that range
> can be the intersection of the mmu_gather range and the mmu_notifier_range.
On Wed, Aug 16, 2023 at 11:44:29AM +0800, bibo mao wrote:
>
>
> On 2023/8/16 10:43, bibo mao wrote:
> >
> >
> > On 2023/8/15 22:50, Sean Christopherson wrote:
> >> On Tue, Aug 15, 2023, Yan Zhao wrote:
> >>> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
> >>>>>> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
> >>>>>>
> >>>>>> Compile tested only.
> >>>>>
> >>>>> I don't find a matching end to each
> >>>>> mmu_notifier_invalidate_range_start_nonblock().
> >>>>
> >>>> It pairs with the existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
> >>>>
> >>>> 	if (range.start)
> >>>> 		mmu_notifier_invalidate_range_end(&range);
> >>> No, it doesn't work for the mmu_notifier_invalidate_range_start() sent in change_pte_range(),
> >>> if we only want the range to include pages successfully set to PROT_NONE.
> >>
> >> Precise invalidation was a non-goal for my hack-a-patch.  The intent was purely
> >> to defer invalidation until it was actually needed, but still perform only a
> >> single notification so as to batch the TLB flushes, e.g. the start() call still
> >> used the original @end.
> >>
> >> The idea was to play nice with the scenario where nothing in a VMA could be migrated.
> >> It was completely untested though, so it may not have actually done anything to reduce
> >> the number of pointless invalidations.
> > For the NUMA-balance scenario, can the original page still be used by the application even
> > if its PTE has been changed to PROT_NONE?  If it can, maybe we can zap the shadow MMU and
> > flush the TLB
For GUPs that do not honor FOLL_HONOR_NUMA_FAULT, yes.

See https://lore.kernel.org/all/20230803143208.383663-1-david@redhat.com/

> Since there is a kvm_mmu_notifier_change_pte() notification when a NUMA page is replaced
> with a new page, what I mean is: can the original page still be used by the application
> after its PTE is changed to PROT_NONE and before it is replaced with the new page?
It's not the .change_pte() notification; that one is sent on COW.
The do_numa_page()/do_huge_pmd_numa_page() path will try to unmap the old page protected
with PROT_NONE, and if every check passes, a separate .invalidate_range_start()/end()
with event type MMU_NOTIFY_CLEAR will be sent.

So I think KVM (though it honors FOLL_HONOR_NUMA_FAULT) can safely keep mapping maybe-DMA
pages until MMU_NOTIFY_CLEAR is sent.
(This approach is implemented in RFC v1:
https://lore.kernel.org/all/20230810085636.25914-1-yan.y.zhao@intel.com/)

>
> And for the primary MMU, the TLB is flushed after the PTE is changed to PROT_NONE and
> after the mmu_notifier_invalidate_range_end() notification for the secondary MMU.
> Regards
> Bibo Mao
>
> >> in the mmu_notifier_invalidate_range_end() notification with a precise range; that range
But I don't think flushing the TLB only in the .invalidate_range_end() of the secondary
MMU is a good idea.
The flush must be done before kvm->mmu_lock is unlocked; otherwise confusion will be
caused when multiple threads try to update the secondary MMU.

> >> can be the intersection of the mmu_gather range and the mmu_notifier_range.
On Wed, Aug 16, 2023 at 03:29:22PM +0800, bibo mao wrote:
> > Flush must be done before kvm->mmu_lock is unlocked; otherwise,
> > confusion will be caused when multiple threads try to update the
> > secondary MMU.
> Since the TLB flush is delayed until after all PTE entries are cleared, and currently
> there is no ranged TLB flush supported for the secondary MMU, I don't know why there
> is confusion before or after kvm->mmu_lock.

Oh, do you mean only doing kvm_unmap_gfn_range() in .invalidate_range_end()?
Then checking whether PROT_NONE is set in the primary MMU before unmapping?
Looks like a good idea; I need to check if it's feasible.
Thanks!
On 2023/8/16 13:14, Yan Zhao wrote:
> On Wed, Aug 16, 2023 at 11:44:29AM +0800, bibo mao wrote:
>>
>>
>> On 2023/8/16 10:43, bibo mao wrote:
>>>
>>>
>>> On 2023/8/15 22:50, Sean Christopherson wrote:
>>>> On Tue, Aug 15, 2023, Yan Zhao wrote:
>>>>> On Mon, Aug 14, 2023 at 09:40:44AM -0700, Sean Christopherson wrote:
>>>>>>>> Note, I'm assuming secondary MMUs aren't allowed to map swap entries...
>>>>>>>>
>>>>>>>> Compile tested only.
>>>>>>>
>>>>>>> I don't find a matching end to each
>>>>>>> mmu_notifier_invalidate_range_start_nonblock().
>>>>>>
>>>>>> It pairs with the existing call to mmu_notifier_invalidate_range_end() in change_pmd_range():
>>>>>>
>>>>>> 	if (range.start)
>>>>>> 		mmu_notifier_invalidate_range_end(&range);
>>>>> No, it doesn't work for the mmu_notifier_invalidate_range_start() sent in change_pte_range(),
>>>>> if we only want the range to include pages successfully set to PROT_NONE.
>>>>
>>>> Precise invalidation was a non-goal for my hack-a-patch.  The intent was purely
>>>> to defer invalidation until it was actually needed, but still perform only a
>>>> single notification so as to batch the TLB flushes, e.g. the start() call still
>>>> used the original @end.
>>>>
>>>> The idea was to play nice with the scenario where nothing in a VMA could be migrated.
>>>> It was completely untested though, so it may not have actually done anything to reduce
>>>> the number of pointless invalidations.
>>> For the NUMA-balance scenario, can the original page still be used by the application even
>>> if its PTE has been changed to PROT_NONE?  If it can, maybe we can zap the shadow MMU and
>>> flush the TLB
> For GUPs that do not honor FOLL_HONOR_NUMA_FAULT, yes.
>
> See https://lore.kernel.org/all/20230803143208.383663-1-david@redhat.com/
>
>> Since there is a kvm_mmu_notifier_change_pte() notification when a NUMA page is replaced
>> with a new page, what I mean is: can the original page still be used by the application
>> after its PTE is changed to PROT_NONE and before it is replaced with the new page?
> It's not the .change_pte() notification; that one is sent on COW.
> The do_numa_page()/do_huge_pmd_numa_page() path will try to unmap the old page protected
> with PROT_NONE, and if every check passes, a separate .invalidate_range_start()/end()
> with event type MMU_NOTIFY_CLEAR will be sent.
Yes, you are right.  The change_pte() notification is called from migrate_vma_pages();
I confused it with NUMA page migration.  However, an invalidate_range_start()/end() with
event type MMU_NOTIFY_CLEAR will also be sent when the new page is installed.

>
> So I think KVM (though it honors FOLL_HONOR_NUMA_FAULT) can safely keep mapping maybe-DMA
> pages until MMU_NOTIFY_CLEAR is sent.
> (This approach is implemented in RFC v1:
> https://lore.kernel.org/all/20230810085636.25914-1-yan.y.zhao@intel.com/)
>
>>
>> And for the primary MMU, the TLB is flushed after the PTE is changed to PROT_NONE and
>> after the mmu_notifier_invalidate_range_end() notification for the secondary MMU.
>> Regards
>> Bibo Mao
>
>>>> in the mmu_notifier_invalidate_range_end() notification with a precise range; that range
> But I don't think flushing the TLB only in the .invalidate_range_end() of the secondary
> MMU is a good idea.
I have no better idea, and it's beyond my ability to modify the KVM framework now :(

> The flush must be done before kvm->mmu_lock is unlocked; otherwise confusion will be
> caused when multiple threads try to update the secondary MMU.
Since the TLB flush is delayed until after all PTE entries are cleared, and currently
there is no ranged TLB flush supported for the secondary MMU, I don't know why there
is confusion before or after kvm->mmu_lock.

Regards
Bibo Mao

>
>>>> can be the intersection of the mmu_gather range and the mmu_notifier_range.
On 2023/8/16 15:18, Yan Zhao wrote:
> On Wed, Aug 16, 2023 at 03:29:22PM +0800, bibo mao wrote:
>>> Flush must be done before kvm->mmu_lock is unlocked; otherwise,
>>> confusion will be caused when multiple threads try to update the
>>> secondary MMU.
>> Since the TLB flush is delayed until after all PTE entries are cleared, and currently
>> there is no ranged TLB flush supported for the secondary MMU, I don't know why there
>> is confusion before or after kvm->mmu_lock.
>
> Oh, do you mean only doing kvm_unmap_gfn_range() in .invalidate_range_end()?
Yes, it is just a sketchy thought for the NUMA-balance scenario:
do kvm_unmap_gfn_range() in invalidate_range_end() rather than in
invalidate_range_start().

> Then checking whether PROT_NONE is set in the primary MMU before unmapping?
> Looks like a good idea; I need to check if it's feasible.
> Thanks!
On Wed, Aug 16, 2023, bibo mao wrote:
>
>
> On 2023/8/16 15:18, Yan Zhao wrote:
> > On Wed, Aug 16, 2023 at 03:29:22PM +0800, bibo mao wrote:
> >>> Flush must be done before kvm->mmu_lock is unlocked; otherwise,
> >>> confusion will be caused when multiple threads try to update the
> >>> secondary MMU.
> >> Since the TLB flush is delayed until after all PTE entries are cleared, and currently
> >> there is no ranged TLB flush supported for the secondary MMU, I don't know why there
> >> is confusion before or after kvm->mmu_lock.
> >
> > Oh, do you mean only doing kvm_unmap_gfn_range() in .invalidate_range_end()?
> Yes, it is just a sketchy thought for the NUMA-balance scenario:
> do kvm_unmap_gfn_range() in invalidate_range_end() rather than in
> invalidate_range_start().

That is not an option; it's a direct violation of the mmu_notifier contract.  Secondary
MMUs must drop all references before returning from invalidate_range_start().
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dfbaafbe3a00..907444a1761b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -711,6 +711,20 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_change_spte_gfn);
 }
 
+static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
+					  struct mm_struct *mm,
+					  unsigned long start,
+					  unsigned long end)
+{
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+
+	WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count));
+	if (!READ_ONCE(kvm->mmu_invalidate_in_progress))
+		return;
+
+	kvm_handle_hva_range(mn, start, end, __pte(0), kvm_unmap_gfn_range);
+}
+
 void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
 			      unsigned long end)
 {
@@ -744,14 +758,18 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 					const struct mmu_notifier_range *range)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	bool is_numa = (range->event == MMU_NOTIFY_PROTECTION_VMA) &&
+		       (range->flags & MMU_NOTIFIER_RANGE_NUMA);
 	const struct kvm_hva_range hva_range = {
 		.start		= range->start,
 		.end		= range->end,
 		.pte		= __pte(0),
-		.handler	= kvm_unmap_gfn_range,
+		.handler	= !is_numa ? kvm_unmap_gfn_range :
+				  (void *)kvm_null_fn,
 		.on_lock	= kvm_mmu_invalidate_begin,
-		.on_unlock	= kvm_arch_guest_memory_reclaimed,
-		.flush_on_ret	= true,
+		.on_unlock	= !is_numa ? kvm_arch_guest_memory_reclaimed :
+				  (void *)kvm_null_fn,
+		.flush_on_ret	= !is_numa ? true : false,
 		.may_block	= mmu_notifier_range_blockable(range),
 	};
 
@@ -899,6 +917,7 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
 	.clear_young		= kvm_mmu_notifier_clear_young,
 	.test_young		= kvm_mmu_notifier_test_young,
 	.change_pte		= kvm_mmu_notifier_change_pte,
+	.numa_protect		= kvm_mmu_notifier_numa_protect,
 	.release		= kvm_mmu_notifier_release,
 };
Register the .numa_protect() callback in the mmu notifier so that KVM can get
accurate information about when a page is PROT_NONE protected in the primary
MMU and unmap it in the secondary MMU accordingly.

In KVM's .invalidate_range_start() handler, if the event is to notify that
the range may be protected to PROT_NONE for NUMA migration purposes,
don't do the unmapping in the secondary MMU.  Hold off until .numa_protect()
comes.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 virt/kvm/kvm_main.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)