Message ID | 20220801151928.270380-1-vipinsh@google.com (mailing list archive)
State      | New, archived
Series     | KVM: x86/mmu: Make page tables for eager page splitting NUMA aware
On Mon, Aug 01, 2022 at 08:19:28AM -0700, Vipin Sharma wrote:

In the shortlog, clarify that this only applies to the TDP MMU.

> tdp_mmu_alloc_sp_for_split() allocates page tables for Eager Page
> Splitting. Currently it does not specify a NUMA node preference, so it
> will try to allocate from the local node. The thread doing eager page

nit: Instead of "try to allocate from the local node", say something like
"allocate using the current thread's mempolicy, which is most likely going
to be MPOL_LOCAL, i.e. allocate from the local node."

> splitting is supplied by userspace and may not be running on the same
> node where it would be best for page tables to be allocated.

Suggest comparing eager page splitting to vCPU faults, which is the other
place that TDP page tables are allocated. e.g.

  Note that TDP page tables allocated during fault handling are less
  likely to suffer from the same NUMA locality issue because they will be
  allocated on the same node as the vCPU thread (assuming its mempolicy is
  MPOL_LOCAL), which is often the right node.

That being said, KVM currently has a gap where a guest doing a lot of
remote memory accesses when touching memory for the first time will cause
KVM to allocate the TDP page tables on the arguably wrong node.

> We can improve TDP MMU eager page splitting by making
> tdp_mmu_alloc_sp_for_split() NUMA-aware. Specifically, when splitting a
> huge page, allocate the new lower level page tables on the same node as
> the huge page.
>
> __get_free_page() is replaced by alloc_pages_node(). This introduces two
> functional changes.
>
> 1. __get_free_page() removes gfp flag __GFP_HIGHMEM via its call to
>    __get_free_pages(). This should not be an issue as the __GFP_HIGHMEM
>    flag is not passed in tdp_mmu_alloc_sp_for_split() anyway.
>
> 2. __get_free_page() calls alloc_pages() and uses the thread's mempolicy
>    for the NUMA node allocation. From this commit, the thread's mempolicy
>    will not be used and the first preference will be to allocate on the
>    node where the huge page was present.

It would be worth noting that userspace could change the mempolicy of the
thread doing eager splitting to prefer allocating from the target NUMA
node, as an alternative approach.

I don't prefer the alternative though since it bleeds details from KVM
into userspace, such as the fact that enabling dirty logging does eager
page splitting, which allocates page tables. It's also unnecessary since
KVM can infer an appropriate NUMA placement without the help of userspace,
and I can't think of a reason for userspace to prefer a different policy.

> dirty_log_perf_test for a 416 vCPU and 1GB/vCPU configuration on an 8
> NUMA node machine showed dirty memory time improvements between 2% - 35%
> in multiple runs.

That's a pretty wide range. What's probably happening is vCPU threads are
being migrated across NUMA nodes during the test. So for example, a vCPU
thread might first run on Node 0 during the Populate memory phase, so the
huge page will be allocated on Node 0 as well. But then if the thread gets
migrated to Node 1 later, it will perform poorly because its memory is on
a remote node.

I would recommend pinning vCPUs to specific NUMA nodes to prevent
cross-node migrations. e.g. For a 416 vCPU VM, pin 52 to Node 0, 52 to
Node 1, etc. That should result in more consistent performance and will be
more in line with how large NUMA VMs are actually configured to run. (A
rough pinning sketch follows the quoted patch below.)
>
> Suggested-by: David Matlack <dmatlack@google.com>
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 24 +++++++++++++++++++-----
>  1 file changed, 19 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index bf2ccf9debcaa..1e30e18fc6a03 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1402,9 +1402,19 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
>  	return spte_set;
>  }
>
> -static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> +/*
> + * Caller's responsibility to pass a valid spte which has the shadow page
> + * present.
> + */
> +static int tdp_mmu_spte_to_nid(u64 spte)

I think this name is ambiguous because it could be getting the node ID of
the SPTE itself, or the node ID of the page the SPTE is pointing to. Maybe
tdp_mmu_spte_pfn_nid()? Or just drop the helper and open code this
calculation in tdp_mmu_alloc_sp_for_split().

> +{
> +	return page_to_nid(pfn_to_page(spte_to_pfn(spte)));
> +}
> +
> +static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(int nid, gfp_t gfp)
>  {
>  	struct kvm_mmu_page *sp;
> +	struct page *spt_page;
>
>  	gfp |= __GFP_ZERO;
>
> @@ -1412,11 +1422,12 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
>  	if (!sp)
>  		return NULL;
>
> -	sp->spt = (void *)__get_free_page(gfp);
> -	if (!sp->spt) {
> +	spt_page = alloc_pages_node(nid, gfp, 0);
> +	if (!spt_page) {
>  		kmem_cache_free(mmu_page_header_cache, sp);
>  		return NULL;
>  	}
> +	sp->spt = page_address(spt_page);
>
>  	return sp;
>  }
> @@ -1426,6 +1437,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
>  					     bool shared)
>  {
>  	struct kvm_mmu_page *sp;
> +	int nid;
> +
> +	nid = tdp_mmu_spte_to_nid(iter->old_spte);
>
>  	/*
>  	 * Since we are allocating while under the MMU lock we have to be
> @@ -1436,7 +1450,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
>  	 * If this allocation fails we drop the lock and retry with reclaim
>  	 * allowed.
>  	 */
> -	sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
> +	sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_NOWAIT | __GFP_ACCOUNT);
>  	if (sp)
>  		return sp;
>
> @@ -1448,7 +1462,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
>  	write_unlock(&kvm->mmu_lock);
>
>  	iter->yielded = true;
> -	sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
> +	sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_KERNEL_ACCOUNT);
>
>  	if (shared)
>  		read_lock(&kvm->mmu_lock);
> --
> 2.37.1.455.g008518b4e5-goog
>
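To make the pinning recommendation above concrete, a dirty_log_perf_test-style harness could pin each vCPU worker thread roughly as follows. This is only a sketch: it assumes libnuma is available (link with -lnuma), and pin_self_to_node() is an illustrative helper, not an existing selftest API.

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin the calling vCPU worker thread to a NUMA node before the test runs. */
static void pin_self_to_node(int vcpu_id)
{
	int nr_nodes, node;

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA not available\n");
		exit(1);
	}

	nr_nodes = numa_num_configured_nodes();
	/* Round-robin assignment; e.g. 416 vCPUs on 8 nodes -> 52 per node. */
	node = vcpu_id % nr_nodes;

	/* Restrict this thread's CPU affinity to the CPUs of 'node'. */
	if (numa_run_on_node(node))
		perror("numa_run_on_node");
}

Each vCPU thread would call this with its vcpu_id before entering the dirty memory phase, preventing cross-node migrations for the duration of the test.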
On Mon, Aug 01, 2022, David Matlack wrote:
> On Mon, Aug 01, 2022 at 08:19:28AM -0700, Vipin Sharma wrote:
> That being said, KVM currently has a gap where a guest doing a lot of
> remote memory accesses when touching memory for the first time will
> cause KVM to allocate the TDP page tables on the arguably wrong node.

Userspace can solve this by setting the NUMA policy on a VMA or
shared-object basis. E.g. create dedicated memslots for each NUMA node,
then bind each of the backing stores to the appropriate host node.

If there is a gap, e.g. a backing store we want to use doesn't properly
support mempolicy for shared mappings, then we should enhance the backing
store.

> > We can improve TDP MMU eager page splitting by making
> > tdp_mmu_alloc_sp_for_split() NUMA-aware. Specifically, when splitting a
> > huge page, allocate the new lower level page tables on the same node as
> > the huge page.
> >
> > __get_free_page() is replaced by alloc_pages_node(). This introduces two
> > functional changes.
> >
> > 1. __get_free_page() removes gfp flag __GFP_HIGHMEM via its call to
> >    __get_free_pages(). This should not be an issue as the __GFP_HIGHMEM
> >    flag is not passed in tdp_mmu_alloc_sp_for_split() anyway.
> >
> > 2. __get_free_page() calls alloc_pages() and uses the thread's mempolicy
> >    for the NUMA node allocation. From this commit, the thread's mempolicy
> >    will not be used and the first preference will be to allocate on the
> >    node where the huge page was present.
>
> It would be worth noting that userspace could change the mempolicy of
> the thread doing eager splitting to prefer allocating from the target
> NUMA node, as an alternative approach.
>
> I don't prefer the alternative though since it bleeds details from KVM
> into userspace, such as the fact that enabling dirty logging does eager
> page splitting, which allocates page tables.

As above, if userspace cares about vNUMA, then it already needs to be
aware of some of the KVM/kernel details. Separate memslots aren't strictly
necessary, e.g. userspace could stitch together contiguous VMAs to create
a single mega-memslot, but that seems like it'd be more work than just
creating separate memslots.

And because eager page splitting for dirty logging runs with mmu_lock
held, userspace might also benefit from per-node memslots as it can do the
splitting on multiple tasks/CPUs.

Regardless of what we do, the behavior needs to be documented, i.e. KVM
details will bleed into userspace. E.g. if KVM is overriding the per-task
NUMA policy, then that should be documented.

> It's also unnecessary since KVM can infer an appropriate NUMA placement
> without the help of userspace, and I can't think of a reason for
> userspace to prefer a different policy.

I can't think of a reason why userspace would want to have a different
policy for the task that's enabling dirty logging, but I also can't think
of a reason why KVM should go out of its way to ignore that policy.

IMO this is a "bug" in dirty_log_perf_test, though it's probably a good
idea to document how to effectively configure vNUMA-aware memslots.
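The per-node memslot layout Sean describes can be built from userspace with a mempolicy call on each slot's backing VMA. The sketch below assumes anonymous backing memory and omits error handling; add_node_bound_memslot() and the slot/GPA layout are illustrative, not an existing API.

#include <numaif.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Create one memslot backed by memory bound to a specific host NUMA node. */
static void add_node_bound_memslot(int vm_fd, unsigned int slot, int node,
				   __u64 gpa, __u64 size)
{
	unsigned long nodemask = 1UL << node;
	void *hva = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct kvm_userspace_memory_region region = {
		.slot = slot,
		.guest_phys_addr = gpa,
		.memory_size = size,
		.userspace_addr = (unsigned long)hva,
	};

	/* Bind the backing store so guest memory for this slot stays on 'node'. */
	mbind(hva, size, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0);

	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}

With one such slot per virtual node, the backing store's policy, rather than the faulting thread's policy, decides where guest memory lands.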
On Mon, Aug 01, 2022 at 11:56:21PM +0000, Sean Christopherson wrote: > On Mon, Aug 01, 2022, David Matlack wrote: > > On Mon, Aug 01, 2022 at 08:19:28AM -0700, Vipin Sharma wrote: > > That being said, KVM currently has a gap where a guest doing a lot of > > remote memory accesses when touching memory for the first time will > > cause KVM to allocate the TDP page tables on the arguably wrong node. > > Userspace can solve this by setting the NUMA policy on a VMA or shared-object > basis. E.g. create dedicated memslots for each NUMA node, then bind each of the > backing stores to the appropriate host node. > > If there is a gap, e.g. a backing store we want to use doesn't properly support > mempolicy for shared mappings, then we should enhance the backing store. Just to clarify: this patch, and my comment here, is about the NUMA locality of KVM's page tables, not guest memory. KVM allocates all page tables through __get_free_page() which uses the current task's mempolicy. This means the only way for userspace to control the page table NUMA locality is to set mempolicies on the threads on which KVM will allocate page tables (vCPUs and any thread doing eager page splitting, i.e. enable dirty logging or KVM_CLEAR_DIRTY_LOG). The ideal setup from a NUMA locality perspective would be that page tables are always on the local node, but that would require KVM maintain multiple copies of page tables (at minimum one per physical NUMA node), which is extra memory and complexity. With the current KVM MMU (one page table hierarchy shared by all vCPUs), the next best setup, I'd argue, is to co-locate the page tables with the memory they are mapping. This setup would ensure that a vCPU accessing memory in its local virtual node would primarily be accessing page tables in the local physical node when doing page walks. Obviously page tables at levels 5, 4, and 3, which likely map memory spanning multiple nodes, could not always be co-located with all the memory they map. My comment here is saying that there is no way for userspace to actually enforce the above policy. There is no mempolicy userspace can set on vCPU threads to ensure that KVM co-locates page tables with the memory they are mapping. All userspace can do is force KVM to allocate page tables on the same node as the vCPU thread, or on a specific node. For eager page splitting, userspace can control what range of memory is going to be split by controlling the memslot layout, so it is possible for userspace to set an appropriate mempolicy on that thread. But if you agree about my point about vCPU threads above, I think it would be a step in the right direction to have KVM start forcibly co-locating page tables with memory for eager page splitting (and we can fix the vCPU case later). > > > > We can improve TDP MMU eager page splitting by making > > > tdp_mmu_alloc_sp_for_split() NUMA-aware. Specifically, when splitting a > > > huge page, allocate the new lower level page tables on the same node as the > > > huge page. > > > > > > __get_free_page() is replaced by alloc_page_nodes(). This introduces two > > > functional changes. > > > > > > 1. __get_free_page() removes gfp flag __GFP_HIGHMEM via call to > > > __get_free_pages(). This should not be an issue as __GFP_HIGHMEM flag is > > > not passed in tdp_mmu_alloc_sp_for_split() anyway. > > > > > > 2. __get_free_page() calls alloc_pages() and use thread's mempolicy for > > > the NUMA node allocation. 
From this commit, thread's mempolicy will not > > > be used and first preference will be to allocate on the node where huge > > > page was present. > > > > It would be worth noting that userspace could change the mempolicy of > > the thread doing eager splitting to prefer allocating from the target > > NUMA node, as an alternative approach. > > > > I don't prefer the alternative though since it bleeds details from KVM > > into userspace, such as the fact that enabling dirty logging does eager > > page splitting, which allocates page tables. > > As above, if userspace is cares about vNUMA, then it already needs to be aware of > some of KVM/kernel details. Separate memslots aren't strictly necessary, e.g. > userspace could stitch together contiguous VMAs to create a single mega-memslot, > but that seems like it'd be more work than just creating separate memslots. > > And because eager page splitting for dirty logging runs with mmu_lock held for, > userspace might also benefit from per-node memslots as it can do the splitting on > multiple tasks/CPUs. > > Regardless of what we do, the behavior needs to be document, i.e. KVM details will > bleed into userspace. E.g. if KVM is overriding the per-task NUMA policy, then > that should be documented. +1 > > > It's also unnecessary since KVM can infer an appropriate NUMA placement > > without the help of userspace, and I can't think of a reason for userspace to > > prefer a different policy. > > I can't think of a reason why userspace would want to have a different policy for > the task that's enabling dirty logging, but I also can't think of a reason why > KVM should go out of its way to ignore that policy. > > IMO this is a "bug" in dirty_log_perf_test, though it's probably a good idea to > document how to effectively configure vNUMA-aware memslots.
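The "set an appropriate mempolicy on that thread" option discussed above would look roughly like the following sketch for the thread that toggles dirty logging; enable_dirty_logging_on_node() is an illustrative helper and it assumes the caller already knows which host node backs the slot.

#include <numaif.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static void enable_dirty_logging_on_node(int vm_fd,
					 struct kvm_userspace_memory_region *region,
					 int node)
{
	unsigned long nodemask = 1UL << node;

	/* Page tables KVM allocates on this thread follow this policy. */
	set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8);

	region->flags |= KVM_MEM_LOG_DIRTY_PAGES;
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, region);

	/* Restore the default (local) allocation policy. */
	set_mempolicy(MPOL_DEFAULT, NULL, 0);
}

This only covers eager splitting triggered from a controllable thread; as noted above, there is no equivalent knob that makes vCPU-fault allocations follow the memory being mapped.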
On Tue, Aug 02, 2022, David Matlack wrote: > On Mon, Aug 01, 2022 at 11:56:21PM +0000, Sean Christopherson wrote: > > On Mon, Aug 01, 2022, David Matlack wrote: > > > On Mon, Aug 01, 2022 at 08:19:28AM -0700, Vipin Sharma wrote: > > > That being said, KVM currently has a gap where a guest doing a lot of > > > remote memory accesses when touching memory for the first time will > > > cause KVM to allocate the TDP page tables on the arguably wrong node. > > > > Userspace can solve this by setting the NUMA policy on a VMA or shared-object > > basis. E.g. create dedicated memslots for each NUMA node, then bind each of the > > backing stores to the appropriate host node. > > > > If there is a gap, e.g. a backing store we want to use doesn't properly support > > mempolicy for shared mappings, then we should enhance the backing store. > > Just to clarify: this patch, and my comment here, is about the NUMA > locality of KVM's page tables, not guest memory. Oooh, I overlooked the "TDP page tables" part in the above paragraph. > KVM allocates all page tables through __get_free_page() which uses the > current task's mempolicy. This means the only way for userspace to > control the page table NUMA locality is to set mempolicies on the > threads on which KVM will allocate page tables (vCPUs and any thread > doing eager page splitting, i.e. enable dirty logging or > KVM_CLEAR_DIRTY_LOG). > > The ideal setup from a NUMA locality perspective would be that page > tables are always on the local node, but that would require KVM maintain > multiple copies of page tables (at minimum one per physical NUMA node), > which is extra memory and complexity. Hmm, it actually wouldn't be much complexity, e.g. a few bits in the role, but the memory would indeed be a problem. > With the current KVM MMU (one page table hierarchy shared by all vCPUs), > the next best setup, I'd argue, is to co-locate the page tables with the > memory they are mapping. This setup would ensure that a vCPU accessing > memory in its local virtual node would primarily be accessing page > tables in the local physical node when doing page walks. Obviously page > tables at levels 5, 4, and 3, which likely map memory spanning multiple > nodes, could not always be co-located with all the memory they map. > > My comment here is saying that there is no way for userspace to actually > enforce the above policy. There is no mempolicy userspace can set on > vCPU threads to ensure that KVM co-locates page tables with the memory > they are mapping. All userspace can do is force KVM to allocate page > tables on the same node as the vCPU thread, or on a specific node. > > For eager page splitting, userspace can control what range of memory is > going to be split by controlling the memslot layout, so it is possible > for userspace to set an appropriate mempolicy on that thread. But if you > agree about my point about vCPU threads above, I think it would be a > step in the right direction to have KVM start forcibly co-locating page > tables with memory for eager page splitting (and we can fix the vCPU > case later). I agree that there's a gap with respect to a vCPU being the first to touch a remote node, but I disagree that forcibly co-locating page tables for eager page splitting is necessary. Userspace can already force the ideal setup for eager page splitting by configuring vNUMA-aware memslots and using a task with appropriate policy to toggle dirty logging. 
And userspace really should be encouraged to do that, because otherwise walking the page tables in software to do the split is going to be constantly accessing remote memory. And if anyone ever wants to fix the aforementioned gap, the two fixes that come to mind would be to tie the policy to the memslot, or to do page_to_nid() on the underlying page (with the assumption that memory that's not backed by struct page isn't NUMA-aware, so fall back to task policy). At that point, having one-off code to pull the node from the existing page tables is also unnecessary.
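For reference, the node lookup Sean describes (page_to_nid() on the underlying page, falling back to the task policy when there is no refcounted struct page) could be written along these lines; tdp_mmu_pfn_to_nid() is a hypothetical name, not code from the patch.

/*
 * Derive a preferred node from the pfn a SPTE will map. Memory without a
 * refcounted struct page (possible since commit a8ac499bb6ab) has no
 * obvious home node, so return NUMA_NO_NODE and let the caller fall back
 * to its existing allocation path (e.g. one honoring the task mempolicy).
 */
static int tdp_mmu_pfn_to_nid(kvm_pfn_t pfn)
{
	struct page *page = kvm_pfn_to_refcounted_page(pfn);

	return page ? page_to_nid(page) : NUMA_NO_NODE;
}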
On 8/2/22 19:22, Sean Christopherson wrote:
> Userspace can already force the ideal setup for eager page splitting by
> configuring vNUMA-aware memslots and using a task with appropriate policy
> to toggle dirty logging. And userspace really should be encouraged to do
> that, because otherwise walking the page tables in software to do the
> split is going to be constantly accessing remote memory.

Yes, it's possible to locate the page tables on the node that holds the
memory they're mapping by enabling dirty logging from different tasks for
different memslots, but that seems a bit weird.

Walking the page tables in software is going to do several remote memory
accesses, but it will do that in a thread that probably is devoted to
background tasks anyway. The relative impact of remote memory accesses in
the thread that enables dirty logging vs. in the vCPU thread should also
be visible in the overall performance of dirty_log_perf_test.

So I agree with Vipin's patch and would even extend it to all page table
allocations; however, dirty_log_perf_test should be run with fixed CPU
mappings to accurately measure the impact of the remote memory accesses.

Paolo
On Tue, Aug 02, 2022, Paolo Bonzini wrote: > On 8/2/22 19:22, Sean Christopherson wrote: > > Userspace can already force the ideal setup for eager page splitting by configuring > > vNUMA-aware memslots and using a task with appropriate policy to toggle dirty > > logging. And userspace really should be encouraged to do that, because otherwise > > walking the page tables in software to do the split is going to be constantly > > accessing remote memory. > > Yes, it's possible to locate the page tables on the node that holds the > memory they're mapping by enable dirty logging from different tasks for > different memslots, but that seems a bit weird. Gah, I misread this patch. More below. But FWIW, my understanding is that Google's userspace already creates per-node memslots with dedicated tasks for dirty logging operations. There are multiple advantages to such a setup, e.g. slots can be processed in parallel (with the TDP MMU), and when the VM enters blackout the tasks can be affined to the physical CPUs that were just vacated by the VM. > Walking the page tables in software is going to do several remote memory > accesses, but it will do that in a thread that probably is devoted to > background tasks anyway. I agree that enabling dirty logging isn't exactly performance critical, but reaping the dirty log is very much in the critical path. My point is that userspace that cares about NUMA should be encouraged to create an optimal setup, at which point this one-off override is unnecessary. > The relative impact of remote memory accesses in the thread that enables > dirty logging vs. in the vCPU thread should also be visible in the overall > performance of dirty_log_perf_test. > > So I agree with Vipin's patch and would even extend it to all page table > allocations, Ah crud, I misread the patch. I was thinking tdp_mmu_spte_to_nid() was getting the node for the existing shadow page, but it's actually getting the node for the underlying physical huge page. That means this patch is broken now that KVM doesn't require a "struct page" to create huge SPTEs (commit a8ac499bb6ab ("KVM: x86/mmu: Don't require refcounted "struct page" to create huge SPTEs"). I.e. this will explode as pfn_to_page() may return garbage. return page_to_nid(pfn_to_page(spte_to_pfn(spte))); That said, I still don't like this patch, at least not as a one-off thing. I do like the idea of having page table allocations follow the node requested for the target pfn, what I don't like is having one-off behavior that silently overrides userspace policy. I would much rather we solve the problem for all page table allocations, maybe with a KVM_CAP or module param? Unfortunately, that's a very non-trivial change because it will probably require having a per-node cache in each of the per-vCPU caches. Hmm, actually, a not-awful alternative would be to have the fault path fallback to the current task's policy if allocation fails. I.e. make it best effort. E.g. as a rough sketch... 
---
 arch/x86/kvm/mmu/tdp_mmu.c | 37 ++++++++++++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index bf2ccf9debca..e475f5b55794 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -273,10 +273,11 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,

 static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
 {
+	struct kvm_mmu_memory_cache *cache = &vcpu->arch.mmu_shadow_page_cache;
 	struct kvm_mmu_page *sp;

 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
-	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+	sp->spt = kvm_mmu_alloc_shadow_page_table(cache, GFP_NOWAIT, pfn);

 	return sp;
 }
@@ -1190,7 +1191,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		if (is_removed_spte(iter.old_spte))
 			break;

-		sp = tdp_mmu_alloc_sp(vcpu);
+		sp = tdp_mmu_alloc_sp(vcpu, fault->pfn);
 		tdp_mmu_init_child_sp(sp, &iter);

 		if (tdp_mmu_link_sp(vcpu->kvm, &iter, sp, account_nx, true)) {
@@ -1402,17 +1403,39 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
 	return spte_set;
 }

-static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
+void *kvm_mmu_alloc_shadow_page_table(struct kvm_mmu_memory_cache *cache,
+				      gfp_t gfp, kvm_pfn_t pfn)
+{
+	struct page *page = kvm_pfn_to_refcounted_page(pfn);
+	struct page *spt_page;
+	int node;
+
+	gfp |= __GFP_ZERO | __GFP_ACCOUNT;
+
+	if (page) {
+		spt_page = alloc_pages_node(page_to_nid(page), gfp, 0);
+		if (spt_page)
+			return page_address(spt_page);
+	} else if (!cache) {
+		return (void *)__get_free_page(gfp);
+	}
+
+	if (cache)
+		return kvm_mmu_memory_cache_alloc(cache);
+
+	return NULL;
+}
+
+static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp,
+							 kvm_pfn_t pfn)
 {
 	struct kvm_mmu_page *sp;

-	gfp |= __GFP_ZERO;
-
 	sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
 	if (!sp)
 		return NULL;

-	sp->spt = (void *)__get_free_page(gfp);
+	sp->spt = kvm_mmu_alloc_shadow_page_table(NULL, gfp, pfn);
 	if (!sp->spt) {
 		kmem_cache_free(mmu_page_header_cache, sp);
 		return NULL;
@@ -1436,7 +1459,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 	 * If this allocation fails we drop the lock and retry with reclaim
 	 * allowed.
 	 */
-	sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
+	sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT);
 	if (sp)
 		return sp;


base-commit: f8990bfe1eab91c807ca8fc0d48705e8f986b951
--
On 8/2/22 21:12, Sean Christopherson wrote: > Ah crud, I misread the patch. I was thinking tdp_mmu_spte_to_nid() was getting > the node for the existing shadow page, but it's actually getting the node for the > underlying physical huge page. > > That means this patch is broken now that KVM doesn't require a "struct page" to > create huge SPTEs (commit a8ac499bb6ab ("KVM: x86/mmu: Don't require refcounted > "struct page" to create huge SPTEs"). I.e. this will explode as pfn_to_page() > may return garbage. > > return page_to_nid(pfn_to_page(spte_to_pfn(spte))); I was about to say that yesterday. However my knowledge of struct page things has been proved to be spotty enough, that I wasn't 100% sure of that. But yeah, with a fresh mind it's quite obvious that anything that goes through hva_to_pfn_remap and similar paths will fail. > That said, I still don't like this patch, at least not as a one-off thing. I do > like the idea of having page table allocations follow the node requested for the > target pfn, what I don't like is having one-off behavior that silently overrides > userspace policy. Yes, I totally agree with making it all or nothing. > I would much rather we solve the problem for all page table allocations, maybe > with a KVM_CAP or module param? Unfortunately, that's a very non-trivial change > because it will probably require having a per-node cache in each of the per-vCPU > caches. A module parameter is fine. If you care about performance to this level, your userspace is probably homogeneous enough. Paolo > Hmm, actually, a not-awful alternative would be to have the fault path fallback > to the current task's policy if allocation fails. I.e. make it best effort. > > E.g. as a rough sketch... > > --- > arch/x86/kvm/mmu/tdp_mmu.c | 37 ++++++++++++++++++++++++++++++------- > 1 file changed, 30 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index bf2ccf9debca..e475f5b55794 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -273,10 +273,11 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm, > > static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu) > { > + struct kvm_mmu_memory_cache *cache = &vcpu->arch.mmu_shadow_page_cache; > struct kvm_mmu_page *sp; > > sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache); > - sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache); > + sp->spt = kvm_mmu_alloc_shadow_page_table(cache, GFP_NOWAIT, pfn); > > return sp; > } > @@ -1190,7 +1191,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > if (is_removed_spte(iter.old_spte)) > break; > > - sp = tdp_mmu_alloc_sp(vcpu); > + sp = tdp_mmu_alloc_sp(vcpu, fault->pfn); > tdp_mmu_init_child_sp(sp, &iter); > > if (tdp_mmu_link_sp(vcpu->kvm, &iter, sp, account_nx, true)) { > @@ -1402,17 +1403,39 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, > return spte_set; > } > > -static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp) > +void *kvm_mmu_alloc_shadow_page_table(struct kvm_mmu_memory_cache *cache, > + gfp_t gfp, kvm_pfn_t pfn) > +{ > + struct page *page = kvm_pfn_to_refcounted_page(pfn); > + struct page *spt_page; > + int node; > + > + gfp |= __GFP_ZERO | __GFP_ACCOUNT; > + > + if (page) { > + spt_page = alloc_pages_node(page_to_nid(page), gfp, 0); > + if (spt_page) > + return page_address(spt_page); > + } else if (!cache) { > + return (void *)__get_free_page(gfp); > + } > + > + if (cache) > + return 
kvm_mmu_memory_cache_alloc(cache); > + > + return NULL; > +} > + > +static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp, > + kvm_pfn_t pfn) > { > struct kvm_mmu_page *sp; > > - gfp |= __GFP_ZERO; > - > sp = kmem_cache_alloc(mmu_page_header_cache, gfp); > if (!sp) > return NULL; > > - sp->spt = (void *)__get_free_page(gfp); > + sp->spt = kvm_mmu_alloc_shadow_page_table(NULL, gfp, pfn); > if (!sp->spt) { > kmem_cache_free(mmu_page_header_cache, sp); > return NULL; > @@ -1436,7 +1459,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm, > * If this allocation fails we drop the lock and retry with reclaim > * allowed. > */ > - sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT); > + sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT); > if (sp) > return sp; > > > base-commit: f8990bfe1eab91c807ca8fc0d48705e8f986b951 > -- >
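The module parameter Paolo suggests would be a one-liner in KVM; a sketch with an illustrative name follows, with the allocation paths then consulting the flag before overriding the task policy.

#include <linux/moduleparam.h>

/* Opt-in: allocate TDP MMU page tables on the node of the memory they map. */
static bool __read_mostly numa_aware_page_tables;
module_param(numa_aware_page_tables, bool, 0444);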
On Wed, Aug 3, 2022 at 8:08 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 8/2/22 21:12, Sean Christopherson wrote:
> > Ah crud, I misread the patch. I was thinking tdp_mmu_spte_to_nid() was getting
> > the node for the existing shadow page, but it's actually getting the node for the
> > underlying physical huge page.
> >
> > That means this patch is broken now that KVM doesn't require a "struct page" to
> > create huge SPTEs (commit a8ac499bb6ab ("KVM: x86/mmu: Don't require refcounted
> > "struct page" to create huge SPTEs"). I.e. this will explode as pfn_to_page()
> > may return garbage.
> >
> > 	return page_to_nid(pfn_to_page(spte_to_pfn(spte)));
>
> I was about to say that yesterday. However my knowledge of struct page
> things has been proved to be spotty enough, that I wasn't 100% sure of
> that. But yeah, with a fresh mind it's quite obvious that anything that
> goes through hva_to_pfn_remap and similar paths will fail.
>
> > That said, I still don't like this patch, at least not as a one-off thing. I do
> > like the idea of having page table allocations follow the node requested for the
> > target pfn, what I don't like is having one-off behavior that silently overrides
> > userspace policy.
>
> Yes, I totally agree with making it all or nothing.
>
> > I would much rather we solve the problem for all page table allocations, maybe
> > with a KVM_CAP or module param? Unfortunately, that's a very non-trivial change
> > because it will probably require having a per-node cache in each of the per-vCPU
> > caches.
>
> A module parameter is fine. If you care about performance to this
> level, your userspace is probably homogeneous enough.
>

Thank you all for the feedback and suggestions.

Regarding dirty_log_perf_test, I will send out a patch to add an option
which will tag vcpus to physical cpus using sched_setaffinity() calls.
This should increase accuracy for the test.

It seems like we are agreeing on two things:

1. Keep the same behavior for the page table pages allocation both during
   the page fault and during eager page splitting.
2. Page table pages should be allocated on the same node where the
   underlying guest memory page is residing.

Here are the two approaches, please provide feedback on which one looks
more appropriate before I start spamming your inbox with my patches.

Approach A:
Have a per NUMA node cache for page table pages allocation.

Instead of having only one mmu_shadow_page_cache per vcpu, we provide
multiple caches, one for each node, either:

  mmu_shadow_page_cache[MAX_NUMNODES]
or
  mmu_shadow_page_cache->objects[MAX_NUMNODES * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE]

We can decrease KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE to some lower value
instead of 40 to control memory consumption.

When a fault happens, use the pfn to find which node the page should
belong to and use the corresponding cache to get page table pages.

  struct page *page = kvm_pfn_to_refcounted_page(pfn);
  int nid;

  if (page) {
  	nid = page_to_nid(page);
  } else {
  	nid = numa_node_id();
  }

  ...
  tdp_mmu_alloc_sp(nid, vcpu);
  ...

  static struct kvm_mmu_page *tdp_mmu_alloc_sp(int nid, struct kvm_vcpu *vcpu)
  {
  	...
  	sp->spt = kvm_mmu_memory_cache_alloc(nid,
  					     &vcpu->arch.mmu_shadow_page_cache);
  	...
  }

Since we are changing cache allocation for page table pages, should we
also make similar changes for other caches like mmu_page_header_cache,
mmu_gfn_array_cache, and mmu_pte_list_desc_cache? I am not sure how good
this idea is.
Approach B:
Ask for the page from the specific node on the fault path, with the option
to fall back to the original cache and the default task policy.

This is what Sean's rough patch looks like.
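A rough sketch of how Approach A might look in the fault path, for discussion only. It assumes mmu_shadow_page_cache becomes a per-node array in struct kvm_vcpu_arch and that the cache for the node backing the faulting pfn is topped up before mmu_lock is taken; kvm_pfn_to_cache_nid() is an illustrative helper.

/*
 * In struct kvm_vcpu_arch (illustrative):
 *	struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];
 */
static int kvm_pfn_to_cache_nid(kvm_pfn_t pfn)
{
	struct page *page = kvm_pfn_to_refcounted_page(pfn);

	/* Non-refcounted memory has no obvious home node; use the local one. */
	return page ? page_to_nid(page) : numa_node_id();
}

static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu, kvm_pfn_t pfn)
{
	int nid = kvm_pfn_to_cache_nid(pfn);
	struct kvm_mmu_page *sp;

	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
	/* Pull the page table from the cache topped up for this node. */
	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache[nid]);

	return sp;
}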
On Fri, Aug 5, 2022 at 4:30 PM Vipin Sharma <vipinsh@google.com> wrote: [...] > > Here are the two approaches, please provide feedback on which one > looks more appropriate before I start spamming your inbox with my > patches > > Approach A: > Have per numa node cache for page table pages allocation > > Instead of having only one mmu_shadow_page_cache per vcpu, we provide > multiple caches for each node > > either: > mmu_shadow_page_cache[MAX_NUMNODES] > or > mmu_shadow_page_cache->objects[MAX_NUMNODES * KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE] I think the former approach will work better. The objects[] array is allocated dynamically during top-up now, so if a vCPU never allocates a page table to map memory on a given node, KVM will never have to allocate an objects[] array for that node. Whereas with the latter approach KVM would have to allocate the entire objects[] array up-front. > > We can decrease KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE to some lower value > instead of 40 to control memory consumption. I'm not sure we are getting any performance benefit from the cache size being so high. It doesn't fundamentally change the number of times a vCPU thread will have to call __get_free_page(), it just batches more of those calls together. Assuming reducing the cache size doesn't impact performance, I think it's a good idea to reduce it as part of this feature. KVM needs at most PT64_ROOT_MAX_LEVEL (5) page tables to handle a fault. So we could decrease the mmu_shadow_page_cache.objects[] capacity to PT64_ROOT_MAX_LEVEL (5) and support up to 8 NUMA nodes without increasing memory usage. If a user wants to run a VM on an even larger machine, I think it's safe to consume a few extra bytes for the vCPU shadow page caches at that point (the machine probably has 10s of TiB of RAM). > > When a fault happens, use the pfn to find which node the page should > belong to and use the corresponding cache to get page table pages. > > struct *page = kvm_pfn_to_refcounted_page(pfn); > int nid; > if(page) { > nid = page_to_nid(page); > } else { > nid = numa_node_id(); > } > > ... > tdp_mmu_alloc_sp(nid, vcpu); > ... > > static struct kvm_mmu_page *tdp_mmu_alloc_sp(int nid, struct kvm_vcpu *vcpu) { > ... > sp->spt = kvm_mmu_memory_cache_alloc(nid, > &vcpu->arch.mmu_shadow_page_cache); > ... > } > > > Since we are changing cache allocation for page table pages, should we > also make similar changes for other caches like mmu_page_header_cache, > mmu_gfn_array_cache, and mmu_pte_list_desc_cache? I am not sure how > good this idea is. We don't currently have a reason to make these objects NUMA-aware, so I would only recommend it if it somehow makes the code a lot simpler. > > Approach B: > Ask page from the specific node on fault path with option to fallback > to the original cache and default task policy. > > This is what Sean's rough patch looks like. This would definitely be a simpler approach but could increase the amount of time a vCPU thread holds the MMU lock when handling a fault, since KVM would start performing GFP_NOWAIT allocations under the lock. So my preference would be to try the cache approach first and see how complex it turns out to be.
On Tue, Aug 09, 2022, David Matlack wrote: > On Fri, Aug 5, 2022 at 4:30 PM Vipin Sharma <vipinsh@google.com> wrote: > > Approach B: > > Ask page from the specific node on fault path with option to fallback > > to the original cache and default task policy. > > > > This is what Sean's rough patch looks like. > > This would definitely be a simpler approach but could increase the > amount of time a vCPU thread holds the MMU lock when handling a fault, > since KVM would start performing GFP_NOWAIT allocations under the > lock. So my preference would be to try the cache approach first and > see how complex it turns out to be. Ya, as discussed off-list, I don't like my idea either :-) The pfn and thus node information is available before mmu_lock is acquired, so I don't see any reason to defer the allocation other than to reduce the memory footprint, and that's a solvable problem one way or another.
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index bf2ccf9debcaa..1e30e18fc6a03 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1402,9 +1402,19 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
 	return spte_set;
 }

-static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
+/*
+ * Caller's responsibility to pass a valid spte which has the shadow page
+ * present.
+ */
+static int tdp_mmu_spte_to_nid(u64 spte)
+{
+	return page_to_nid(pfn_to_page(spte_to_pfn(spte)));
+}
+
+static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(int nid, gfp_t gfp)
 {
 	struct kvm_mmu_page *sp;
+	struct page *spt_page;

 	gfp |= __GFP_ZERO;

@@ -1412,11 +1422,12 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
 	if (!sp)
 		return NULL;

-	sp->spt = (void *)__get_free_page(gfp);
-	if (!sp->spt) {
+	spt_page = alloc_pages_node(nid, gfp, 0);
+	if (!spt_page) {
 		kmem_cache_free(mmu_page_header_cache, sp);
 		return NULL;
 	}
+	sp->spt = page_address(spt_page);

 	return sp;
 }
@@ -1426,6 +1437,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 					     bool shared)
 {
 	struct kvm_mmu_page *sp;
+	int nid;
+
+	nid = tdp_mmu_spte_to_nid(iter->old_spte);

 	/*
 	 * Since we are allocating while under the MMU lock we have to be
@@ -1436,7 +1450,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 	 * If this allocation fails we drop the lock and retry with reclaim
 	 * allowed.
 	 */
-	sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
+	sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_NOWAIT | __GFP_ACCOUNT);
 	if (sp)
 		return sp;

@@ -1448,7 +1462,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 	write_unlock(&kvm->mmu_lock);

 	iter->yielded = true;
-	sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
+	sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_KERNEL_ACCOUNT);

 	if (shared)
 		read_lock(&kvm->mmu_lock);
tdp_mmu_alloc_sp_for_split() allocates page tables for Eager Page
Splitting. Currently it does not specify a NUMA node preference, so it
will try to allocate from the local node. The thread doing eager page
splitting is supplied by userspace and may not be running on the same node
where it would be best for page tables to be allocated.

We can improve TDP MMU eager page splitting by making
tdp_mmu_alloc_sp_for_split() NUMA-aware. Specifically, when splitting a
huge page, allocate the new lower level page tables on the same node as
the huge page.

__get_free_page() is replaced by alloc_pages_node(). This introduces two
functional changes.

1. __get_free_page() removes gfp flag __GFP_HIGHMEM via its call to
   __get_free_pages(). This should not be an issue as the __GFP_HIGHMEM
   flag is not passed in tdp_mmu_alloc_sp_for_split() anyway.

2. __get_free_page() calls alloc_pages() and uses the thread's mempolicy
   for the NUMA node allocation. From this commit, the thread's mempolicy
   will not be used and the first preference will be to allocate on the
   node where the huge page was present.

dirty_log_perf_test for a 416 vCPU and 1GB/vCPU configuration on an 8 NUMA
node machine showed dirty memory time improvements between 2% - 35% in
multiple runs.

Suggested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)