Message ID | 20220301085329.3210428-4-ying.huang@intel.com (mailing list archive) |
---|---|
State | New |
Series | NUMA balancing: optimize memory placement for memory tiering system |
On Tue, Mar 1, 2022 at 12:54 AM Huang Ying <ying.huang@intel.com> wrote:
>
> If the NUMA balancing isn't used to optimize the page placement among
> sockets but only among memory types, the hot pages in the fast memory
> node couldn't be migrated (promoted) to anywhere. So it's unnecessary
> to scan the pages in the fast memory node via changing their PTE/PMD
> mapping to be PROT_NONE. So that the page faults could be avoided
> too.
>
> In the test, if only the memory tiering NUMA balancing mode is enabled, the
> number of the NUMA balancing hint faults for the DRAM node is reduced to
> almost 0 with the patch. While the benchmark score doesn't change
> visibly.

Reviewed-by: Yang Shi <shy828301@gmail.com>

>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  mm/huge_memory.c | 30 +++++++++++++++++++++---------
>  mm/mprotect.c    | 13 ++++++++++++-
>  2 files changed, 33 insertions(+), 10 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 406a3c28c026..9ce126cb0cfd 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -34,6 +34,7 @@
>  #include <linux/oom.h>
>  #include <linux/numa.h>
>  #include <linux/page_owner.h>
> +#include <linux/sched/sysctl.h>
>
>  #include <asm/tlb.h>
>  #include <asm/pgalloc.h>
> @@ -1766,17 +1767,28 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  	}
>  #endif
>
> -	/*
> -	 * Avoid trapping faults against the zero page. The read-only
> -	 * data is likely to be read-cached on the local CPU and
> -	 * local/remote hits to the zero page are not interesting.
> -	 */
> -	if (prot_numa && is_huge_zero_pmd(*pmd))
> -		goto unlock;
> +	if (prot_numa) {
> +		struct page *page;
> +		/*
> +		 * Avoid trapping faults against the zero page. The read-only
> +		 * data is likely to be read-cached on the local CPU and
> +		 * local/remote hits to the zero page are not interesting.
> +		 */
> +		if (is_huge_zero_pmd(*pmd))
> +			goto unlock;
>
> -	if (prot_numa && pmd_protnone(*pmd))
> -		goto unlock;
> +		if (pmd_protnone(*pmd))
> +			goto unlock;
>
> +		page = pmd_page(*pmd);
> +		/*
> +		 * Skip scanning top tier node if normal numa
> +		 * balancing is disabled
> +		 */
> +		if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
> +		    node_is_toptier(page_to_nid(page)))
> +			goto unlock;
> +	}
>  	/*
>  	 * In case prot_numa, we are under mmap_read_lock(mm). It's critical
>  	 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 0138dfcdb1d8..2fe03e695c81 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -29,6 +29,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/mm_inline.h>
>  #include <linux/pgtable.h>
> +#include <linux/sched/sysctl.h>
>  #include <asm/cacheflush.h>
>  #include <asm/mmu_context.h>
>  #include <asm/tlbflush.h>
> @@ -83,6 +84,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  		 */
>  		if (prot_numa) {
>  			struct page *page;
> +			int nid;
>
>  			/* Avoid TLB flush if possible */
>  			if (pte_protnone(oldpte))
> @@ -109,7 +111,16 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  			 * Don't mess with PTEs if page is already on the node
>  			 * a single-threaded process is running on.
>  			 */
> -			if (target_node == page_to_nid(page))
> +			nid = page_to_nid(page);
> +			if (target_node == nid)
> +				continue;
> +
> +			/*
> +			 * Skip scanning top tier node if normal numa
> +			 * balancing is disabled
> +			 */
> +			if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
> +			    node_is_toptier(nid))
>  				continue;
>  		}
>
> --
> 2.30.2
>
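The gate the patch adds can be summarised outside the kernel. The sketch below is a minimal, standalone illustration (not kernel code), assuming the mode bits introduced earlier in this series (NUMA_BALANCING_NORMAL = 0x1, NUMA_BALANCING_MEMORY_TIERING = 0x2) and using a made-up node_is_toptier() stand-in that simply treats node 0 as the DRAM (top-tier) node.

```c
/*
 * Illustrative userspace sketch of the gating logic this patch adds to
 * change_pte_range()/change_huge_pmd(): when only the memory tiering mode
 * is enabled, pages on top-tier (DRAM) nodes are not made PROT_NONE for
 * NUMA hint faults, because they have nowhere to be promoted to.
 */
#include <stdio.h>
#include <stdbool.h>

/* Mirrors the mode bits introduced earlier in this series (assumption). */
#define NUMA_BALANCING_DISABLED		0x0
#define NUMA_BALANCING_NORMAL		0x1
#define NUMA_BALANCING_MEMORY_TIERING	0x2

/* Stand-in for the kernel's node_is_toptier(): pretend node 0 is DRAM. */
static bool node_is_toptier(int nid)
{
	return nid == 0;
}

/* The new check: skip the scan when normal balancing is off and the page
 * sits on a top-tier node. */
static bool skip_prot_numa_scan(int mode, int nid)
{
	return !(mode & NUMA_BALANCING_NORMAL) && node_is_toptier(nid);
}

int main(void)
{
	/* Only the tiering mode enabled, e.g. kernel.numa_balancing = 2. */
	int mode = NUMA_BALANCING_MEMORY_TIERING;
	int nid;

	for (nid = 0; nid < 2; nid++)
		printf("node %d (%s): %s\n", nid,
		       node_is_toptier(nid) ? "top tier" : "slow tier",
		       skip_prot_numa_scan(mode, nid) ?
		       "skip PROT_NONE scan" : "scan for hint faults");
	return 0;
}
```

With only the tiering bit set, the check skips top-tier pages entirely, which matches the observation in the commit message that the NUMA balancing hint faults for the DRAM node drop to almost 0.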