Message ID | 20200629234509.8F89C4EF@viggo.jf.intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | Migrate Pages in lieu of discard |
On Mon, 29 Jun 2020, Dave Hansen wrote: > From: Dave Hansen <dave.hansen@linux.intel.com> > > If a memory node has a preferred migration path to demote cold pages, > attempt to move those inactive pages to that migration node before > reclaiming. This will better utilize available memory, provide a faster > tier than swapping or discarding, and allow such pages to be reused > immediately without IO to retrieve the data. > > When handling anonymous pages, this will be considered before swap if > enabled. Should the demotion fail for any reason, the page reclaim > will proceed as if the demotion feature was not enabled. > Thanks for sharing these patches and kick-starting the conversation, Dave. Could this cause us to break a user's mbind() or allow a user to circumvent their cpuset.mems? Because we don't have a mapping of the page back to its allocation context (or the process context in which it was allocated), it seems like both are possible. So let's assume that migration nodes cannot be other DRAM nodes. Otherwise, memory pressure could be intentionally or unintentionally induced to migrate these pages to another node. Do we have such a restriction on migration nodes? > Some places we would like to see this used: > > 1. Persistent memory being as a slower, cheaper DRAM replacement > 2. Remote memory-only "expansion" NUMA nodes > 3. Resolving memory imbalances where one NUMA node is seeing more > allocation activity than another. This helps keep more recent > allocations closer to the CPUs on the node doing the allocating. > (3) is the concerning one given the above if we are to use migrate_demote_mapping() for DRAM node balancing. > Yang Shi's patches used an alternative approach where to-be-discarded > pages were collected on a separate discard list and then discarded > as a batch with migrate_pages(). This results in simpler code and > has all the performance advantages of batching, but has the > disadvantage that pages which fail to migrate never get swapped. 
> > #Signed-off-by: Keith Busch <keith.busch@intel.com> > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> > Cc: Keith Busch <kbusch@kernel.org> > Cc: Yang Shi <yang.shi@linux.alibaba.com> > Cc: David Rientjes <rientjes@google.com> > Cc: Huang Ying <ying.huang@intel.com> > Cc: Dan Williams <dan.j.williams@intel.com> > --- > > b/include/linux/migrate.h | 6 ++++ > b/include/trace/events/migrate.h | 3 +- > b/mm/debug.c | 1 > b/mm/migrate.c | 52 +++++++++++++++++++++++++++++++++++++++ > b/mm/vmscan.c | 25 ++++++++++++++++++ > 5 files changed, 86 insertions(+), 1 deletion(-) > > diff -puN include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/linux/migrate.h > --- a/include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.950312604 -0700 > +++ b/include/linux/migrate.h 2020-06-29 16:34:38.963312604 -0700 > @@ -25,6 +25,7 @@ enum migrate_reason { > MR_MEMPOLICY_MBIND, > MR_NUMA_MISPLACED, > MR_CONTIG_RANGE, > + MR_DEMOTION, > MR_TYPES > }; > > @@ -78,6 +79,7 @@ extern int migrate_huge_page_move_mappin > struct page *newpage, struct page *page); > extern int migrate_page_move_mapping(struct address_space *mapping, > struct page *newpage, struct page *page, int extra_count); > +extern int migrate_demote_mapping(struct page *page); > #else > > static inline void putback_movable_pages(struct list_head *l) {} > @@ -104,6 +106,10 @@ static inline int migrate_huge_page_move > return -ENOSYS; > } > > +static inline int migrate_demote_mapping(struct page *page) > +{ > + return -ENOSYS; > +} > #endif /* CONFIG_MIGRATION */ > > #ifdef CONFIG_COMPACTION > diff -puN include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/trace/events/migrate.h > --- a/include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.952312604 -0700 > +++ b/include/trace/events/migrate.h 2020-06-29 16:34:38.963312604 -0700 > @@ -20,7 +20,8 @@ > EM( MR_SYSCALL, "syscall_or_cpuset") \ > EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \ > EM( MR_NUMA_MISPLACED, "numa_misplaced") \ > - EMe(MR_CONTIG_RANGE, "contig_range") > + EM( MR_CONTIG_RANGE, "contig_range") \ > + EMe(MR_DEMOTION, "demotion") > > /* > * First define the enums in the above macros to be exported to userspace > diff -puN mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/debug.c > --- a/mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.954312604 -0700 > +++ b/mm/debug.c 2020-06-29 16:34:38.963312604 -0700 > @@ -25,6 +25,7 @@ const char *migrate_reason_names[MR_TYPE > "mempolicy_mbind", > "numa_misplaced", > "cma", > + "demotion", > }; > > const struct trace_print_flags pageflag_names[] = { > diff -puN mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/migrate.c > --- a/mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.956312604 -0700 > +++ b/mm/migrate.c 2020-06-29 16:34:38.964312604 -0700 > @@ -1151,6 +1151,58 @@ int next_demotion_node(int node) > return node; > } > > +static struct page *alloc_demote_node_page(struct page *page, unsigned long node) > +{ > + /* > + * 'mask' targets allocation only to the desired node in the > + * migration path, and fails fast if the allocation can not be > + * immediately satisfied. Reclaim is already active and heroic > + * allocation efforts are unwanted. 
> + */ > + gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY | > + __GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM | > + __GFP_MOVABLE; GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we actually want to kick kswapd on the pmem node? If not, GFP_TRANSHUGE_LIGHT does a trick where it does GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM. You could probably do the same here although the __GFP_IO and __GFP_FS would be unnecessary (but not harmful). > + struct page *newpage; > + > + if (PageTransHuge(page)) { > + mask |= __GFP_COMP; > + newpage = alloc_pages_node(node, mask, HPAGE_PMD_ORDER); > + if (newpage) > + prep_transhuge_page(newpage); > + } else > + newpage = alloc_pages_node(node, mask, 0); > + > + return newpage; > +} > + > +/** > + * migrate_demote_mapping() - Migrate this page and its mappings to its > + * demotion node. > + * @page: A locked, isolated, non-huge page that should migrate to its current > + * node's demotion target, if available. Since this is intended to be > + * called during memory reclaim, all flag options are set to fail fast. > + * > + * @returns: MIGRATEPAGE_SUCCESS if successful, -errno otherwise. > + */ > +int migrate_demote_mapping(struct page *page) > +{ > + int next_nid = next_demotion_node(page_to_nid(page)); > + > + VM_BUG_ON_PAGE(!PageLocked(page), page); > + VM_BUG_ON_PAGE(PageHuge(page), page); > + VM_BUG_ON_PAGE(PageLRU(page), page); > + > + if (next_nid == NUMA_NO_NODE) > + return -ENOSYS; > + if (PageTransHuge(page) && !thp_migration_supported()) > + return -ENOMEM; > + > + /* MIGRATE_ASYNC is the most light weight and never blocks.*/ > + return __unmap_and_move(alloc_demote_node_page, NULL, next_nid, > + page, MIGRATE_ASYNC, MR_DEMOTION); > +} > + > + > /* > * gcc 4.7 and 4.8 on arm get an ICEs when inlining unmap_and_move(). Work > * around it. > diff -puN mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/vmscan.c > --- a/mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.959312604 -0700 > +++ b/mm/vmscan.c 2020-06-29 16:34:38.965312604 -0700 > @@ -1077,6 +1077,7 @@ static unsigned long shrink_page_list(st > LIST_HEAD(free_pages); > unsigned nr_reclaimed = 0; > unsigned pgactivate = 0; > + int rc; > > memset(stat, 0, sizeof(*stat)); > cond_resched(); > @@ -1229,6 +1230,30 @@ static unsigned long shrink_page_list(st > ; /* try to reclaim the page below */ > } > > + rc = migrate_demote_mapping(page); > + /* > + * -ENOMEM on a THP may indicate either migration is > + * unsupported or there was not enough contiguous > + * space. Split the THP into base pages and retry the > + * head immediately. The tail pages will be considered > + * individually within the current loop's page list. > + */ > + if (rc == -ENOMEM && PageTransHuge(page) && > + !split_huge_page_to_list(page, page_list)) > + rc = migrate_demote_mapping(page); > + > + if (rc == MIGRATEPAGE_SUCCESS) { > + unlock_page(page); > + if (likely(put_page_testzero(page))) > + goto free_it; > + /* > + * Speculative reference will free this page, > + * so leave it off the LRU. > + */ > + nr_reclaimed++; nr_reclaimed += nr_pages instead? > + continue; > + } > + > /* > * Anonymous process memory has backing store? > * Try to allocate it some swap space here.
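For reference, a minimal sketch of the alternative mask David suggests here, assuming __GFP_IO and __GFP_FS are simply left in since he notes they are harmless:

	/*
	 * Sketch only: start from GFP_HIGHUSER_MOVABLE with both reclaim
	 * flags cleared (the GFP_TRANSHUGE_LIGHT trick), so kswapd is not
	 * woken on the target node, then add back the fail-fast flags.
	 */
	gfp_t mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
		     __GFP_THISNODE | __GFP_NOWARN |
		     __GFP_NORETRY | __GFP_NOMEMALLOC;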
On 6/30/20 5:47 PM, David Rientjes wrote: > On Mon, 29 Jun 2020, Dave Hansen wrote: > >> From: Dave Hansen <dave.hansen@linux.intel.com> >> >> If a memory node has a preferred migration path to demote cold pages, >> attempt to move those inactive pages to that migration node before >> reclaiming. This will better utilize available memory, provide a faster >> tier than swapping or discarding, and allow such pages to be reused >> immediately without IO to retrieve the data. >> >> When handling anonymous pages, this will be considered before swap if >> enabled. Should the demotion fail for any reason, the page reclaim >> will proceed as if the demotion feature was not enabled. >> > Thanks for sharing these patches and kick-starting the conversation, Dave. > > Could this cause us to break a user's mbind() or allow a user to > circumvent their cpuset.mems? > > Because we don't have a mapping of the page back to its allocation > context (or the process context in which it was allocated), it seems like > both are possible. Yes, this could break the memory placement policy enforced by mbind and cpuset. I discussed this with Michal on mailing list and tried to find a way to solve it, but unfortunately it seems not easy as what you mentioned above. The memory policy and cpuset is stored in task_struct rather than mm_struct. It is not easy to trace back to task_struct from page (owner field of mm_struct might be helpful, but it depends on CONFIG_MEMCG and is not preferred way). > > So let's assume that migration nodes cannot be other DRAM nodes. > Otherwise, memory pressure could be intentionally or unintentionally > induced to migrate these pages to another node. Do we have such a > restriction on migration nodes? > >> Some places we would like to see this used: >> >> 1. Persistent memory being as a slower, cheaper DRAM replacement >> 2. Remote memory-only "expansion" NUMA nodes >> 3. Resolving memory imbalances where one NUMA node is seeing more >> allocation activity than another. This helps keep more recent >> allocations closer to the CPUs on the node doing the allocating. >> > (3) is the concerning one given the above if we are to use > migrate_demote_mapping() for DRAM node balancing. > >> Yang Shi's patches used an alternative approach where to-be-discarded >> pages were collected on a separate discard list and then discarded >> as a batch with migrate_pages(). This results in simpler code and >> has all the performance advantages of batching, but has the >> disadvantage that pages which fail to migrate never get swapped. 
>> >> #Signed-off-by: Keith Busch <keith.busch@intel.com> >> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> >> Cc: Keith Busch <kbusch@kernel.org> >> Cc: Yang Shi <yang.shi@linux.alibaba.com> >> Cc: David Rientjes <rientjes@google.com> >> Cc: Huang Ying <ying.huang@intel.com> >> Cc: Dan Williams <dan.j.williams@intel.com> >> --- >> >> b/include/linux/migrate.h | 6 ++++ >> b/include/trace/events/migrate.h | 3 +- >> b/mm/debug.c | 1 >> b/mm/migrate.c | 52 +++++++++++++++++++++++++++++++++++++++ >> b/mm/vmscan.c | 25 ++++++++++++++++++ >> 5 files changed, 86 insertions(+), 1 deletion(-) >> >> diff -puN include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/linux/migrate.h >> --- a/include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.950312604 -0700 >> +++ b/include/linux/migrate.h 2020-06-29 16:34:38.963312604 -0700 >> @@ -25,6 +25,7 @@ enum migrate_reason { >> MR_MEMPOLICY_MBIND, >> MR_NUMA_MISPLACED, >> MR_CONTIG_RANGE, >> + MR_DEMOTION, >> MR_TYPES >> }; >> >> @@ -78,6 +79,7 @@ extern int migrate_huge_page_move_mappin >> struct page *newpage, struct page *page); >> extern int migrate_page_move_mapping(struct address_space *mapping, >> struct page *newpage, struct page *page, int extra_count); >> +extern int migrate_demote_mapping(struct page *page); >> #else >> >> static inline void putback_movable_pages(struct list_head *l) {} >> @@ -104,6 +106,10 @@ static inline int migrate_huge_page_move >> return -ENOSYS; >> } >> >> +static inline int migrate_demote_mapping(struct page *page) >> +{ >> + return -ENOSYS; >> +} >> #endif /* CONFIG_MIGRATION */ >> >> #ifdef CONFIG_COMPACTION >> diff -puN include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/trace/events/migrate.h >> --- a/include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.952312604 -0700 >> +++ b/include/trace/events/migrate.h 2020-06-29 16:34:38.963312604 -0700 >> @@ -20,7 +20,8 @@ >> EM( MR_SYSCALL, "syscall_or_cpuset") \ >> EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \ >> EM( MR_NUMA_MISPLACED, "numa_misplaced") \ >> - EMe(MR_CONTIG_RANGE, "contig_range") >> + EM( MR_CONTIG_RANGE, "contig_range") \ >> + EMe(MR_DEMOTION, "demotion") >> >> /* >> * First define the enums in the above macros to be exported to userspace >> diff -puN mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/debug.c >> --- a/mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.954312604 -0700 >> +++ b/mm/debug.c 2020-06-29 16:34:38.963312604 -0700 >> @@ -25,6 +25,7 @@ const char *migrate_reason_names[MR_TYPE >> "mempolicy_mbind", >> "numa_misplaced", >> "cma", >> + "demotion", >> }; >> >> const struct trace_print_flags pageflag_names[] = { >> diff -puN mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/migrate.c >> --- a/mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.956312604 -0700 >> +++ b/mm/migrate.c 2020-06-29 16:34:38.964312604 -0700 >> @@ -1151,6 +1151,58 @@ int next_demotion_node(int node) >> return node; >> } >> >> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node) >> +{ >> + /* >> + * 'mask' targets allocation only to the desired node in the >> + * migration path, and fails fast if the allocation can not be >> + * immediately satisfied. Reclaim is already active and heroic >> + * allocation efforts are unwanted. 
>> + */ >> + gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY | >> + __GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM | >> + __GFP_MOVABLE; > GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we > actually want to kick kswapd on the pmem node? > > If not, GFP_TRANSHUGE_LIGHT does a trick where it does > GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM. You could probably do the same > here although the __GFP_IO and __GFP_FS would be unnecessary (but not > harmful). I'm not sure how Dave thought about this, however, IMHO kicking kswapd on pmem node would help to free memory then improve migration success rate. In my implementation, as Dave mentioned in the commit log, the migration candidates are put on a separate list then migrated in batch by calling migrate_pages(). Kicking kswapd on pmem would help to improve success rate since migrate_pages() will retry a couple of times. Dave's implementation (as you see in this patch) does migration for per page basis, if migration is failed it will try swap. Kicking kswapd on pmem would also help the later migration. However, IMHO it seems migration retry should be still faster than swap. > >> + struct page *newpage; >> + >> + if (PageTransHuge(page)) { >> + mask |= __GFP_COMP; >> + newpage = alloc_pages_node(node, mask, HPAGE_PMD_ORDER); >> + if (newpage) >> + prep_transhuge_page(newpage); >> + } else >> + newpage = alloc_pages_node(node, mask, 0); >> + >> + return newpage; >> +} >> + >> +/** >> + * migrate_demote_mapping() - Migrate this page and its mappings to its >> + * demotion node. >> + * @page: A locked, isolated, non-huge page that should migrate to its current >> + * node's demotion target, if available. Since this is intended to be >> + * called during memory reclaim, all flag options are set to fail fast. >> + * >> + * @returns: MIGRATEPAGE_SUCCESS if successful, -errno otherwise. >> + */ >> +int migrate_demote_mapping(struct page *page) >> +{ >> + int next_nid = next_demotion_node(page_to_nid(page)); >> + >> + VM_BUG_ON_PAGE(!PageLocked(page), page); >> + VM_BUG_ON_PAGE(PageHuge(page), page); >> + VM_BUG_ON_PAGE(PageLRU(page), page); >> + >> + if (next_nid == NUMA_NO_NODE) >> + return -ENOSYS; >> + if (PageTransHuge(page) && !thp_migration_supported()) >> + return -ENOMEM; >> + >> + /* MIGRATE_ASYNC is the most light weight and never blocks.*/ >> + return __unmap_and_move(alloc_demote_node_page, NULL, next_nid, >> + page, MIGRATE_ASYNC, MR_DEMOTION); >> +} >> + >> + >> /* >> * gcc 4.7 and 4.8 on arm get an ICEs when inlining unmap_and_move(). Work >> * around it. >> diff -puN mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/vmscan.c >> --- a/mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.959312604 -0700 >> +++ b/mm/vmscan.c 2020-06-29 16:34:38.965312604 -0700 >> @@ -1077,6 +1077,7 @@ static unsigned long shrink_page_list(st >> LIST_HEAD(free_pages); >> unsigned nr_reclaimed = 0; >> unsigned pgactivate = 0; >> + int rc; >> >> memset(stat, 0, sizeof(*stat)); >> cond_resched(); >> @@ -1229,6 +1230,30 @@ static unsigned long shrink_page_list(st >> ; /* try to reclaim the page below */ >> } >> >> + rc = migrate_demote_mapping(page); >> + /* >> + * -ENOMEM on a THP may indicate either migration is >> + * unsupported or there was not enough contiguous >> + * space. Split the THP into base pages and retry the >> + * head immediately. The tail pages will be considered >> + * individually within the current loop's page list. 
>> + */ >> + if (rc == -ENOMEM && PageTransHuge(page) && >> + !split_huge_page_to_list(page, page_list)) >> + rc = migrate_demote_mapping(page); >> + >> + if (rc == MIGRATEPAGE_SUCCESS) { >> + unlock_page(page); >> + if (likely(put_page_testzero(page))) >> + goto free_it; >> + /* >> + * Speculative reference will free this page, >> + * so leave it off the LRU. >> + */ >> + nr_reclaimed++; > nr_reclaimed += nr_pages instead? > >> + continue; >> + } >> + >> /* >> * Anonymous process memory has backing store? >> * Try to allocate it some swap space here.
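A rough sketch of the batched flow Yang describes, assuming candidates are collected on a local list inside the reclaim loop and the patch's alloc_demote_node_page() callback is reused (target_nid is the demotion node):

	/* Sketch, not Yang Shi's actual patch: gather demotion candidates,
	 * then migrate them as one batch so migrate_pages() can retry the
	 * pages that fail the first attempt. */
	LIST_HEAD(demote_pages);

	/* ... reclaim moves selected pages onto demote_pages ... */

	if (!list_empty(&demote_pages))
		migrate_pages(&demote_pages, alloc_demote_node_page, NULL,
			      target_nid, MIGRATE_ASYNC, MR_DEMOTION);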
David Rientjes <rientjes@google.com> writes: > On Mon, 29 Jun 2020, Dave Hansen wrote: > >> From: Dave Hansen <dave.hansen@linux.intel.com> >> >> If a memory node has a preferred migration path to demote cold pages, >> attempt to move those inactive pages to that migration node before >> reclaiming. This will better utilize available memory, provide a faster >> tier than swapping or discarding, and allow such pages to be reused >> immediately without IO to retrieve the data. >> >> When handling anonymous pages, this will be considered before swap if >> enabled. Should the demotion fail for any reason, the page reclaim >> will proceed as if the demotion feature was not enabled. >> > > Thanks for sharing these patches and kick-starting the conversation, Dave. > > Could this cause us to break a user's mbind() or allow a user to > circumvent their cpuset.mems? > > Because we don't have a mapping of the page back to its allocation > context (or the process context in which it was allocated), it seems like > both are possible. For mbind, I think we don't have enough information during reclaim to enforce the node binding policy. But for cpuset, if cgroup v2 (with the unified hierarchy) is used, it's possible to get the node binding policy via something like, cgroup_get_e_css(page->mem_cgroup, &cpuset_cgrp_subsys) > So let's assume that migration nodes cannot be other DRAM nodes. > Otherwise, memory pressure could be intentionally or unintentionally > induced to migrate these pages to another node. Do we have such a > restriction on migration nodes? > >> Some places we would like to see this used: >> >> 1. Persistent memory being as a slower, cheaper DRAM replacement >> 2. Remote memory-only "expansion" NUMA nodes >> 3. Resolving memory imbalances where one NUMA node is seeing more >> allocation activity than another. This helps keep more recent >> allocations closer to the CPUs on the node doing the allocating. >> > > (3) is the concerning one given the above if we are to use > migrate_demote_mapping() for DRAM node balancing. > >> Yang Shi's patches used an alternative approach where to-be-discarded >> pages were collected on a separate discard list and then discarded >> as a batch with migrate_pages(). This results in simpler code and >> has all the performance advantages of batching, but has the >> disadvantage that pages which fail to migrate never get swapped. 
>> >> #Signed-off-by: Keith Busch <keith.busch@intel.com> >> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> >> Cc: Keith Busch <kbusch@kernel.org> >> Cc: Yang Shi <yang.shi@linux.alibaba.com> >> Cc: David Rientjes <rientjes@google.com> >> Cc: Huang Ying <ying.huang@intel.com> >> Cc: Dan Williams <dan.j.williams@intel.com> >> --- >> >> b/include/linux/migrate.h | 6 ++++ >> b/include/trace/events/migrate.h | 3 +- >> b/mm/debug.c | 1 >> b/mm/migrate.c | 52 +++++++++++++++++++++++++++++++++++++++ >> b/mm/vmscan.c | 25 ++++++++++++++++++ >> 5 files changed, 86 insertions(+), 1 deletion(-) >> >> diff -puN include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/linux/migrate.h >> --- a/include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.950312604 -0700 >> +++ b/include/linux/migrate.h 2020-06-29 16:34:38.963312604 -0700 >> @@ -25,6 +25,7 @@ enum migrate_reason { >> MR_MEMPOLICY_MBIND, >> MR_NUMA_MISPLACED, >> MR_CONTIG_RANGE, >> + MR_DEMOTION, >> MR_TYPES >> }; >> >> @@ -78,6 +79,7 @@ extern int migrate_huge_page_move_mappin >> struct page *newpage, struct page *page); >> extern int migrate_page_move_mapping(struct address_space *mapping, >> struct page *newpage, struct page *page, int extra_count); >> +extern int migrate_demote_mapping(struct page *page); >> #else >> >> static inline void putback_movable_pages(struct list_head *l) {} >> @@ -104,6 +106,10 @@ static inline int migrate_huge_page_move >> return -ENOSYS; >> } >> >> +static inline int migrate_demote_mapping(struct page *page) >> +{ >> + return -ENOSYS; >> +} >> #endif /* CONFIG_MIGRATION */ >> >> #ifdef CONFIG_COMPACTION >> diff -puN include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/trace/events/migrate.h >> --- a/include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.952312604 -0700 >> +++ b/include/trace/events/migrate.h 2020-06-29 16:34:38.963312604 -0700 >> @@ -20,7 +20,8 @@ >> EM( MR_SYSCALL, "syscall_or_cpuset") \ >> EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \ >> EM( MR_NUMA_MISPLACED, "numa_misplaced") \ >> - EMe(MR_CONTIG_RANGE, "contig_range") >> + EM( MR_CONTIG_RANGE, "contig_range") \ >> + EMe(MR_DEMOTION, "demotion") >> >> /* >> * First define the enums in the above macros to be exported to userspace >> diff -puN mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/debug.c >> --- a/mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.954312604 -0700 >> +++ b/mm/debug.c 2020-06-29 16:34:38.963312604 -0700 >> @@ -25,6 +25,7 @@ const char *migrate_reason_names[MR_TYPE >> "mempolicy_mbind", >> "numa_misplaced", >> "cma", >> + "demotion", >> }; >> >> const struct trace_print_flags pageflag_names[] = { >> diff -puN mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/migrate.c >> --- a/mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.956312604 -0700 >> +++ b/mm/migrate.c 2020-06-29 16:34:38.964312604 -0700 >> @@ -1151,6 +1151,58 @@ int next_demotion_node(int node) >> return node; >> } >> >> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node) >> +{ >> + /* >> + * 'mask' targets allocation only to the desired node in the >> + * migration path, and fails fast if the allocation can not be >> + * immediately satisfied. Reclaim is already active and heroic >> + * allocation efforts are unwanted. 
>> + */ >> + gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY | >> + __GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM | >> + __GFP_MOVABLE; > > GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we > actually want to kick kswapd on the pmem node? I think it should be a good idea to kick kswapd on the PMEM node. Because otherwise, we will discard more pages in DRAM node. And in general, the DRAM pages are hotter than the PMEM pages, because the cold DRAM pages are migrated to the PMEM node. > If not, GFP_TRANSHUGE_LIGHT does a trick where it does > GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM. You could probably do the same > here although the __GFP_IO and __GFP_FS would be unnecessary (but not > harmful). > >> + struct page *newpage; >> + >> + if (PageTransHuge(page)) { >> + mask |= __GFP_COMP; >> + newpage = alloc_pages_node(node, mask, HPAGE_PMD_ORDER); >> + if (newpage) >> + prep_transhuge_page(newpage); >> + } else >> + newpage = alloc_pages_node(node, mask, 0); >> + >> + return newpage; >> +} >> + Best Regards, Huang, Ying
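A very rough sketch of the lookup Ying mentions (cgroup v2 only, assumes CONFIG_MEMCG); the nodemask test itself is left as a comment because struct cpuset is private to kernel/cgroup/cpuset.c:

	struct cgroup_subsys_state *css;

	if (page->mem_cgroup) {
		css = cgroup_get_e_css(page->mem_cgroup->css.cgroup,
				       &cpuset_cgrp_subsys);
		if (css) {
			/* ... test the target node against the cpuset's
			 * effective mems_allowed here ... */
			css_put(css);
		}
	}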
On Tue, 30 Jun 2020, Yang Shi wrote: > > > From: Dave Hansen <dave.hansen@linux.intel.com> > > > > > > If a memory node has a preferred migration path to demote cold pages, > > > attempt to move those inactive pages to that migration node before > > > reclaiming. This will better utilize available memory, provide a faster > > > tier than swapping or discarding, and allow such pages to be reused > > > immediately without IO to retrieve the data. > > > > > > When handling anonymous pages, this will be considered before swap if > > > enabled. Should the demotion fail for any reason, the page reclaim > > > will proceed as if the demotion feature was not enabled. > > > > > Thanks for sharing these patches and kick-starting the conversation, Dave. > > > > Could this cause us to break a user's mbind() or allow a user to > > circumvent their cpuset.mems? > > > > Because we don't have a mapping of the page back to its allocation > > context (or the process context in which it was allocated), it seems like > > both are possible. > > Yes, this could break the memory placement policy enforced by mbind and > cpuset. I discussed this with Michal on mailing list and tried to find a way > to solve it, but unfortunately it seems not easy as what you mentioned above. > The memory policy and cpuset is stored in task_struct rather than mm_struct. > It is not easy to trace back to task_struct from page (owner field of > mm_struct might be helpful, but it depends on CONFIG_MEMCG and is not > preferred way). > Yeah, and Ying made a similar response to this message. We can do this if we consider pmem not to be a separate memory tier from the system perspective, however, but rather the socket perspective. In other words, a node can only demote to a series of exclusive pmem ranges and promote to the same series of ranges in reverse order. So DRAM node 0 can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM node 3 -- a pmem range cannot be demoted to, or promoted from, more than one DRAM node. This naturally takes care of mbind() and cpuset.mems if we consider pmem just to be slower volatile memory and we don't need to deal with the latency concerns of cross socket migration. A user page will never be demoted to a pmem range across the socket and will never be promoted to a different DRAM node that it doesn't have access to. That can work with the NUMA abstraction for pmem, but it could also theoretically be a new memory zone instead. If all memory living on pmem is migratable (the natural way that memory hotplug is done, so we can offline), this zone would live above ZONE_MOVABLE. Zonelist ordering would determine whether we can allocate directly from this memory based on system config or a new gfp flag that could be set for users of a mempolicy that allows allocations directly from pmem. If abstracted as a NUMA node instead, interleave over nodes {0,2,3} or a cpuset.mems of {0,2,3} doesn't make much sense. Kswapd would need to be enlightened for proper pgdat and pmem balancing but in theory it should be simpler because it only has its own node to manage. Existing per-zone watermarks might be easy to use to fine tune the policy from userspace: the scale factor determines how much memory we try to keep free on DRAM for migration from pmem, for example. We also wouldn't have to deal with node hotplug or updating of demotion/promotion node chains. Maybe the strongest advantage of the node abstraction is the ability to use autonuma and migrate_pages()/move_pages() API for moving pages explicitly? 
Mempolicies could be used for migration to "top-tier" memory, i.e. ZONE_NORMAL or ZONE_MOVABLE, instead.
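As a concrete illustration of the strict pairing David describes, in terms of the per-node table behind next_demotion_node() (the name node_demotion[] and the node numbers are assumptions for this example):

	/* DRAM node 0 demotes only to PMEM node 2, DRAM node 1 only to
	 * PMEM node 3; the PMEM nodes are terminal. Promotion would walk
	 * the same pairs in reverse. */
	node_demotion[0] = 2;
	node_demotion[1] = 3;
	node_demotion[2] = NUMA_NO_NODE;
	node_demotion[3] = NUMA_NO_NODE;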
David Rientjes <rientjes@google.com> writes: > On Tue, 30 Jun 2020, Yang Shi wrote: > >> > > From: Dave Hansen <dave.hansen@linux.intel.com> >> > > >> > > If a memory node has a preferred migration path to demote cold pages, >> > > attempt to move those inactive pages to that migration node before >> > > reclaiming. This will better utilize available memory, provide a faster >> > > tier than swapping or discarding, and allow such pages to be reused >> > > immediately without IO to retrieve the data. >> > > >> > > When handling anonymous pages, this will be considered before swap if >> > > enabled. Should the demotion fail for any reason, the page reclaim >> > > will proceed as if the demotion feature was not enabled. >> > > >> > Thanks for sharing these patches and kick-starting the conversation, Dave. >> > >> > Could this cause us to break a user's mbind() or allow a user to >> > circumvent their cpuset.mems? >> > >> > Because we don't have a mapping of the page back to its allocation >> > context (or the process context in which it was allocated), it seems like >> > both are possible. >> >> Yes, this could break the memory placement policy enforced by mbind and >> cpuset. I discussed this with Michal on mailing list and tried to find a way >> to solve it, but unfortunately it seems not easy as what you mentioned above. >> The memory policy and cpuset is stored in task_struct rather than mm_struct. >> It is not easy to trace back to task_struct from page (owner field of >> mm_struct might be helpful, but it depends on CONFIG_MEMCG and is not >> preferred way). >> > > Yeah, and Ying made a similar response to this message. > > We can do this if we consider pmem not to be a separate memory tier from > the system perspective, however, but rather the socket perspective. In > other words, a node can only demote to a series of exclusive pmem ranges > and promote to the same series of ranges in reverse order. So DRAM node 0 > can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM > node 3 -- a pmem range cannot be demoted to, or promoted from, more than > one DRAM node. > > This naturally takes care of mbind() and cpuset.mems if we consider pmem > just to be slower volatile memory and we don't need to deal with the > latency concerns of cross socket migration. A user page will never be > demoted to a pmem range across the socket and will never be promoted to a > different DRAM node that it doesn't have access to. > > That can work with the NUMA abstraction for pmem, but it could also > theoretically be a new memory zone instead. If all memory living on pmem > is migratable (the natural way that memory hotplug is done, so we can > offline), this zone would live above ZONE_MOVABLE. Zonelist ordering > would determine whether we can allocate directly from this memory based on > system config or a new gfp flag that could be set for users of a mempolicy > that allows allocations directly from pmem. If abstracted as a NUMA node > instead, interleave over nodes {0,2,3} or a cpuset.mems of {0,2,3} doesn't > make much sense. Why can not we just bind the memory of the application to node 0, 2, 3 via mbind() or cpuset.mems? Then the application can allocate memory directly from PMEM. And if we bind the memory of the application via mbind() to node 0, we can only allocate memory directly from DRAM. Best Regards, Huang, Ying
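Ying's suggestion, as a minimal userspace sketch (addr and len stand in for the application's region, the node numbers follow the example earlier in the thread, and error handling is omitted):

	#include <numaif.h>

	/* Allow DRAM node 0 plus PMEM nodes 2 and 3 for this range, so
	 * allocations may fall back to PMEM; binding to node 0 alone
	 * would keep them on DRAM only. */
	unsigned long nodemask = (1UL << 0) | (1UL << 2) | (1UL << 3);

	mbind(addr, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask),
	      MPOL_MF_MOVE);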
On 6/30/20 10:41 PM, David Rientjes wrote: > Maybe the strongest advantage of the node abstraction is the ability to > use autonuma and migrate_pages()/move_pages() API for moving pages > explicitly? Mempolicies could be used for migration to "top-tier" memory, > i.e. ZONE_NORMAL or ZONE_MOVABLE, instead. I totally agree that we _could_ introduce this new memory class as a zone. Doing it as nodes is pretty natural since the firmware today describes both slow (versus DRAM) and fast memory as separate nodes. It also means that apps can get visibility into placement with existing NUMA tooling and ABIs. To me, those are the two strongest reasons for PMEM. Looking to the future, I don't think the zone approach scales. I know folks want to build stuff within a single socket which is a mix of: 1. High-Bandwidth, on-package memory (a la MCDRAM) 2. DRAM 3. DRAM-cached PMEM (aka. "memory mode" PMEM) 4. Non-cached PMEM Right now, #1 doesn't exist on modern platform and #3/#4 can't be mixed (you only get 3 _or_ 4 at once). I'd love to provide something here that Intel can use to build future crazy platform configurations that don't require kernel enabling.
On 6/30/20 5:47 PM, David Rientjes wrote: > On Mon, 29 Jun 2020, Dave Hansen wrote: >> From: Dave Hansen <dave.hansen@linux.intel.com> >> >> If a memory node has a preferred migration path to demote cold pages, >> attempt to move those inactive pages to that migration node before >> reclaiming. This will better utilize available memory, provide a faster >> tier than swapping or discarding, and allow such pages to be reused >> immediately without IO to retrieve the data. >> >> When handling anonymous pages, this will be considered before swap if >> enabled. Should the demotion fail for any reason, the page reclaim >> will proceed as if the demotion feature was not enabled. >> > > Thanks for sharing these patches and kick-starting the conversation, Dave. > > Could this cause us to break a user's mbind() or allow a user to > circumvent their cpuset.mems? In its current form, yes. My current rationale for this is that while it's not as deferential as it can be to the user/kernel ABI contract, it's good *overall* behavior. The auto-migration only kicks in when the data is about to go away. So while the user's data might be slower than they like, it is *WAY* faster than they deserve because it should be off on the disk. > Because we don't have a mapping of the page back to its allocation > context (or the process context in which it was allocated), it seems like > both are possible. > > So let's assume that migration nodes cannot be other DRAM nodes. > Otherwise, memory pressure could be intentionally or unintentionally > induced to migrate these pages to another node. Do we have such a > restriction on migration nodes? There's nothing explicit. On a normal, balanced system where there's a 1:1:1 relationship between CPU sockets, DRAM nodes and PMEM nodes, it's implicit since the migration path is one deep and goes from DRAM->PMEM. If there were some oddball system where there was a memory only DRAM node, it might very well end up being a migration target. >> Some places we would like to see this used: >> >> 1. Persistent memory being as a slower, cheaper DRAM replacement >> 2. Remote memory-only "expansion" NUMA nodes >> 3. Resolving memory imbalances where one NUMA node is seeing more >> allocation activity than another. This helps keep more recent >> allocations closer to the CPUs on the node doing the allocating. > > (3) is the concerning one given the above if we are to use > migrate_demote_mapping() for DRAM node balancing. Yeah, agreed. That's the sketchiest of the three. :) >> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node) >> +{ >> + /* >> + * 'mask' targets allocation only to the desired node in the >> + * migration path, and fails fast if the allocation can not be >> + * immediately satisfied. Reclaim is already active and heroic >> + * allocation efforts are unwanted. >> + */ >> + gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY | >> + __GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM | >> + __GFP_MOVABLE; > > GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we > actually want to kick kswapd on the pmem node? In my mental model, cold data flows from: DRAM -> PMEM -> swap Kicking kswapd here ensures that while we're doing DRAM->PMEM migrations for kinda cold data, kswapd can be working on doing the PMEM->swap part on really cold data. ... 
>> @@ -1229,6 +1230,30 @@ static unsigned long shrink_page_list(st >> ; /* try to reclaim the page below */ >> } >> >> + rc = migrate_demote_mapping(page); >> + /* >> + * -ENOMEM on a THP may indicate either migration is >> + * unsupported or there was not enough contiguous >> + * space. Split the THP into base pages and retry the >> + * head immediately. The tail pages will be considered >> + * individually within the current loop's page list. >> + */ >> + if (rc == -ENOMEM && PageTransHuge(page) && >> + !split_huge_page_to_list(page, page_list)) >> + rc = migrate_demote_mapping(page); >> + >> + if (rc == MIGRATEPAGE_SUCCESS) { >> + unlock_page(page); >> + if (likely(put_page_testzero(page))) >> + goto free_it; >> + /* >> + * Speculative reference will free this page, >> + * so leave it off the LRU. >> + */ >> + nr_reclaimed++; > > nr_reclaimed += nr_pages instead? Oh, good catch. I also need to go double-check that 'nr_pages' isn't wrong elsewhere because of the split.
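A sketch of the accounting fix being agreed on here, assuming nr_pages is recomputed after a possible THP split (e.g. from compound_nr(page)) so it matches what was actually migrated:

	if (rc == MIGRATEPAGE_SUCCESS) {
		unlock_page(page);
		if (likely(put_page_testzero(page)))
			goto free_it;
		/*
		 * Speculative reference will free this page, so leave it
		 * off the LRU; count every base page that was demoted.
		 */
		nr_reclaimed += nr_pages;
		continue;
	}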
On 6/30/20 10:41 PM, David Rientjes wrote: > On Tue, 30 Jun 2020, Yang Shi wrote: > >>>> From: Dave Hansen <dave.hansen@linux.intel.com> >>>> >>>> If a memory node has a preferred migration path to demote cold pages, >>>> attempt to move those inactive pages to that migration node before >>>> reclaiming. This will better utilize available memory, provide a faster >>>> tier than swapping or discarding, and allow such pages to be reused >>>> immediately without IO to retrieve the data. >>>> >>>> When handling anonymous pages, this will be considered before swap if >>>> enabled. Should the demotion fail for any reason, the page reclaim >>>> will proceed as if the demotion feature was not enabled. >>>> >>> Thanks for sharing these patches and kick-starting the conversation, Dave. >>> >>> Could this cause us to break a user's mbind() or allow a user to >>> circumvent their cpuset.mems? >>> >>> Because we don't have a mapping of the page back to its allocation >>> context (or the process context in which it was allocated), it seems like >>> both are possible. >> Yes, this could break the memory placement policy enforced by mbind and >> cpuset. I discussed this with Michal on mailing list and tried to find a way >> to solve it, but unfortunately it seems not easy as what you mentioned above. >> The memory policy and cpuset is stored in task_struct rather than mm_struct. >> It is not easy to trace back to task_struct from page (owner field of >> mm_struct might be helpful, but it depends on CONFIG_MEMCG and is not >> preferred way). >> > Yeah, and Ying made a similar response to this message. > > We can do this if we consider pmem not to be a separate memory tier from > the system perspective, however, but rather the socket perspective. In > other words, a node can only demote to a series of exclusive pmem ranges > and promote to the same series of ranges in reverse order. So DRAM node 0 > can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM > node 3 -- a pmem range cannot be demoted to, or promoted from, more than > one DRAM node. > > This naturally takes care of mbind() and cpuset.mems if we consider pmem > just to be slower volatile memory and we don't need to deal with the > latency concerns of cross socket migration. A user page will never be > demoted to a pmem range across the socket and will never be promoted to a > different DRAM node that it doesn't have access to. But I don't see too much benefit to limit the migration target to the so-called *paired* pmem node. IMHO it is fine to migrate to a remote (on a different socket) pmem node since even the cross socket access should be much faster then refault or swap from disk. > > That can work with the NUMA abstraction for pmem, but it could also > theoretically be a new memory zone instead. If all memory living on pmem > is migratable (the natural way that memory hotplug is done, so we can > offline), this zone would live above ZONE_MOVABLE. Zonelist ordering > would determine whether we can allocate directly from this memory based on > system config or a new gfp flag that could be set for users of a mempolicy > that allows allocations directly from pmem. If abstracted as a NUMA node > instead, interleave over nodes {0,2,3} or a cpuset.mems of {0,2,3} doesn't > make much sense. > > Kswapd would need to be enlightened for proper pgdat and pmem balancing > but in theory it should be simpler because it only has its own node to > manage. 
Existing per-zone watermarks might be easy to use to fine tune > the policy from userspace: the scale factor determines how much memory we > try to keep free on DRAM for migration from pmem, for example. We also > wouldn't have to deal with node hotplug or updating of demotion/promotion > node chains. > > Maybe the strongest advantage of the node abstraction is the ability to > use autonuma and migrate_pages()/move_pages() API for moving pages > explicitly? Mempolicies could be used for migration to "top-tier" memory, > i.e. ZONE_NORMAL or ZONE_MOVABLE, instead. I think using pmem as a node is more natural than zone and less intrusive since we can just reuse all the numa APIs. If we treat pmem as a new zone I think the implementation may be more intrusive and complicated (i.e. need a new gfp flag) and user can't control the memory placement. Actually there had been such proposal before, please see https://www.spinics.net/lists/linux-mm/msg151788.html
On 7/1/20 1:54 AM, Huang, Ying wrote: > Why can not we just bind the memory of the application to node 0, 2, 3 > via mbind() or cpuset.mems? Then the application can allocate memory > directly from PMEM. And if we bind the memory of the application via > mbind() to node 0, we can only allocate memory directly from DRAM. Applications use cpuset.mems precisely because they don't want to allocate directly from PMEM. They want the good, deterministic, performance they get from DRAM. Even if they don't allocate directly from PMEM, is it OK for such an app to get its cold data migrated to PMEM? That's a much more subtle question and I suspect the kernel isn't going to have a single answer for it. I suspect we'll need a cpuset-level knob to turn auto-demotion on or off.
On Wed, 1 Jul 2020, Dave Hansen wrote: > > Could this cause us to break a user's mbind() or allow a user to > > circumvent their cpuset.mems? > > In its current form, yes. > > My current rationale for this is that while it's not as deferential as > it can be to the user/kernel ABI contract, it's good *overall* behavior. > The auto-migration only kicks in when the data is about to go away. So > while the user's data might be slower than they like, it is *WAY* faster > than they deserve because it should be off on the disk. > It's outside the scope of this patchset, but eventually there will be a promotion path that I think requires a strict 1:1 relationship between DRAM and PMEM nodes because otherwise mbind(), set_mempolicy(), and cpuset.mems become ineffective for nodes facing memory pressure. For the purposes of this patchset, agreed that DRAM -> PMEM -> swap makes perfect sense. Theoretically, I think you could have DRAM N0 and N1 and then a single PMEM N2 and this N2 can be the terminal node for both N0 and N1. On promotion, I think we need to rely on something stronger than autonuma to decide which DRAM node to promote to: specifically any user policy put into effect (memory tiering or autonuma shouldn't be allowed to subvert these user policies). As others have mentioned, we lose the allocation or process context at the time of demotion or promotion and any workaround for that requires some hacks, such as mapping the page to cpuset (what is the right solution for shared pages?) or adding NUMA locality handling to memcg. I think a 1:1 relationship between DRAM and PMEM nodes is required if we consider the eventual promotion of this memory so that user memory can't eventually reappear on a DRAM node that is not allowed by mbind(), set_mempolicy(), or cpuset.mems. I think it also makes this patchset much simpler. > > Because we don't have a mapping of the page back to its allocation > > context (or the process context in which it was allocated), it seems like > > both are possible. > > > > So let's assume that migration nodes cannot be other DRAM nodes. > > Otherwise, memory pressure could be intentionally or unintentionally > > induced to migrate these pages to another node. Do we have such a > > restriction on migration nodes? > > There's nothing explicit. On a normal, balanced system where there's a > 1:1:1 relationship between CPU sockets, DRAM nodes and PMEM nodes, it's > implicit since the migration path is one deep and goes from DRAM->PMEM. > > If there were some oddball system where there was a memory only DRAM > node, it might very well end up being a migration target. > Shouldn't DRAM->DRAM demotion be banned? It's all DRAM and within the control of mempolicies and cpusets today, so I had assumed this is outside the scope of memory tiering support. I had assumed that memory tiering support was all about separate tiers :) > >> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node) > >> +{ > >> + /* > >> + * 'mask' targets allocation only to the desired node in the > >> + * migration path, and fails fast if the allocation can not be > >> + * immediately satisfied. Reclaim is already active and heroic > >> + * allocation efforts are unwanted. > >> + */ > >> + gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY | > >> + __GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM | > >> + __GFP_MOVABLE; > > > > GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we > > actually want to kick kswapd on the pmem node? 
> > In my mental model, cold data flows from: > > DRAM -> PMEM -> swap > > Kicking kswapd here ensures that while we're doing DRAM->PMEM migrations > for kinda cold data, kswapd can be working on doing the PMEM->swap part > on really cold data. > Makes sense.
On Wed, 1 Jul 2020, Yang Shi wrote: > > We can do this if we consider pmem not to be a separate memory tier from > > the system perspective, however, but rather the socket perspective. In > > other words, a node can only demote to a series of exclusive pmem ranges > > and promote to the same series of ranges in reverse order. So DRAM node 0 > > can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM > > node 3 -- a pmem range cannot be demoted to, or promoted from, more than > > one DRAM node. > > > > This naturally takes care of mbind() and cpuset.mems if we consider pmem > > just to be slower volatile memory and we don't need to deal with the > > latency concerns of cross socket migration. A user page will never be > > demoted to a pmem range across the socket and will never be promoted to a > > different DRAM node that it doesn't have access to. > > But I don't see too much benefit to limit the migration target to the > so-called *paired* pmem node. IMHO it is fine to migrate to a remote (on a > different socket) pmem node since even the cross socket access should be much > faster then refault or swap from disk. > Hi Yang, Right, but any eventual promotion path would allow this to subvert the user mempolicy or cpuset.mems if the demoted memory is eventually promoted to a DRAM node on its socket. We've discussed not having the ability to map from the demoted page to either of these contexts and it becomes more difficult for shared memory. We have page_to_nid() and page_zone() so we can always find the appropriate demotion or promotion node for a given page if there is a 1:1 relationship. Do we lose anything with the strict 1:1 relationship between DRAM and PMEM nodes? It seems much simpler in terms of implementation and is more intuitive. > I think using pmem as a node is more natural than zone and less intrusive > since we can just reuse all the numa APIs. If we treat pmem as a new zone I > think the implementation may be more intrusive and complicated (i.e. need a > new gfp flag) and user can't control the memory placement. > This is an important decision to make, I'm not sure that we actually *want* all of these NUMA APIs :) If my memory is demoted, I can simply do migrate_pages() back to DRAM and cause other memory to be demoted in its place. Things like MPOL_INTERLEAVE over nodes {0,1,2} don't make sense. Kswapd for a DRAM node putting pressure on a PMEM node for demotion that then puts the kswapd for the PMEM node under pressure to reclaim it serves *only* to spend unnecessary cpu cycles. Users could control the memory placement through a new mempolicy flag, which I think are needed anyway for explicit allocation policies for PMEM nodes. Consider if PMEM is a zone so that it has the natural 1:1 relationship with DRAM, now your system only has nodes {0,1} as today, no new NUMA topology to consider, and a mempolicy flag MPOL_F_TOPTIER that specifies memory must be allocated from ZONE_MOVABLE or ZONE_NORMAL (and I can then mlock() if I want to disable demotion on memory pressure).
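Purely as an illustration of this proposal (MPOL_F_TOPTIER is hypothetical, not an existing kernel flag; nodemask and maxnode are placeholders):

	/* Hypothetical flag from the message above: restrict allocations
	 * to top-tier zones (ZONE_NORMAL/ZONE_MOVABLE on DRAM); mlock()
	 * could then opt the memory out of demotion entirely. */
	set_mempolicy(MPOL_BIND | MPOL_F_TOPTIER, &nodemask, maxnode);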
On Wed, 1 Jul 2020, Dave Hansen wrote: > Even if they don't allocate directly from PMEM, is it OK for such an app > to get its cold data migrated to PMEM? That's a much more subtle > question and I suspect the kernel isn't going to have a single answer > for it. I suspect we'll need a cpuset-level knob to turn auto-demotion > on or off. > I think the answer is whether the app's cold data can be reclaimed, otherwise migration to PMEM is likely better in terms of performance. So any such app today should just be mlocking its cold data if it can't handle overhead from reclaim?
David Rientjes <rientjes@google.com> writes: > On Wed, 1 Jul 2020, Dave Hansen wrote: > >> Even if they don't allocate directly from PMEM, is it OK for such an app >> to get its cold data migrated to PMEM? That's a much more subtle >> question and I suspect the kernel isn't going to have a single answer >> for it. I suspect we'll need a cpuset-level knob to turn auto-demotion >> on or off. >> > > I think the answer is whether the app's cold data can be reclaimed, > otherwise migration to PMEM is likely better in terms of performance. So > any such app today should just be mlocking its cold data if it can't > handle overhead from reclaim? Yes. That's a way to solve the problem. A cpuset-level knob may be more flexible, because you don't need to change the application source code. Best Regards, Huang, Ying
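The mlock() approach mentioned above, as a minimal userspace sketch (cold_buf and cold_len are placeholders for the application's cold region):

	#include <stdio.h>
	#include <sys/mman.h>

	/* Pinned pages are unevictable, so reclaim will neither demote
	 * nor swap them. */
	if (mlock(cold_buf, cold_len) != 0)
		perror("mlock");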
David Rientjes <rientjes@google.com> writes: > On Wed, 1 Jul 2020, Dave Hansen wrote: > >> > Could this cause us to break a user's mbind() or allow a user to >> > circumvent their cpuset.mems? >> >> In its current form, yes. >> >> My current rationale for this is that while it's not as deferential as >> it can be to the user/kernel ABI contract, it's good *overall* behavior. >> The auto-migration only kicks in when the data is about to go away. So >> while the user's data might be slower than they like, it is *WAY* faster >> than they deserve because it should be off on the disk. >> > > It's outside the scope of this patchset, but eventually there will be a > promotion path that I think requires a strict 1:1 relationship between > DRAM and PMEM nodes because otherwise mbind(), set_mempolicy(), and > cpuset.mems become ineffective for nodes facing memory pressure. I have posted an patchset for AutoNUMA based promotion support, https://lore.kernel.org/lkml/20200218082634.1596727-1-ying.huang@intel.com/ Where, the page is promoted upon NUMA hint page fault. So all memory policy (mbind(), set_mempolicy(), and cpuset.mems) are available. We can refuse promoting the page to the DRAM nodes that are not allowed by any memory policy. So, 1:1 relationship isn't necessary for promotion. > For the purposes of this patchset, agreed that DRAM -> PMEM -> swap makes > perfect sense. Theoretically, I think you could have DRAM N0 and N1 and > then a single PMEM N2 and this N2 can be the terminal node for both N0 and > N1. On promotion, I think we need to rely on something stronger than > autonuma to decide which DRAM node to promote to: specifically any user > policy put into effect (memory tiering or autonuma shouldn't be allowed to > subvert these user policies). > > As others have mentioned, we lose the allocation or process context at the > time of demotion or promotion As above, we have process context at time of promotion. > and any workaround for that requires some > hacks, such as mapping the page to cpuset (what is the right solution for > shared pages?) or adding NUMA locality handling to memcg. It sounds natural to me to add NUMA nodes restriction to memcg. Best Regards, Huang, Ying
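A minimal sketch of the refusal Ying describes, placed in the NUMA hint fault path where task context is available (target_nid and the exact call site are illustrative):

	/* Refuse promotion to a DRAM node the faulting task's cpuset does
	 * not allow; a mempolicy check would be analogous. */
	if (!node_isset(target_nid, cpuset_current_mems_allowed))
		return NUMA_NO_NODE;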
On Wed, 1 Jul 2020 12:45:17 -0700 David Rientjes <rientjes@google.com> wrote: > On Wed, 1 Jul 2020, Yang Shi wrote: > > > > We can do this if we consider pmem not to be a separate memory tier from > > > the system perspective, however, but rather the socket perspective. In > > > other words, a node can only demote to a series of exclusive pmem ranges > > > and promote to the same series of ranges in reverse order. So DRAM node 0 > > > can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM > > > node 3 -- a pmem range cannot be demoted to, or promoted from, more than > > > one DRAM node. > > > > > > This naturally takes care of mbind() and cpuset.mems if we consider pmem > > > just to be slower volatile memory and we don't need to deal with the > > > latency concerns of cross socket migration. A user page will never be > > > demoted to a pmem range across the socket and will never be promoted to a > > > different DRAM node that it doesn't have access to. > > > > But I don't see too much benefit to limit the migration target to the > > so-called *paired* pmem node. IMHO it is fine to migrate to a remote (on a > > different socket) pmem node since even the cross socket access should be much > > faster then refault or swap from disk. > > > > Hi Yang, > > Right, but any eventual promotion path would allow this to subvert the > user mempolicy or cpuset.mems if the demoted memory is eventually promoted > to a DRAM node on its socket. We've discussed not having the ability to > map from the demoted page to either of these contexts and it becomes more > difficult for shared memory. We have page_to_nid() and page_zone() so we > can always find the appropriate demotion or promotion node for a given > page if there is a 1:1 relationship. > > Do we lose anything with the strict 1:1 relationship between DRAM and PMEM > nodes? It seems much simpler in terms of implementation and is more > intuitive. Hi David, Yang, The 1:1 mapping implies a particular system topology. In the medium term we are likely to see systems with a central pool of persistent memory with equal access characteristics from multiple CPU containing nodes, each with local DRAM. Clearly we could fake a split of such a pmem pool to keep the 1:1 mapping but it's certainly not elegant and may be very wasteful for resources. Can a zone based approach work well without such a hard wall? Jonathan > > > I think using pmem as a node is more natural than zone and less intrusive > > since we can just reuse all the numa APIs. If we treat pmem as a new zone I > > think the implementation may be more intrusive and complicated (i.e. need a > > new gfp flag) and user can't control the memory placement. > > > > This is an important decision to make, I'm not sure that we actually > *want* all of these NUMA APIs :) If my memory is demoted, I can simply do > migrate_pages() back to DRAM and cause other memory to be demoted in its > place. Things like MPOL_INTERLEAVE over nodes {0,1,2} don't make sense. > Kswapd for a DRAM node putting pressure on a PMEM node for demotion that > then puts the kswapd for the PMEM node under pressure to reclaim it serves > *only* to spend unnecessary cpu cycles. > > Users could control the memory placement through a new mempolicy flag, > which I think are needed anyway for explicit allocation policies for PMEM > nodes. 
Consider if PMEM is a zone so that it has the natural 1:1 > relationship with DRAM, now your system only has nodes {0,1} as today, no > new NUMA topology to consider, and a mempolicy flag MPOL_F_TOPTIER that > specifies memory must be allocated from ZONE_MOVABLE or ZONE_NORMAL (and I > can then mlock() if I want to disable demotion on memory pressure). >
diff -puN include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/linux/migrate.h --- a/include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.950312604 -0700 +++ b/include/linux/migrate.h 2020-06-29 16:34:38.963312604 -0700 @@ -25,6 +25,7 @@ enum migrate_reason { MR_MEMPOLICY_MBIND, MR_NUMA_MISPLACED, MR_CONTIG_RANGE, + MR_DEMOTION, MR_TYPES }; @@ -78,6 +79,7 @@ extern int migrate_huge_page_move_mappin struct page *newpage, struct page *page); extern int migrate_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page, int extra_count); +extern int migrate_demote_mapping(struct page *page); #else static inline void putback_movable_pages(struct list_head *l) {} @@ -104,6 +106,10 @@ static inline int migrate_huge_page_move return -ENOSYS; } +static inline int migrate_demote_mapping(struct page *page) +{ + return -ENOSYS; +} #endif /* CONFIG_MIGRATION */ #ifdef CONFIG_COMPACTION diff -puN include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/trace/events/migrate.h --- a/include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.952312604 -0700 +++ b/include/trace/events/migrate.h 2020-06-29 16:34:38.963312604 -0700 @@ -20,7 +20,8 @@ EM( MR_SYSCALL, "syscall_or_cpuset") \ EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \ EM( MR_NUMA_MISPLACED, "numa_misplaced") \ - EMe(MR_CONTIG_RANGE, "contig_range") + EM( MR_CONTIG_RANGE, "contig_range") \ + EMe(MR_DEMOTION, "demotion") /* * First define the enums in the above macros to be exported to userspace diff -puN mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/debug.c --- a/mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.954312604 -0700 +++ b/mm/debug.c 2020-06-29 16:34:38.963312604 -0700 @@ -25,6 +25,7 @@ const char *migrate_reason_names[MR_TYPE "mempolicy_mbind", "numa_misplaced", "cma", + "demotion", }; const struct trace_print_flags pageflag_names[] = { diff -puN mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/migrate.c --- a/mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.956312604 -0700 +++ b/mm/migrate.c 2020-06-29 16:34:38.964312604 -0700 @@ -1151,6 +1151,58 @@ int next_demotion_node(int node) return node; } +static struct page *alloc_demote_node_page(struct page *page, unsigned long node) +{ + /* + * 'mask' targets allocation only to the desired node in the + * migration path, and fails fast if the allocation can not be + * immediately satisfied. Reclaim is already active and heroic + * allocation efforts are unwanted. + */ + gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY | + __GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM | + __GFP_MOVABLE; + struct page *newpage; + + if (PageTransHuge(page)) { + mask |= __GFP_COMP; + newpage = alloc_pages_node(node, mask, HPAGE_PMD_ORDER); + if (newpage) + prep_transhuge_page(newpage); + } else + newpage = alloc_pages_node(node, mask, 0); + + return newpage; +} + +/** + * migrate_demote_mapping() - Migrate this page and its mappings to its + * demotion node. + * @page: A locked, isolated, non-huge page that should migrate to its current + * node's demotion target, if available. Since this is intended to be + * called during memory reclaim, all flag options are set to fail fast. + * + * @returns: MIGRATEPAGE_SUCCESS if successful, -errno otherwise. 
+ */ +int migrate_demote_mapping(struct page *page) +{ + int next_nid = next_demotion_node(page_to_nid(page)); + + VM_BUG_ON_PAGE(!PageLocked(page), page); + VM_BUG_ON_PAGE(PageHuge(page), page); + VM_BUG_ON_PAGE(PageLRU(page), page); + + if (next_nid == NUMA_NO_NODE) + return -ENOSYS; + if (PageTransHuge(page) && !thp_migration_supported()) + return -ENOMEM; + + /* MIGRATE_ASYNC is the most light weight and never blocks.*/ + return __unmap_and_move(alloc_demote_node_page, NULL, next_nid, + page, MIGRATE_ASYNC, MR_DEMOTION); +} + + /* * gcc 4.7 and 4.8 on arm get an ICEs when inlining unmap_and_move(). Work * around it. diff -puN mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/vmscan.c --- a/mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard 2020-06-29 16:34:38.959312604 -0700 +++ b/mm/vmscan.c 2020-06-29 16:34:38.965312604 -0700 @@ -1077,6 +1077,7 @@ static unsigned long shrink_page_list(st LIST_HEAD(free_pages); unsigned nr_reclaimed = 0; unsigned pgactivate = 0; + int rc; memset(stat, 0, sizeof(*stat)); cond_resched(); @@ -1229,6 +1230,30 @@ static unsigned long shrink_page_list(st ; /* try to reclaim the page below */ } + rc = migrate_demote_mapping(page); + /* + * -ENOMEM on a THP may indicate either migration is + * unsupported or there was not enough contiguous + * space. Split the THP into base pages and retry the + * head immediately. The tail pages will be considered + * individually within the current loop's page list. + */ + if (rc == -ENOMEM && PageTransHuge(page) && + !split_huge_page_to_list(page, page_list)) + rc = migrate_demote_mapping(page); + + if (rc == MIGRATEPAGE_SUCCESS) { + unlock_page(page); + if (likely(put_page_testzero(page))) + goto free_it; + /* + * Speculative reference will free this page, + * so leave it off the LRU. + */ + nr_reclaimed++; + continue; + } + /* * Anonymous process memory has backing store? * Try to allocate it some swap space here.