diff mbox series

[RFC,3/8] mm/vmscan: Attempt to migrate page in lieu of discard

Message ID 20200629234509.8F89C4EF@viggo.jf.intel.com (mailing list archive)
State New, archived
Headers show
Series Migrate Pages in lieu of discard | expand

Commit Message

Dave Hansen June 29, 2020, 11:45 p.m. UTC
From: Dave Hansen <dave.hansen@linux.intel.com>

If a memory node has a preferred migration path to demote cold pages,
attempt to move those inactive pages to that migration node before
reclaiming. This will better utilize available memory, provide a faster
tier than swapping or discarding, and allow such pages to be reused
immediately without IO to retrieve the data.

When handling anonymous pages, this will be considered before swap if
enabled. Should the demotion fail for any reason, the page reclaim
will proceed as if the demotion feature was not enabled.

Some places we would like to see this used:

  1. Persistent memory being as a slower, cheaper DRAM replacement
  2. Remote memory-only "expansion" NUMA nodes
  3. Resolving memory imbalances where one NUMA node is seeing more
     allocation activity than another.  This helps keep more recent
     allocations closer to the CPUs on the node doing the allocating.

Yang Shi's patches used an alternative approach where to-be-discarded
pages were collected on a separate discard list and then discarded
as a batch with migrate_pages().  This results in simpler code and
has all the performance advantages of batching, but has the
disadvantage that pages which fail to migrate never get swapped.

#Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---

 b/include/linux/migrate.h        |    6 ++++
 b/include/trace/events/migrate.h |    3 +-
 b/mm/debug.c                     |    1 
 b/mm/migrate.c                   |   52 +++++++++++++++++++++++++++++++++++++++
 b/mm/vmscan.c                    |   25 ++++++++++++++++++
 5 files changed, 86 insertions(+), 1 deletion(-)

Comments

David Rientjes July 1, 2020, 12:47 a.m. UTC | #1
On Mon, 29 Jun 2020, Dave Hansen wrote:

> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> If a memory node has a preferred migration path to demote cold pages,
> attempt to move those inactive pages to that migration node before
> reclaiming. This will better utilize available memory, provide a faster
> tier than swapping or discarding, and allow such pages to be reused
> immediately without IO to retrieve the data.
> 
> When handling anonymous pages, this will be considered before swap if
> enabled. Should the demotion fail for any reason, the page reclaim
> will proceed as if the demotion feature was not enabled.
> 

Thanks for sharing these patches and kick-starting the conversation, Dave.

Could this cause us to break a user's mbind() or allow a user to 
circumvent their cpuset.mems?

Because we don't have a mapping of the page back to its allocation 
context (or the process context in which it was allocated), it seems like 
both are possible.

So let's assume that migration nodes cannot be other DRAM nodes.  
Otherwise, memory pressure could be intentionally or unintentionally 
induced to migrate these pages to another node.  Do we have such a 
restriction on migration nodes?

> Some places we would like to see this used:
> 
>   1. Persistent memory being as a slower, cheaper DRAM replacement
>   2. Remote memory-only "expansion" NUMA nodes
>   3. Resolving memory imbalances where one NUMA node is seeing more
>      allocation activity than another.  This helps keep more recent
>      allocations closer to the CPUs on the node doing the allocating.
> 

(3) is the concerning one given the above if we are to use 
migrate_demote_mapping() for DRAM node balancing.

> Yang Shi's patches used an alternative approach where to-be-discarded
> pages were collected on a separate discard list and then discarded
> as a batch with migrate_pages().  This results in simpler code and
> has all the performance advantages of batching, but has the
> disadvantage that pages which fail to migrate never get swapped.
> 
> #Signed-off-by: Keith Busch <keith.busch@intel.com>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Keith Busch <kbusch@kernel.org>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> ---
> 
>  b/include/linux/migrate.h        |    6 ++++
>  b/include/trace/events/migrate.h |    3 +-
>  b/mm/debug.c                     |    1 
>  b/mm/migrate.c                   |   52 +++++++++++++++++++++++++++++++++++++++
>  b/mm/vmscan.c                    |   25 ++++++++++++++++++
>  5 files changed, 86 insertions(+), 1 deletion(-)
> 
> diff -puN include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/linux/migrate.h
> --- a/include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.950312604 -0700
> +++ b/include/linux/migrate.h	2020-06-29 16:34:38.963312604 -0700
> @@ -25,6 +25,7 @@ enum migrate_reason {
>  	MR_MEMPOLICY_MBIND,
>  	MR_NUMA_MISPLACED,
>  	MR_CONTIG_RANGE,
> +	MR_DEMOTION,
>  	MR_TYPES
>  };
>  
> @@ -78,6 +79,7 @@ extern int migrate_huge_page_move_mappin
>  				  struct page *newpage, struct page *page);
>  extern int migrate_page_move_mapping(struct address_space *mapping,
>  		struct page *newpage, struct page *page, int extra_count);
> +extern int migrate_demote_mapping(struct page *page);
>  #else
>  
>  static inline void putback_movable_pages(struct list_head *l) {}
> @@ -104,6 +106,10 @@ static inline int migrate_huge_page_move
>  	return -ENOSYS;
>  }
>  
> +static inline int migrate_demote_mapping(struct page *page)
> +{
> +	return -ENOSYS;
> +}
>  #endif /* CONFIG_MIGRATION */
>  
>  #ifdef CONFIG_COMPACTION
> diff -puN include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/trace/events/migrate.h
> --- a/include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.952312604 -0700
> +++ b/include/trace/events/migrate.h	2020-06-29 16:34:38.963312604 -0700
> @@ -20,7 +20,8 @@
>  	EM( MR_SYSCALL,		"syscall_or_cpuset")		\
>  	EM( MR_MEMPOLICY_MBIND,	"mempolicy_mbind")		\
>  	EM( MR_NUMA_MISPLACED,	"numa_misplaced")		\
> -	EMe(MR_CONTIG_RANGE,	"contig_range")
> +	EM( MR_CONTIG_RANGE,	"contig_range")			\
> +	EMe(MR_DEMOTION,	"demotion")
>  
>  /*
>   * First define the enums in the above macros to be exported to userspace
> diff -puN mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/debug.c
> --- a/mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.954312604 -0700
> +++ b/mm/debug.c	2020-06-29 16:34:38.963312604 -0700
> @@ -25,6 +25,7 @@ const char *migrate_reason_names[MR_TYPE
>  	"mempolicy_mbind",
>  	"numa_misplaced",
>  	"cma",
> +	"demotion",
>  };
>  
>  const struct trace_print_flags pageflag_names[] = {
> diff -puN mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/migrate.c
> --- a/mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.956312604 -0700
> +++ b/mm/migrate.c	2020-06-29 16:34:38.964312604 -0700
> @@ -1151,6 +1151,58 @@ int next_demotion_node(int node)
>  	return node;
>  }
>  
> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
> +{
> +	/*
> +	 * 'mask' targets allocation only to the desired node in the
> +	 * migration path, and fails fast if the allocation can not be
> +	 * immediately satisfied.  Reclaim is already active and heroic
> +	 * allocation efforts are unwanted.
> +	 */
> +	gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
> +			__GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
> +			__GFP_MOVABLE;

GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we 
actually want to kick kswapd on the pmem node?

If not, GFP_TRANSHUGE_LIGHT does a trick where it does 
GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM.  You could probably do the same 
here although the __GFP_IO and __GFP_FS would be unnecessary (but not 
harmful).

> +	struct page *newpage;
> +
> +	if (PageTransHuge(page)) {
> +		mask |= __GFP_COMP;
> +		newpage = alloc_pages_node(node, mask, HPAGE_PMD_ORDER);
> +		if (newpage)
> +			prep_transhuge_page(newpage);
> +	} else
> +		newpage = alloc_pages_node(node, mask, 0);
> +
> +	return newpage;
> +}
> +
> +/**
> + * migrate_demote_mapping() - Migrate this page and its mappings to its
> + *                            demotion node.
> + * @page: A locked, isolated, non-huge page that should migrate to its current
> + *        node's demotion target, if available. Since this is intended to be
> + *        called during memory reclaim, all flag options are set to fail fast.
> + *
> + * @returns: MIGRATEPAGE_SUCCESS if successful, -errno otherwise.
> + */
> +int migrate_demote_mapping(struct page *page)
> +{
> +	int next_nid = next_demotion_node(page_to_nid(page));
> +
> +	VM_BUG_ON_PAGE(!PageLocked(page), page);
> +	VM_BUG_ON_PAGE(PageHuge(page), page);
> +	VM_BUG_ON_PAGE(PageLRU(page), page);
> +
> +	if (next_nid == NUMA_NO_NODE)
> +		return -ENOSYS;
> +	if (PageTransHuge(page) && !thp_migration_supported())
> +		return -ENOMEM;
> +
> +	/* MIGRATE_ASYNC is the most light weight and never blocks.*/
> +	return __unmap_and_move(alloc_demote_node_page, NULL, next_nid,
> +				page, MIGRATE_ASYNC, MR_DEMOTION);
> +}
> +
> +
>  /*
>   * gcc 4.7 and 4.8 on arm get an ICEs when inlining unmap_and_move().  Work
>   * around it.
> diff -puN mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/vmscan.c
> --- a/mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.959312604 -0700
> +++ b/mm/vmscan.c	2020-06-29 16:34:38.965312604 -0700
> @@ -1077,6 +1077,7 @@ static unsigned long shrink_page_list(st
>  	LIST_HEAD(free_pages);
>  	unsigned nr_reclaimed = 0;
>  	unsigned pgactivate = 0;
> +	int rc;
>  
>  	memset(stat, 0, sizeof(*stat));
>  	cond_resched();
> @@ -1229,6 +1230,30 @@ static unsigned long shrink_page_list(st
>  			; /* try to reclaim the page below */
>  		}
>  
> +		rc = migrate_demote_mapping(page);
> +		/*
> +		 * -ENOMEM on a THP may indicate either migration is
> +		 * unsupported or there was not enough contiguous
> +		 * space. Split the THP into base pages and retry the
> +		 * head immediately. The tail pages will be considered
> +		 * individually within the current loop's page list.
> +		 */
> +		if (rc == -ENOMEM && PageTransHuge(page) &&
> +		    !split_huge_page_to_list(page, page_list))
> +			rc = migrate_demote_mapping(page);
> +
> +		if (rc == MIGRATEPAGE_SUCCESS) {
> +			unlock_page(page);
> +			if (likely(put_page_testzero(page)))
> +				goto free_it;
> +			/*
> +			 * Speculative reference will free this page,
> +			 * so leave it off the LRU.
> +			 */
> +			nr_reclaimed++;

nr_reclaimed += nr_pages instead?

> +			continue;
> +		}
> +
>  		/*
>  		 * Anonymous process memory has backing store?
>  		 * Try to allocate it some swap space here.
Yang Shi July 1, 2020, 1:29 a.m. UTC | #2
On 6/30/20 5:47 PM, David Rientjes wrote:
> On Mon, 29 Jun 2020, Dave Hansen wrote:
>
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> If a memory node has a preferred migration path to demote cold pages,
>> attempt to move those inactive pages to that migration node before
>> reclaiming. This will better utilize available memory, provide a faster
>> tier than swapping or discarding, and allow such pages to be reused
>> immediately without IO to retrieve the data.
>>
>> When handling anonymous pages, this will be considered before swap if
>> enabled. Should the demotion fail for any reason, the page reclaim
>> will proceed as if the demotion feature was not enabled.
>>
> Thanks for sharing these patches and kick-starting the conversation, Dave.
>
> Could this cause us to break a user's mbind() or allow a user to
> circumvent their cpuset.mems?
>
> Because we don't have a mapping of the page back to its allocation
> context (or the process context in which it was allocated), it seems like
> both are possible.

Yes, this could break the memory placement policy enforced by mbind and 
cpuset. I discussed this with Michal on mailing list and tried to find a 
way to solve it, but unfortunately it seems not easy as what you 
mentioned above. The memory policy and cpuset is stored in task_struct 
rather than mm_struct. It is not easy to trace back to task_struct from 
page (owner field of mm_struct might be helpful, but it depends on 
CONFIG_MEMCG and is not preferred way).

>
> So let's assume that migration nodes cannot be other DRAM nodes.
> Otherwise, memory pressure could be intentionally or unintentionally
> induced to migrate these pages to another node.  Do we have such a
> restriction on migration nodes?
>
>> Some places we would like to see this used:
>>
>>    1. Persistent memory being as a slower, cheaper DRAM replacement
>>    2. Remote memory-only "expansion" NUMA nodes
>>    3. Resolving memory imbalances where one NUMA node is seeing more
>>       allocation activity than another.  This helps keep more recent
>>       allocations closer to the CPUs on the node doing the allocating.
>>
> (3) is the concerning one given the above if we are to use
> migrate_demote_mapping() for DRAM node balancing.
>
>> Yang Shi's patches used an alternative approach where to-be-discarded
>> pages were collected on a separate discard list and then discarded
>> as a batch with migrate_pages().  This results in simpler code and
>> has all the performance advantages of batching, but has the
>> disadvantage that pages which fail to migrate never get swapped.
>>
>> #Signed-off-by: Keith Busch <keith.busch@intel.com>
>> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Keith Busch <kbusch@kernel.org>
>> Cc: Yang Shi <yang.shi@linux.alibaba.com>
>> Cc: David Rientjes <rientjes@google.com>
>> Cc: Huang Ying <ying.huang@intel.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> ---
>>
>>   b/include/linux/migrate.h        |    6 ++++
>>   b/include/trace/events/migrate.h |    3 +-
>>   b/mm/debug.c                     |    1
>>   b/mm/migrate.c                   |   52 +++++++++++++++++++++++++++++++++++++++
>>   b/mm/vmscan.c                    |   25 ++++++++++++++++++
>>   5 files changed, 86 insertions(+), 1 deletion(-)
>>
>> diff -puN include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/linux/migrate.h
>> --- a/include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.950312604 -0700
>> +++ b/include/linux/migrate.h	2020-06-29 16:34:38.963312604 -0700
>> @@ -25,6 +25,7 @@ enum migrate_reason {
>>   	MR_MEMPOLICY_MBIND,
>>   	MR_NUMA_MISPLACED,
>>   	MR_CONTIG_RANGE,
>> +	MR_DEMOTION,
>>   	MR_TYPES
>>   };
>>   
>> @@ -78,6 +79,7 @@ extern int migrate_huge_page_move_mappin
>>   				  struct page *newpage, struct page *page);
>>   extern int migrate_page_move_mapping(struct address_space *mapping,
>>   		struct page *newpage, struct page *page, int extra_count);
>> +extern int migrate_demote_mapping(struct page *page);
>>   #else
>>   
>>   static inline void putback_movable_pages(struct list_head *l) {}
>> @@ -104,6 +106,10 @@ static inline int migrate_huge_page_move
>>   	return -ENOSYS;
>>   }
>>   
>> +static inline int migrate_demote_mapping(struct page *page)
>> +{
>> +	return -ENOSYS;
>> +}
>>   #endif /* CONFIG_MIGRATION */
>>   
>>   #ifdef CONFIG_COMPACTION
>> diff -puN include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/trace/events/migrate.h
>> --- a/include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.952312604 -0700
>> +++ b/include/trace/events/migrate.h	2020-06-29 16:34:38.963312604 -0700
>> @@ -20,7 +20,8 @@
>>   	EM( MR_SYSCALL,		"syscall_or_cpuset")		\
>>   	EM( MR_MEMPOLICY_MBIND,	"mempolicy_mbind")		\
>>   	EM( MR_NUMA_MISPLACED,	"numa_misplaced")		\
>> -	EMe(MR_CONTIG_RANGE,	"contig_range")
>> +	EM( MR_CONTIG_RANGE,	"contig_range")			\
>> +	EMe(MR_DEMOTION,	"demotion")
>>   
>>   /*
>>    * First define the enums in the above macros to be exported to userspace
>> diff -puN mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/debug.c
>> --- a/mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.954312604 -0700
>> +++ b/mm/debug.c	2020-06-29 16:34:38.963312604 -0700
>> @@ -25,6 +25,7 @@ const char *migrate_reason_names[MR_TYPE
>>   	"mempolicy_mbind",
>>   	"numa_misplaced",
>>   	"cma",
>> +	"demotion",
>>   };
>>   
>>   const struct trace_print_flags pageflag_names[] = {
>> diff -puN mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/migrate.c
>> --- a/mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.956312604 -0700
>> +++ b/mm/migrate.c	2020-06-29 16:34:38.964312604 -0700
>> @@ -1151,6 +1151,58 @@ int next_demotion_node(int node)
>>   	return node;
>>   }
>>   
>> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
>> +{
>> +	/*
>> +	 * 'mask' targets allocation only to the desired node in the
>> +	 * migration path, and fails fast if the allocation can not be
>> +	 * immediately satisfied.  Reclaim is already active and heroic
>> +	 * allocation efforts are unwanted.
>> +	 */
>> +	gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
>> +			__GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
>> +			__GFP_MOVABLE;
> GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we
> actually want to kick kswapd on the pmem node?
>
> If not, GFP_TRANSHUGE_LIGHT does a trick where it does
> GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM.  You could probably do the same
> here although the __GFP_IO and __GFP_FS would be unnecessary (but not
> harmful).

I'm not sure how Dave thought about this, however, IMHO kicking kswapd 
on pmem node would help to free memory then improve migration success 
rate. In my implementation, as Dave mentioned in the commit log, the 
migration candidates are put on a separate list then migrated in batch 
by calling migrate_pages(). Kicking kswapd on pmem would help to improve 
success rate since migrate_pages() will retry a couple of times.

Dave's implementation (as you see in this patch) does migration for per 
page basis, if migration is failed it will try swap. Kicking kswapd on 
pmem would also help the later migration. However, IMHO it seems 
migration retry should be still faster than swap.

>
>> +	struct page *newpage;
>> +
>> +	if (PageTransHuge(page)) {
>> +		mask |= __GFP_COMP;
>> +		newpage = alloc_pages_node(node, mask, HPAGE_PMD_ORDER);
>> +		if (newpage)
>> +			prep_transhuge_page(newpage);
>> +	} else
>> +		newpage = alloc_pages_node(node, mask, 0);
>> +
>> +	return newpage;
>> +}
>> +
>> +/**
>> + * migrate_demote_mapping() - Migrate this page and its mappings to its
>> + *                            demotion node.
>> + * @page: A locked, isolated, non-huge page that should migrate to its current
>> + *        node's demotion target, if available. Since this is intended to be
>> + *        called during memory reclaim, all flag options are set to fail fast.
>> + *
>> + * @returns: MIGRATEPAGE_SUCCESS if successful, -errno otherwise.
>> + */
>> +int migrate_demote_mapping(struct page *page)
>> +{
>> +	int next_nid = next_demotion_node(page_to_nid(page));
>> +
>> +	VM_BUG_ON_PAGE(!PageLocked(page), page);
>> +	VM_BUG_ON_PAGE(PageHuge(page), page);
>> +	VM_BUG_ON_PAGE(PageLRU(page), page);
>> +
>> +	if (next_nid == NUMA_NO_NODE)
>> +		return -ENOSYS;
>> +	if (PageTransHuge(page) && !thp_migration_supported())
>> +		return -ENOMEM;
>> +
>> +	/* MIGRATE_ASYNC is the most light weight and never blocks.*/
>> +	return __unmap_and_move(alloc_demote_node_page, NULL, next_nid,
>> +				page, MIGRATE_ASYNC, MR_DEMOTION);
>> +}
>> +
>> +
>>   /*
>>    * gcc 4.7 and 4.8 on arm get an ICEs when inlining unmap_and_move().  Work
>>    * around it.
>> diff -puN mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/vmscan.c
>> --- a/mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.959312604 -0700
>> +++ b/mm/vmscan.c	2020-06-29 16:34:38.965312604 -0700
>> @@ -1077,6 +1077,7 @@ static unsigned long shrink_page_list(st
>>   	LIST_HEAD(free_pages);
>>   	unsigned nr_reclaimed = 0;
>>   	unsigned pgactivate = 0;
>> +	int rc;
>>   
>>   	memset(stat, 0, sizeof(*stat));
>>   	cond_resched();
>> @@ -1229,6 +1230,30 @@ static unsigned long shrink_page_list(st
>>   			; /* try to reclaim the page below */
>>   		}
>>   
>> +		rc = migrate_demote_mapping(page);
>> +		/*
>> +		 * -ENOMEM on a THP may indicate either migration is
>> +		 * unsupported or there was not enough contiguous
>> +		 * space. Split the THP into base pages and retry the
>> +		 * head immediately. The tail pages will be considered
>> +		 * individually within the current loop's page list.
>> +		 */
>> +		if (rc == -ENOMEM && PageTransHuge(page) &&
>> +		    !split_huge_page_to_list(page, page_list))
>> +			rc = migrate_demote_mapping(page);
>> +
>> +		if (rc == MIGRATEPAGE_SUCCESS) {
>> +			unlock_page(page);
>> +			if (likely(put_page_testzero(page)))
>> +				goto free_it;
>> +			/*
>> +			 * Speculative reference will free this page,
>> +			 * so leave it off the LRU.
>> +			 */
>> +			nr_reclaimed++;
> nr_reclaimed += nr_pages instead?
>
>> +			continue;
>> +		}
>> +
>>   		/*
>>   		 * Anonymous process memory has backing store?
>>   		 * Try to allocate it some swap space here.
Huang, Ying July 1, 2020, 1:40 a.m. UTC | #3
David Rientjes <rientjes@google.com> writes:

> On Mon, 29 Jun 2020, Dave Hansen wrote:
>
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>> 
>> If a memory node has a preferred migration path to demote cold pages,
>> attempt to move those inactive pages to that migration node before
>> reclaiming. This will better utilize available memory, provide a faster
>> tier than swapping or discarding, and allow such pages to be reused
>> immediately without IO to retrieve the data.
>> 
>> When handling anonymous pages, this will be considered before swap if
>> enabled. Should the demotion fail for any reason, the page reclaim
>> will proceed as if the demotion feature was not enabled.
>> 
>
> Thanks for sharing these patches and kick-starting the conversation, Dave.
>
> Could this cause us to break a user's mbind() or allow a user to 
> circumvent their cpuset.mems?
>
> Because we don't have a mapping of the page back to its allocation 
> context (or the process context in which it was allocated), it seems like 
> both are possible.

For mbind, I think we don't have enough information during reclaim to
enforce the node binding policy.  But for cpuset, if cgroup v2 (with the
unified hierarchy) is used, it's possible to get the node binding policy
via something like,

  cgroup_get_e_css(page->mem_cgroup, &cpuset_cgrp_subsys)

> So let's assume that migration nodes cannot be other DRAM nodes.  
> Otherwise, memory pressure could be intentionally or unintentionally 
> induced to migrate these pages to another node.  Do we have such a 
> restriction on migration nodes?
>
>> Some places we would like to see this used:
>> 
>>   1. Persistent memory being as a slower, cheaper DRAM replacement
>>   2. Remote memory-only "expansion" NUMA nodes
>>   3. Resolving memory imbalances where one NUMA node is seeing more
>>      allocation activity than another.  This helps keep more recent
>>      allocations closer to the CPUs on the node doing the allocating.
>> 
>
> (3) is the concerning one given the above if we are to use 
> migrate_demote_mapping() for DRAM node balancing.
>
>> Yang Shi's patches used an alternative approach where to-be-discarded
>> pages were collected on a separate discard list and then discarded
>> as a batch with migrate_pages().  This results in simpler code and
>> has all the performance advantages of batching, but has the
>> disadvantage that pages which fail to migrate never get swapped.
>> 
>> #Signed-off-by: Keith Busch <keith.busch@intel.com>
>> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Keith Busch <kbusch@kernel.org>
>> Cc: Yang Shi <yang.shi@linux.alibaba.com>
>> Cc: David Rientjes <rientjes@google.com>
>> Cc: Huang Ying <ying.huang@intel.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> ---
>> 
>>  b/include/linux/migrate.h        |    6 ++++
>>  b/include/trace/events/migrate.h |    3 +-
>>  b/mm/debug.c                     |    1 
>>  b/mm/migrate.c                   |   52 +++++++++++++++++++++++++++++++++++++++
>>  b/mm/vmscan.c                    |   25 ++++++++++++++++++
>>  5 files changed, 86 insertions(+), 1 deletion(-)
>> 
>> diff -puN include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/linux/migrate.h
>> --- a/include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.950312604 -0700
>> +++ b/include/linux/migrate.h	2020-06-29 16:34:38.963312604 -0700
>> @@ -25,6 +25,7 @@ enum migrate_reason {
>>  	MR_MEMPOLICY_MBIND,
>>  	MR_NUMA_MISPLACED,
>>  	MR_CONTIG_RANGE,
>> +	MR_DEMOTION,
>>  	MR_TYPES
>>  };
>>  
>> @@ -78,6 +79,7 @@ extern int migrate_huge_page_move_mappin
>>  				  struct page *newpage, struct page *page);
>>  extern int migrate_page_move_mapping(struct address_space *mapping,
>>  		struct page *newpage, struct page *page, int extra_count);
>> +extern int migrate_demote_mapping(struct page *page);
>>  #else
>>  
>>  static inline void putback_movable_pages(struct list_head *l) {}
>> @@ -104,6 +106,10 @@ static inline int migrate_huge_page_move
>>  	return -ENOSYS;
>>  }
>>  
>> +static inline int migrate_demote_mapping(struct page *page)
>> +{
>> +	return -ENOSYS;
>> +}
>>  #endif /* CONFIG_MIGRATION */
>>  
>>  #ifdef CONFIG_COMPACTION
>> diff -puN include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/trace/events/migrate.h
>> --- a/include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.952312604 -0700
>> +++ b/include/trace/events/migrate.h	2020-06-29 16:34:38.963312604 -0700
>> @@ -20,7 +20,8 @@
>>  	EM( MR_SYSCALL,		"syscall_or_cpuset")		\
>>  	EM( MR_MEMPOLICY_MBIND,	"mempolicy_mbind")		\
>>  	EM( MR_NUMA_MISPLACED,	"numa_misplaced")		\
>> -	EMe(MR_CONTIG_RANGE,	"contig_range")
>> +	EM( MR_CONTIG_RANGE,	"contig_range")			\
>> +	EMe(MR_DEMOTION,	"demotion")
>>  
>>  /*
>>   * First define the enums in the above macros to be exported to userspace
>> diff -puN mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/debug.c
>> --- a/mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.954312604 -0700
>> +++ b/mm/debug.c	2020-06-29 16:34:38.963312604 -0700
>> @@ -25,6 +25,7 @@ const char *migrate_reason_names[MR_TYPE
>>  	"mempolicy_mbind",
>>  	"numa_misplaced",
>>  	"cma",
>> +	"demotion",
>>  };
>>  
>>  const struct trace_print_flags pageflag_names[] = {
>> diff -puN mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/migrate.c
>> --- a/mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.956312604 -0700
>> +++ b/mm/migrate.c	2020-06-29 16:34:38.964312604 -0700
>> @@ -1151,6 +1151,58 @@ int next_demotion_node(int node)
>>  	return node;
>>  }
>>  
>> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
>> +{
>> +	/*
>> +	 * 'mask' targets allocation only to the desired node in the
>> +	 * migration path, and fails fast if the allocation can not be
>> +	 * immediately satisfied.  Reclaim is already active and heroic
>> +	 * allocation efforts are unwanted.
>> +	 */
>> +	gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
>> +			__GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
>> +			__GFP_MOVABLE;
>
> GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we 
> actually want to kick kswapd on the pmem node?

I think it should be a good idea to kick kswapd on the PMEM node.
Because otherwise, we will discard more pages in DRAM node.  And in
general, the DRAM pages are hotter than the PMEM pages, because the cold
DRAM pages are migrated to the PMEM node.

> If not, GFP_TRANSHUGE_LIGHT does a trick where it does 
> GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM.  You could probably do the same 
> here although the __GFP_IO and __GFP_FS would be unnecessary (but not 
> harmful).
>
>> +	struct page *newpage;
>> +
>> +	if (PageTransHuge(page)) {
>> +		mask |= __GFP_COMP;
>> +		newpage = alloc_pages_node(node, mask, HPAGE_PMD_ORDER);
>> +		if (newpage)
>> +			prep_transhuge_page(newpage);
>> +	} else
>> +		newpage = alloc_pages_node(node, mask, 0);
>> +
>> +	return newpage;
>> +}
>> +

Best Regards,
Huang, Ying
David Rientjes July 1, 2020, 5:41 a.m. UTC | #4
On Tue, 30 Jun 2020, Yang Shi wrote:

> > > From: Dave Hansen <dave.hansen@linux.intel.com>
> > > 
> > > If a memory node has a preferred migration path to demote cold pages,
> > > attempt to move those inactive pages to that migration node before
> > > reclaiming. This will better utilize available memory, provide a faster
> > > tier than swapping or discarding, and allow such pages to be reused
> > > immediately without IO to retrieve the data.
> > > 
> > > When handling anonymous pages, this will be considered before swap if
> > > enabled. Should the demotion fail for any reason, the page reclaim
> > > will proceed as if the demotion feature was not enabled.
> > > 
> > Thanks for sharing these patches and kick-starting the conversation, Dave.
> > 
> > Could this cause us to break a user's mbind() or allow a user to
> > circumvent their cpuset.mems?
> > 
> > Because we don't have a mapping of the page back to its allocation
> > context (or the process context in which it was allocated), it seems like
> > both are possible.
> 
> Yes, this could break the memory placement policy enforced by mbind and
> cpuset. I discussed this with Michal on mailing list and tried to find a way
> to solve it, but unfortunately it seems not easy as what you mentioned above.
> The memory policy and cpuset is stored in task_struct rather than mm_struct.
> It is not easy to trace back to task_struct from page (owner field of
> mm_struct might be helpful, but it depends on CONFIG_MEMCG and is not
> preferred way).
> 

Yeah, and Ying made a similar response to this message.

We can do this if we consider pmem not to be a separate memory tier from 
the system perspective, however, but rather the socket perspective.  In 
other words, a node can only demote to a series of exclusive pmem ranges 
and promote to the same series of ranges in reverse order.  So DRAM node 0 
can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM 
node 3 -- a pmem range cannot be demoted to, or promoted from, more than 
one DRAM node.

This naturally takes care of mbind() and cpuset.mems if we consider pmem 
just to be slower volatile memory and we don't need to deal with the 
latency concerns of cross socket migration.  A user page will never be 
demoted to a pmem range across the socket and will never be promoted to a 
different DRAM node that it doesn't have access to.

That can work with the NUMA abstraction for pmem, but it could also 
theoretically be a new memory zone instead.  If all memory living on pmem 
is migratable (the natural way that memory hotplug is done, so we can 
offline), this zone would live above ZONE_MOVABLE.  Zonelist ordering 
would determine whether we can allocate directly from this memory based on 
system config or a new gfp flag that could be set for users of a mempolicy 
that allows allocations directly from pmem.  If abstracted as a NUMA node 
instead, interleave over nodes {0,2,3} or a cpuset.mems of {0,2,3} doesn't 
make much sense.

Kswapd would need to be enlightened for proper pgdat and pmem balancing 
but in theory it should be simpler because it only has its own node to 
manage.  Existing per-zone watermarks might be easy to use to fine tune 
the policy from userspace: the scale factor determines how much memory we 
try to keep free on DRAM for migration from pmem, for example.  We also 
wouldn't have to deal with node hotplug or updating of demotion/promotion 
node chains.

Maybe the strongest advantage of the node abstraction is the ability to 
use autonuma and migrate_pages()/move_pages() API for moving pages 
explicitly?  Mempolicies could be used for migration to "top-tier" memory, 
i.e. ZONE_NORMAL or ZONE_MOVABLE, instead.
Huang, Ying July 1, 2020, 8:54 a.m. UTC | #5
David Rientjes <rientjes@google.com> writes:

> On Tue, 30 Jun 2020, Yang Shi wrote:
>
>> > > From: Dave Hansen <dave.hansen@linux.intel.com>
>> > > 
>> > > If a memory node has a preferred migration path to demote cold pages,
>> > > attempt to move those inactive pages to that migration node before
>> > > reclaiming. This will better utilize available memory, provide a faster
>> > > tier than swapping or discarding, and allow such pages to be reused
>> > > immediately without IO to retrieve the data.
>> > > 
>> > > When handling anonymous pages, this will be considered before swap if
>> > > enabled. Should the demotion fail for any reason, the page reclaim
>> > > will proceed as if the demotion feature was not enabled.
>> > > 
>> > Thanks for sharing these patches and kick-starting the conversation, Dave.
>> > 
>> > Could this cause us to break a user's mbind() or allow a user to
>> > circumvent their cpuset.mems?
>> > 
>> > Because we don't have a mapping of the page back to its allocation
>> > context (or the process context in which it was allocated), it seems like
>> > both are possible.
>> 
>> Yes, this could break the memory placement policy enforced by mbind and
>> cpuset. I discussed this with Michal on mailing list and tried to find a way
>> to solve it, but unfortunately it seems not easy as what you mentioned above.
>> The memory policy and cpuset is stored in task_struct rather than mm_struct.
>> It is not easy to trace back to task_struct from page (owner field of
>> mm_struct might be helpful, but it depends on CONFIG_MEMCG and is not
>> preferred way).
>> 
>
> Yeah, and Ying made a similar response to this message.
>
> We can do this if we consider pmem not to be a separate memory tier from 
> the system perspective, however, but rather the socket perspective.  In 
> other words, a node can only demote to a series of exclusive pmem ranges 
> and promote to the same series of ranges in reverse order.  So DRAM node 0 
> can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM 
> node 3 -- a pmem range cannot be demoted to, or promoted from, more than 
> one DRAM node.
>
> This naturally takes care of mbind() and cpuset.mems if we consider pmem 
> just to be slower volatile memory and we don't need to deal with the 
> latency concerns of cross socket migration.  A user page will never be 
> demoted to a pmem range across the socket and will never be promoted to a 
> different DRAM node that it doesn't have access to.
>
> That can work with the NUMA abstraction for pmem, but it could also 
> theoretically be a new memory zone instead.  If all memory living on pmem 
> is migratable (the natural way that memory hotplug is done, so we can 
> offline), this zone would live above ZONE_MOVABLE.  Zonelist ordering 
> would determine whether we can allocate directly from this memory based on 
> system config or a new gfp flag that could be set for users of a mempolicy 
> that allows allocations directly from pmem.  If abstracted as a NUMA node 
> instead, interleave over nodes {0,2,3} or a cpuset.mems of {0,2,3} doesn't 
> make much sense.

Why can not we just bind the memory of the application to node 0, 2, 3
via mbind() or cpuset.mems?  Then the application can allocate memory
directly from PMEM.  And if we bind the memory of the application via
mbind() to node 0, we can only allocate memory directly from DRAM.

Best Regards,
Huang, Ying
Dave Hansen July 1, 2020, 3:15 p.m. UTC | #6
On 6/30/20 10:41 PM, David Rientjes wrote:
> Maybe the strongest advantage of the node abstraction is the ability to 
> use autonuma and migrate_pages()/move_pages() API for moving pages 
> explicitly?  Mempolicies could be used for migration to "top-tier" memory, 
> i.e. ZONE_NORMAL or ZONE_MOVABLE, instead.

I totally agree that we _could_ introduce this new memory class as a zone.

Doing it as nodes is pretty natural since the firmware today describes
both slow (versus DRAM) and fast memory as separate nodes.  It also
means that apps can get visibility into placement with existing NUMA
tooling and ABIs.  To me, those are the two strongest reasons for PMEM.

Looking to the future, I don't think the zone approach scales.  I know
folks want to build stuff within a single socket which is a mix of:

1. High-Bandwidth, on-package memory (a la MCDRAM)
2. DRAM
3. DRAM-cached PMEM (aka. "memory mode" PMEM)
4. Non-cached PMEM

Right now, #1 doesn't exist on modern platform  and #3/#4 can't be mixed
(you only get 3 _or_ 4 at once).  I'd love to provide something here
that Intel can use to build future crazy platform configurations that
don't require kernel enabling.
Dave Hansen July 1, 2020, 4:48 p.m. UTC | #7
On 6/30/20 5:47 PM, David Rientjes wrote:
> On Mon, 29 Jun 2020, Dave Hansen wrote:
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> If a memory node has a preferred migration path to demote cold pages,
>> attempt to move those inactive pages to that migration node before
>> reclaiming. This will better utilize available memory, provide a faster
>> tier than swapping or discarding, and allow such pages to be reused
>> immediately without IO to retrieve the data.
>>
>> When handling anonymous pages, this will be considered before swap if
>> enabled. Should the demotion fail for any reason, the page reclaim
>> will proceed as if the demotion feature was not enabled.
>>
> 
> Thanks for sharing these patches and kick-starting the conversation, Dave.
> 
> Could this cause us to break a user's mbind() or allow a user to 
> circumvent their cpuset.mems?

In its current form, yes.

My current rationale for this is that while it's not as deferential as
it can be to the user/kernel ABI contract, it's good *overall* behavior.
 The auto-migration only kicks in when the data is about to go away.  So
while the user's data might be slower than they like, it is *WAY* faster
than they deserve because it should be off on the disk.

> Because we don't have a mapping of the page back to its allocation 
> context (or the process context in which it was allocated), it seems like 
> both are possible.
> 
> So let's assume that migration nodes cannot be other DRAM nodes.  
> Otherwise, memory pressure could be intentionally or unintentionally 
> induced to migrate these pages to another node.  Do we have such a 
> restriction on migration nodes?

There's nothing explicit.  On a normal, balanced system where there's a
1:1:1 relationship between CPU sockets, DRAM nodes and PMEM nodes, it's
implicit since the migration path is one deep and goes from DRAM->PMEM.

If there were some oddball system where there was a memory only DRAM
node, it might very well end up being a migration target.

>> Some places we would like to see this used:
>>
>>   1. Persistent memory being as a slower, cheaper DRAM replacement
>>   2. Remote memory-only "expansion" NUMA nodes
>>   3. Resolving memory imbalances where one NUMA node is seeing more
>>      allocation activity than another.  This helps keep more recent
>>      allocations closer to the CPUs on the node doing the allocating.
> 
> (3) is the concerning one given the above if we are to use 
> migrate_demote_mapping() for DRAM node balancing.

Yeah, agreed.  That's the sketchiest of the three.  :)

>> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
>> +{
>> +	/*
>> +	 * 'mask' targets allocation only to the desired node in the
>> +	 * migration path, and fails fast if the allocation can not be
>> +	 * immediately satisfied.  Reclaim is already active and heroic
>> +	 * allocation efforts are unwanted.
>> +	 */
>> +	gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
>> +			__GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
>> +			__GFP_MOVABLE;
> 
> GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we 
> actually want to kick kswapd on the pmem node?

In my mental model, cold data flows from:

	DRAM -> PMEM -> swap

Kicking kswapd here ensures that while we're doing DRAM->PMEM migrations
for kinda cold data, kswapd can be working on doing the PMEM->swap part
on really cold data.

...
>> @@ -1229,6 +1230,30 @@ static unsigned long shrink_page_list(st
>>  			; /* try to reclaim the page below */
>>  		}
>>  
>> +		rc = migrate_demote_mapping(page);
>> +		/*
>> +		 * -ENOMEM on a THP may indicate either migration is
>> +		 * unsupported or there was not enough contiguous
>> +		 * space. Split the THP into base pages and retry the
>> +		 * head immediately. The tail pages will be considered
>> +		 * individually within the current loop's page list.
>> +		 */
>> +		if (rc == -ENOMEM && PageTransHuge(page) &&
>> +		    !split_huge_page_to_list(page, page_list))
>> +			rc = migrate_demote_mapping(page);
>> +
>> +		if (rc == MIGRATEPAGE_SUCCESS) {
>> +			unlock_page(page);
>> +			if (likely(put_page_testzero(page)))
>> +				goto free_it;
>> +			/*
>> +			 * Speculative reference will free this page,
>> +			 * so leave it off the LRU.
>> +			 */
>> +			nr_reclaimed++;
> 
> nr_reclaimed += nr_pages instead?

Oh, good catch.  I also need to go double-check that 'nr_pages' isn't
wrong elsewhere because of the split.
Yang Shi July 1, 2020, 5:21 p.m. UTC | #8
On 6/30/20 10:41 PM, David Rientjes wrote:
> On Tue, 30 Jun 2020, Yang Shi wrote:
>
>>>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>>>
>>>> If a memory node has a preferred migration path to demote cold pages,
>>>> attempt to move those inactive pages to that migration node before
>>>> reclaiming. This will better utilize available memory, provide a faster
>>>> tier than swapping or discarding, and allow such pages to be reused
>>>> immediately without IO to retrieve the data.
>>>>
>>>> When handling anonymous pages, this will be considered before swap if
>>>> enabled. Should the demotion fail for any reason, the page reclaim
>>>> will proceed as if the demotion feature was not enabled.
>>>>
>>> Thanks for sharing these patches and kick-starting the conversation, Dave.
>>>
>>> Could this cause us to break a user's mbind() or allow a user to
>>> circumvent their cpuset.mems?
>>>
>>> Because we don't have a mapping of the page back to its allocation
>>> context (or the process context in which it was allocated), it seems like
>>> both are possible.
>> Yes, this could break the memory placement policy enforced by mbind and
>> cpuset. I discussed this with Michal on mailing list and tried to find a way
>> to solve it, but unfortunately it seems not easy as what you mentioned above.
>> The memory policy and cpuset is stored in task_struct rather than mm_struct.
>> It is not easy to trace back to task_struct from page (owner field of
>> mm_struct might be helpful, but it depends on CONFIG_MEMCG and is not
>> preferred way).
>>
> Yeah, and Ying made a similar response to this message.
>
> We can do this if we consider pmem not to be a separate memory tier from
> the system perspective, however, but rather the socket perspective.  In
> other words, a node can only demote to a series of exclusive pmem ranges
> and promote to the same series of ranges in reverse order.  So DRAM node 0
> can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM
> node 3 -- a pmem range cannot be demoted to, or promoted from, more than
> one DRAM node.
>
> This naturally takes care of mbind() and cpuset.mems if we consider pmem
> just to be slower volatile memory and we don't need to deal with the
> latency concerns of cross socket migration.  A user page will never be
> demoted to a pmem range across the socket and will never be promoted to a
> different DRAM node that it doesn't have access to.

But I don't see too much benefit to limit the migration target to the 
so-called *paired* pmem node. IMHO it is fine to migrate to a remote (on 
a different socket) pmem node since even the cross socket access should 
be much faster then refault or swap from disk.

>
> That can work with the NUMA abstraction for pmem, but it could also
> theoretically be a new memory zone instead.  If all memory living on pmem
> is migratable (the natural way that memory hotplug is done, so we can
> offline), this zone would live above ZONE_MOVABLE.  Zonelist ordering
> would determine whether we can allocate directly from this memory based on
> system config or a new gfp flag that could be set for users of a mempolicy
> that allows allocations directly from pmem.  If abstracted as a NUMA node
> instead, interleave over nodes {0,2,3} or a cpuset.mems of {0,2,3} doesn't
> make much sense.
>
> Kswapd would need to be enlightened for proper pgdat and pmem balancing
> but in theory it should be simpler because it only has its own node to
> manage.  Existing per-zone watermarks might be easy to use to fine tune
> the policy from userspace: the scale factor determines how much memory we
> try to keep free on DRAM for migration from pmem, for example.  We also
> wouldn't have to deal with node hotplug or updating of demotion/promotion
> node chains.
>
> Maybe the strongest advantage of the node abstraction is the ability to
> use autonuma and migrate_pages()/move_pages() API for moving pages
> explicitly?  Mempolicies could be used for migration to "top-tier" memory,
> i.e. ZONE_NORMAL or ZONE_MOVABLE, instead.

I think using pmem as a node is more natural than zone and less 
intrusive since we can just reuse all the numa APIs. If we treat pmem as 
a new zone I think the implementation may be more intrusive and 
complicated (i.e. need a new gfp flag) and user can't control the memory 
placement.

Actually there had been such proposal before, please see 
https://www.spinics.net/lists/linux-mm/msg151788.html
Dave Hansen July 1, 2020, 6:20 p.m. UTC | #9
On 7/1/20 1:54 AM, Huang, Ying wrote:
> Why can not we just bind the memory of the application to node 0, 2, 3
> via mbind() or cpuset.mems?  Then the application can allocate memory
> directly from PMEM.  And if we bind the memory of the application via
> mbind() to node 0, we can only allocate memory directly from DRAM.

Applications use cpuset.mems precisely because they don't want to
allocate directly from PMEM.  They want the good, deterministic,
performance they get from DRAM.

Even if they don't allocate directly from PMEM, is it OK for such an app
to get its cold data migrated to PMEM?  That's a much more subtle
question and I suspect the kernel isn't going to have a single answer
for it.  I suspect we'll need a cpuset-level knob to turn auto-demotion
on or off.
David Rientjes July 1, 2020, 7:25 p.m. UTC | #10
On Wed, 1 Jul 2020, Dave Hansen wrote:

> > Could this cause us to break a user's mbind() or allow a user to 
> > circumvent their cpuset.mems?
> 
> In its current form, yes.
> 
> My current rationale for this is that while it's not as deferential as
> it can be to the user/kernel ABI contract, it's good *overall* behavior.
>  The auto-migration only kicks in when the data is about to go away.  So
> while the user's data might be slower than they like, it is *WAY* faster
> than they deserve because it should be off on the disk.
> 

It's outside the scope of this patchset, but eventually there will be a 
promotion path that I think requires a strict 1:1 relationship between 
DRAM and PMEM nodes because otherwise mbind(), set_mempolicy(), and 
cpuset.mems become ineffective for nodes facing memory pressure.

For the purposes of this patchset, agreed that DRAM -> PMEM -> swap makes 
perfect sense.  Theoretically, I think you could have DRAM N0 and N1 and 
then a single PMEM N2 and this N2 can be the terminal node for both N0 and 
N1.  On promotion, I think we need to rely on something stronger than 
autonuma to decide which DRAM node to promote to: specifically any user 
policy put into effect (memory tiering or autonuma shouldn't be allowed to 
subvert these user policies).

As others have mentioned, we lose the allocation or process context at the 
time of demotion or promotion and any workaround for that requires some 
hacks, such as mapping the page to cpuset (what is the right solution for 
shared pages?) or adding NUMA locality handling to memcg.

I think a 1:1 relationship between DRAM and PMEM nodes is required if we 
consider the eventual promotion of this memory so that user memory can't 
eventually reappear on a DRAM node that is not allowed by mbind(), 
set_mempolicy(), or cpuset.mems.  I think it also makes this patchset much 
simpler.

> > Because we don't have a mapping of the page back to its allocation 
> > context (or the process context in which it was allocated), it seems like 
> > both are possible.
> > 
> > So let's assume that migration nodes cannot be other DRAM nodes.  
> > Otherwise, memory pressure could be intentionally or unintentionally 
> > induced to migrate these pages to another node.  Do we have such a 
> > restriction on migration nodes?
> 
> There's nothing explicit.  On a normal, balanced system where there's a
> 1:1:1 relationship between CPU sockets, DRAM nodes and PMEM nodes, it's
> implicit since the migration path is one deep and goes from DRAM->PMEM.
> 
> If there were some oddball system where there was a memory only DRAM
> node, it might very well end up being a migration target.
> 

Shouldn't DRAM->DRAM demotion be banned?  It's all DRAM and within the 
control of mempolicies and cpusets today, so I had assumed this is outside 
the scope of memory tiering support.  I had assumed that memory tiering 
support was all about separate tiers :)

> >> +static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
> >> +{
> >> +	/*
> >> +	 * 'mask' targets allocation only to the desired node in the
> >> +	 * migration path, and fails fast if the allocation can not be
> >> +	 * immediately satisfied.  Reclaim is already active and heroic
> >> +	 * allocation efforts are unwanted.
> >> +	 */
> >> +	gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
> >> +			__GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
> >> +			__GFP_MOVABLE;
> > 
> > GFP_NOWAIT has the side-effect that it does __GFP_KSWAPD_RECLAIM: do we 
> > actually want to kick kswapd on the pmem node?
> 
> In my mental model, cold data flows from:
> 
> 	DRAM -> PMEM -> swap
> 
> Kicking kswapd here ensures that while we're doing DRAM->PMEM migrations
> for kinda cold data, kswapd can be working on doing the PMEM->swap part
> on really cold data.
> 

Makes sense.
David Rientjes July 1, 2020, 7:45 p.m. UTC | #11
On Wed, 1 Jul 2020, Yang Shi wrote:

> > We can do this if we consider pmem not to be a separate memory tier from
> > the system perspective, however, but rather the socket perspective.  In
> > other words, a node can only demote to a series of exclusive pmem ranges
> > and promote to the same series of ranges in reverse order.  So DRAM node 0
> > can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM
> > node 3 -- a pmem range cannot be demoted to, or promoted from, more than
> > one DRAM node.
> > 
> > This naturally takes care of mbind() and cpuset.mems if we consider pmem
> > just to be slower volatile memory and we don't need to deal with the
> > latency concerns of cross socket migration.  A user page will never be
> > demoted to a pmem range across the socket and will never be promoted to a
> > different DRAM node that it doesn't have access to.
> 
> But I don't see too much benefit to limit the migration target to the
> so-called *paired* pmem node. IMHO it is fine to migrate to a remote (on a
> different socket) pmem node since even the cross socket access should be much
> faster then refault or swap from disk.
> 

Hi Yang,

Right, but any eventual promotion path would allow this to subvert the 
user mempolicy or cpuset.mems if the demoted memory is eventually promoted 
to a DRAM node on its socket.  We've discussed not having the ability to 
map from the demoted page to either of these contexts and it becomes more 
difficult for shared memory.  We have page_to_nid() and page_zone() so we 
can always find the appropriate demotion or promotion node for a given 
page if there is a 1:1 relationship.

Do we lose anything with the strict 1:1 relationship between DRAM and PMEM 
nodes?  It seems much simpler in terms of implementation and is more 
intuitive.

> I think using pmem as a node is more natural than zone and less intrusive
> since we can just reuse all the numa APIs. If we treat pmem as a new zone I
> think the implementation may be more intrusive and complicated (i.e. need a
> new gfp flag) and user can't control the memory placement.
> 

This is an important decision to make, I'm not sure that we actually 
*want* all of these NUMA APIs :)  If my memory is demoted, I can simply do 
migrate_pages() back to DRAM and cause other memory to be demoted in its 
place.  Things like MPOL_INTERLEAVE over nodes {0,1,2} don't make sense.  
Kswapd for a DRAM node putting pressure on a PMEM node for demotion that 
then puts the kswapd for the PMEM node under pressure to reclaim it serves 
*only* to spend unnecessary cpu cycles.

Users could control the memory placement through a new mempolicy flag, 
which I think are needed anyway for explicit allocation policies for PMEM 
nodes.  Consider if PMEM is a zone so that it has the natural 1:1 
relationship with DRAM, now your system only has nodes {0,1} as today, no 
new NUMA topology to consider, and a mempolicy flag MPOL_F_TOPTIER that 
specifies memory must be allocated from ZONE_MOVABLE or ZONE_NORMAL (and I 
can then mlock() if I want to disable demotion on memory pressure).
David Rientjes July 1, 2020, 7:50 p.m. UTC | #12
On Wed, 1 Jul 2020, Dave Hansen wrote:

> Even if they don't allocate directly from PMEM, is it OK for such an app
> to get its cold data migrated to PMEM?  That's a much more subtle
> question and I suspect the kernel isn't going to have a single answer
> for it.  I suspect we'll need a cpuset-level knob to turn auto-demotion
> on or off.
> 

I think the answer is whether the app's cold data can be reclaimed, 
otherwise migration to PMEM is likely better in terms of performance.  So 
any such app today should just be mlocking its cold data if it can't 
handle overhead from reclaim?
Huang, Ying July 2, 2020, 1:50 a.m. UTC | #13
David Rientjes <rientjes@google.com> writes:

> On Wed, 1 Jul 2020, Dave Hansen wrote:
>
>> Even if they don't allocate directly from PMEM, is it OK for such an app
>> to get its cold data migrated to PMEM?  That's a much more subtle
>> question and I suspect the kernel isn't going to have a single answer
>> for it.  I suspect we'll need a cpuset-level knob to turn auto-demotion
>> on or off.
>> 
>
> I think the answer is whether the app's cold data can be reclaimed, 
> otherwise migration to PMEM is likely better in terms of performance.  So 
> any such app today should just be mlocking its cold data if it can't 
> handle overhead from reclaim?

Yes.  That's a way to solve the problem.  A cpuset-level knob may be
more flexible, because you don't need to change the application source
code.

Best Regards,
Huang, Ying
Huang, Ying July 2, 2020, 5:02 a.m. UTC | #14
David Rientjes <rientjes@google.com> writes:

> On Wed, 1 Jul 2020, Dave Hansen wrote:
>
>> > Could this cause us to break a user's mbind() or allow a user to 
>> > circumvent their cpuset.mems?
>> 
>> In its current form, yes.
>> 
>> My current rationale for this is that while it's not as deferential as
>> it can be to the user/kernel ABI contract, it's good *overall* behavior.
>>  The auto-migration only kicks in when the data is about to go away.  So
>> while the user's data might be slower than they like, it is *WAY* faster
>> than they deserve because it should be off on the disk.
>> 
>
> It's outside the scope of this patchset, but eventually there will be a 
> promotion path that I think requires a strict 1:1 relationship between 
> DRAM and PMEM nodes because otherwise mbind(), set_mempolicy(), and 
> cpuset.mems become ineffective for nodes facing memory pressure.

I have posted an patchset for AutoNUMA based promotion support,

https://lore.kernel.org/lkml/20200218082634.1596727-1-ying.huang@intel.com/

Where, the page is promoted upon NUMA hint page fault.  So all memory
policy (mbind(), set_mempolicy(), and cpuset.mems) are available.  We
can refuse promoting the page to the DRAM nodes that are not allowed by
any memory policy.  So, 1:1 relationship isn't necessary for promotion.

> For the purposes of this patchset, agreed that DRAM -> PMEM -> swap makes 
> perfect sense.  Theoretically, I think you could have DRAM N0 and N1 and 
> then a single PMEM N2 and this N2 can be the terminal node for both N0 and 
> N1.  On promotion, I think we need to rely on something stronger than 
> autonuma to decide which DRAM node to promote to: specifically any user 
> policy put into effect (memory tiering or autonuma shouldn't be allowed to 
> subvert these user policies).
>
> As others have mentioned, we lose the allocation or process context at the 
> time of demotion or promotion

As above, we have process context at time of promotion.

> and any workaround for that requires some 
> hacks, such as mapping the page to cpuset (what is the right solution for 
> shared pages?) or adding NUMA locality handling to memcg.

It sounds natural to me to add NUMA nodes restriction to memcg.

Best Regards,
Huang, Ying
Jonathan Cameron July 2, 2020, 10:02 a.m. UTC | #15
On Wed, 1 Jul 2020 12:45:17 -0700
David Rientjes <rientjes@google.com> wrote:

> On Wed, 1 Jul 2020, Yang Shi wrote:
> 
> > > We can do this if we consider pmem not to be a separate memory tier from
> > > the system perspective, however, but rather the socket perspective.  In
> > > other words, a node can only demote to a series of exclusive pmem ranges
> > > and promote to the same series of ranges in reverse order.  So DRAM node 0
> > > can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM
> > > node 3 -- a pmem range cannot be demoted to, or promoted from, more than
> > > one DRAM node.
> > > 
> > > This naturally takes care of mbind() and cpuset.mems if we consider pmem
> > > just to be slower volatile memory and we don't need to deal with the
> > > latency concerns of cross socket migration.  A user page will never be
> > > demoted to a pmem range across the socket and will never be promoted to a
> > > different DRAM node that it doesn't have access to.  
> > 
> > But I don't see too much benefit to limit the migration target to the
> > so-called *paired* pmem node. IMHO it is fine to migrate to a remote (on a
> > different socket) pmem node since even the cross socket access should be much
> > faster then refault or swap from disk.
> >   
> 
> Hi Yang,
> 
> Right, but any eventual promotion path would allow this to subvert the 
> user mempolicy or cpuset.mems if the demoted memory is eventually promoted 
> to a DRAM node on its socket.  We've discussed not having the ability to 
> map from the demoted page to either of these contexts and it becomes more 
> difficult for shared memory.  We have page_to_nid() and page_zone() so we 
> can always find the appropriate demotion or promotion node for a given 
> page if there is a 1:1 relationship.
> 
> Do we lose anything with the strict 1:1 relationship between DRAM and PMEM 
> nodes?  It seems much simpler in terms of implementation and is more 
> intuitive.
Hi David, Yang,

The 1:1 mapping implies a particular system topology.  In the medium
term we are likely to see systems with a central pool of persistent memory
with equal access characteristics from multiple CPU containing nodes, each
with local DRAM. 

Clearly we could fake a split of such a pmem pool to keep the 1:1 mapping
but it's certainly not elegant and may be very wasteful for resources.

Can a zone based approach work well without such a hard wall?

Jonathan

> 
> > I think using pmem as a node is more natural than zone and less intrusive
> > since we can just reuse all the numa APIs. If we treat pmem as a new zone I
> > think the implementation may be more intrusive and complicated (i.e. need a
> > new gfp flag) and user can't control the memory placement.
> >   
> 
> This is an important decision to make, I'm not sure that we actually 
> *want* all of these NUMA APIs :)  If my memory is demoted, I can simply do 
> migrate_pages() back to DRAM and cause other memory to be demoted in its 
> place.  Things like MPOL_INTERLEAVE over nodes {0,1,2} don't make sense.  
> Kswapd for a DRAM node putting pressure on a PMEM node for demotion that 
> then puts the kswapd for the PMEM node under pressure to reclaim it serves 
> *only* to spend unnecessary cpu cycles.
> 
> Users could control the memory placement through a new mempolicy flag, 
> which I think are needed anyway for explicit allocation policies for PMEM 
> nodes.  Consider if PMEM is a zone so that it has the natural 1:1 
> relationship with DRAM, now your system only has nodes {0,1} as today, no 
> new NUMA topology to consider, and a mempolicy flag MPOL_F_TOPTIER that 
> specifies memory must be allocated from ZONE_MOVABLE or ZONE_NORMAL (and I 
> can then mlock() if I want to disable demotion on memory pressure).
>
diff mbox series

Patch

diff -puN include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/linux/migrate.h
--- a/include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.950312604 -0700
+++ b/include/linux/migrate.h	2020-06-29 16:34:38.963312604 -0700
@@ -25,6 +25,7 @@  enum migrate_reason {
 	MR_MEMPOLICY_MBIND,
 	MR_NUMA_MISPLACED,
 	MR_CONTIG_RANGE,
+	MR_DEMOTION,
 	MR_TYPES
 };
 
@@ -78,6 +79,7 @@  extern int migrate_huge_page_move_mappin
 				  struct page *newpage, struct page *page);
 extern int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page, int extra_count);
+extern int migrate_demote_mapping(struct page *page);
 #else
 
 static inline void putback_movable_pages(struct list_head *l) {}
@@ -104,6 +106,10 @@  static inline int migrate_huge_page_move
 	return -ENOSYS;
 }
 
+static inline int migrate_demote_mapping(struct page *page)
+{
+	return -ENOSYS;
+}
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_COMPACTION
diff -puN include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard include/trace/events/migrate.h
--- a/include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.952312604 -0700
+++ b/include/trace/events/migrate.h	2020-06-29 16:34:38.963312604 -0700
@@ -20,7 +20,8 @@ 
 	EM( MR_SYSCALL,		"syscall_or_cpuset")		\
 	EM( MR_MEMPOLICY_MBIND,	"mempolicy_mbind")		\
 	EM( MR_NUMA_MISPLACED,	"numa_misplaced")		\
-	EMe(MR_CONTIG_RANGE,	"contig_range")
+	EM( MR_CONTIG_RANGE,	"contig_range")			\
+	EMe(MR_DEMOTION,	"demotion")
 
 /*
  * First define the enums in the above macros to be exported to userspace
diff -puN mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/debug.c
--- a/mm/debug.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.954312604 -0700
+++ b/mm/debug.c	2020-06-29 16:34:38.963312604 -0700
@@ -25,6 +25,7 @@  const char *migrate_reason_names[MR_TYPE
 	"mempolicy_mbind",
 	"numa_misplaced",
 	"cma",
+	"demotion",
 };
 
 const struct trace_print_flags pageflag_names[] = {
diff -puN mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/migrate.c
--- a/mm/migrate.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.956312604 -0700
+++ b/mm/migrate.c	2020-06-29 16:34:38.964312604 -0700
@@ -1151,6 +1151,58 @@  int next_demotion_node(int node)
 	return node;
 }
 
+static struct page *alloc_demote_node_page(struct page *page, unsigned long node)
+{
+	/*
+	 * 'mask' targets allocation only to the desired node in the
+	 * migration path, and fails fast if the allocation can not be
+	 * immediately satisfied.  Reclaim is already active and heroic
+	 * allocation efforts are unwanted.
+	 */
+	gfp_t mask = GFP_NOWAIT | __GFP_NOWARN | __GFP_NORETRY |
+			__GFP_NOMEMALLOC | __GFP_THISNODE | __GFP_HIGHMEM |
+			__GFP_MOVABLE;
+	struct page *newpage;
+
+	if (PageTransHuge(page)) {
+		mask |= __GFP_COMP;
+		newpage = alloc_pages_node(node, mask, HPAGE_PMD_ORDER);
+		if (newpage)
+			prep_transhuge_page(newpage);
+	} else
+		newpage = alloc_pages_node(node, mask, 0);
+
+	return newpage;
+}
+
+/**
+ * migrate_demote_mapping() - Migrate this page and its mappings to its
+ *                            demotion node.
+ * @page: A locked, isolated, non-huge page that should migrate to its current
+ *        node's demotion target, if available. Since this is intended to be
+ *        called during memory reclaim, all flag options are set to fail fast.
+ *
+ * @returns: MIGRATEPAGE_SUCCESS if successful, -errno otherwise.
+ */
+int migrate_demote_mapping(struct page *page)
+{
+	int next_nid = next_demotion_node(page_to_nid(page));
+
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(PageHuge(page), page);
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+
+	if (next_nid == NUMA_NO_NODE)
+		return -ENOSYS;
+	if (PageTransHuge(page) && !thp_migration_supported())
+		return -ENOMEM;
+
+	/* MIGRATE_ASYNC is the most light weight and never blocks.*/
+	return __unmap_and_move(alloc_demote_node_page, NULL, next_nid,
+				page, MIGRATE_ASYNC, MR_DEMOTION);
+}
+
+
 /*
  * gcc 4.7 and 4.8 on arm get an ICEs when inlining unmap_and_move().  Work
  * around it.
diff -puN mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard mm/vmscan.c
--- a/mm/vmscan.c~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard	2020-06-29 16:34:38.959312604 -0700
+++ b/mm/vmscan.c	2020-06-29 16:34:38.965312604 -0700
@@ -1077,6 +1077,7 @@  static unsigned long shrink_page_list(st
 	LIST_HEAD(free_pages);
 	unsigned nr_reclaimed = 0;
 	unsigned pgactivate = 0;
+	int rc;
 
 	memset(stat, 0, sizeof(*stat));
 	cond_resched();
@@ -1229,6 +1230,30 @@  static unsigned long shrink_page_list(st
 			; /* try to reclaim the page below */
 		}
 
+		rc = migrate_demote_mapping(page);
+		/*
+		 * -ENOMEM on a THP may indicate either migration is
+		 * unsupported or there was not enough contiguous
+		 * space. Split the THP into base pages and retry the
+		 * head immediately. The tail pages will be considered
+		 * individually within the current loop's page list.
+		 */
+		if (rc == -ENOMEM && PageTransHuge(page) &&
+		    !split_huge_page_to_list(page, page_list))
+			rc = migrate_demote_mapping(page);
+
+		if (rc == MIGRATEPAGE_SUCCESS) {
+			unlock_page(page);
+			if (likely(put_page_testzero(page)))
+				goto free_it;
+			/*
+			 * Speculative reference will free this page,
+			 * so leave it off the LRU.
+			 */
+			nr_reclaimed++;
+			continue;
+		}
+
 		/*
 		 * Anonymous process memory has backing store?
 		 * Try to allocate it some swap space here.