diff mbox series

[-V8,02/10] mm/numa: automatically generate node migration order

Message ID 20210618061537.434999-3-ying.huang@intel.com (mailing list archive)
State New, archived
Headers show
Series Migrate Pages in lieu of discard | expand

Commit Message

Huang, Ying June 18, 2021, 6:15 a.m. UTC
From: Dave Hansen <dave.hansen@linux.intel.com>

When memory fills up on a node, memory contents can be
automatically migrated to another node.  The biggest problems are
knowing when to migrate and to where the migration should be
targeted.

The most straightforward way to generate the "to where" list would
be to follow the page allocator fallback lists.  Those lists
already tell us if memory is full where to look next.  It would
also be logical to move memory in that order.

But, the allocator fallback lists have a fatal flaw: most nodes
appear in all the lists.  This would potentially lead to migration
cycles (A->B, B->A, A->B, ...).

Instead of using the allocator fallback lists directly, keep a
separate node migration ordering.  But, reuse the same data used
to generate page allocator fallback in the first place:
find_next_best_node().

This means that the firmware data used to populate node distances
essentially dictates the ordering for now.  It should also be
architecture-neutral since all NUMA architectures have a working
find_next_best_node().

The protocol for node_demotion[] access and writing is not
standard.  It has no specific locking and is intended to be read
locklessly.  Readers must take care to avoid observing changes
that appear incoherent.  This was done so that node_demotion[]
locking has no chance of becoming a bottleneck on large systems
with lots of CPUs in direct reclaim.

This code is unused for now.  It will be called later in the
series.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: osalvador <osalvador@suse.de>

--

Changes from 20200122:
 * Add big node_demotion[] comment
Changes from 20210302:
 * Fix typo in node_demotion[] comment
---
 mm/internal.h   |   5 ++
 mm/migrate.c    | 175 +++++++++++++++++++++++++++++++++++++++++++++++-
 mm/page_alloc.c |   2 +-
 3 files changed, 180 insertions(+), 2 deletions(-)

Comments

Zi Yan June 18, 2021, 3:14 p.m. UTC | #1
On 18 Jun 2021, at 2:15, Huang Ying wrote:

> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> When memory fills up on a node, memory contents can be
> automatically migrated to another node.  The biggest problems are
> knowing when to migrate and to where the migration should be
> targeted.
>
> The most straightforward way to generate the "to where" list would
> be to follow the page allocator fallback lists.  Those lists
> already tell us if memory is full where to look next.  It would
> also be logical to move memory in that order.
>
> But, the allocator fallback lists have a fatal flaw: most nodes
> appear in all the lists.  This would potentially lead to migration
> cycles (A->B, B->A, A->B, ...).
>
> Instead of using the allocator fallback lists directly, keep a
> separate node migration ordering.  But, reuse the same data used
> to generate page allocator fallback in the first place:
> find_next_best_node().
>
> This means that the firmware data used to populate node distances
> essentially dictates the ordering for now.  It should also be
> architecture-neutral since all NUMA architectures have a working
> find_next_best_node().
>
> The protocol for node_demotion[] access and writing is not
> standard.  It has no specific locking and is intended to be read
> locklessly.  Readers must take care to avoid observing changes
> that appear incoherent.  This was done so that node_demotion[]
> locking has no chance of becoming a bottleneck on large systems
> with lots of CPUs in direct reclaim.
>
> This code is unused for now.  It will be called later in the
> series.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Reviewed-by: Yang Shi <shy828301@gmail.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: osalvador <osalvador@suse.de>
>
> --
>
> Changes from 20200122:
>  * Add big node_demotion[] comment
> Changes from 20210302:
>  * Fix typo in node_demotion[] comment
> ---
>  mm/internal.h   |   5 ++
>  mm/migrate.c    | 175 +++++++++++++++++++++++++++++++++++++++++++++++-
>  mm/page_alloc.c |   2 +-
>  3 files changed, 180 insertions(+), 2 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 2f1182948aa6..0344cd78e170 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -522,12 +522,17 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
>
>  #ifdef CONFIG_NUMA
>  extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
> +extern int find_next_best_node(int node, nodemask_t *used_node_mask);
>  #else
>  static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
>  				unsigned int order)
>  {
>  	return NODE_RECLAIM_NOSCAN;
>  }
> +static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
> +{
> +	return NUMA_NO_NODE;
> +}
>  #endif
>
>  extern int hwpoison_filter(struct page *p);
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6cab668132f9..111f8565f75d 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1136,6 +1136,44 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>  	return rc;
>  }
>
> +
> +/*
> + * node_demotion[] example:
> + *
> + * Consider a system with two sockets.  Each socket has
> + * three classes of memory attached: fast, medium and slow.
> + * Each memory class is placed in its own NUMA node.  The
> + * CPUs are placed in the node with the "fast" memory.  The
> + * 6 NUMA nodes (0-5) might be split among the sockets like
> + * this:
> + *
> + *	Socket A: 0, 1, 2
> + *	Socket B: 3, 4, 5
> + *
> + * When Node 0 fills up, its memory should be migrated to
> + * Node 1.  When Node 1 fills up, it should be migrated to
> + * Node 2.  The migration path start on the nodes with the
> + * processors (since allocations default to this node) and
> + * fast memory, progress through medium and end with the
> + * slow memory:
> + *
> + *	0 -> 1 -> 2 -> stop
> + *	3 -> 4 -> 5 -> stop
> + *
> + * This is represented in the node_demotion[] like this:
> + *
> + *	{  1, // Node 0 migrates to 1
> + *	   2, // Node 1 migrates to 2
> + *	  -1, // Node 2 does not migrate
> + *	   4, // Node 3 migrates to 4
> + *	   5, // Node 4 migrates to 5
> + *	  -1} // Node 5 does not migrate
> + */
> +
> +/*
> + * Writes to this array occur without locking.  READ_ONCE()
> + * is recommended for readers to ensure consistent reads.
> + */
>  static int node_demotion[MAX_NUMNODES] __read_mostly =
>  	{[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
>
> @@ -1150,7 +1188,13 @@ static int node_demotion[MAX_NUMNODES] __read_mostly =
>   */
>  int next_demotion_node(int node)
>  {
> -	return node_demotion[node];
> +	/*
> +	 * node_demotion[] is updated without excluding
> +	 * this function from running.  READ_ONCE() avoids
> +	 * reading multiple, inconsistent 'node' values
> +	 * during an update.
> +	 */
> +	return READ_ONCE(node_demotion[node]);
>  }

Is it necessary to have two separate patches to add node_demotion and
next_demotion_node() then modify it immediately? Maybe merge Patch 1 into 2?

Hmm, I just checked Patch 3 and it changes node_demotion again and uses RCU.
I guess it might be much simpler to just introduce node_demotion with RCU
in this patch and Patch 3 only takes care of hotplug events.

>
>  /*
> @@ -3144,3 +3188,132 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
>  }
>  EXPORT_SYMBOL(migrate_vma_finalize);
>  #endif /* CONFIG_DEVICE_PRIVATE */
> +
> +/* Disable reclaim-based migration. */
> +static void disable_all_migrate_targets(void)
> +{
> +	int node;
> +
> +	for_each_online_node(node)
> +		node_demotion[node] = NUMA_NO_NODE;
> +}
> +
> +/*
> + * Find an automatic demotion target for 'node'.
> + * Failing here is OK.  It might just indicate
> + * being at the end of a chain.
> + */
> +static int establish_migrate_target(int node, nodemask_t *used)
> +{
> +	int migration_target;
> +
> +	/*
> +	 * Can not set a migration target on a
> +	 * node with it already set.
> +	 *
> +	 * No need for READ_ONCE() here since this
> +	 * in the write path for node_demotion[].
> +	 * This should be the only thread writing.
> +	 */
> +	if (node_demotion[node] != NUMA_NO_NODE)
> +		return NUMA_NO_NODE;
> +
> +	migration_target = find_next_best_node(node, used);
> +	if (migration_target == NUMA_NO_NODE)
> +		return NUMA_NO_NODE;
> +
> +	node_demotion[node] = migration_target;
> +
> +	return migration_target;
> +}
> +
> +/*
> + * When memory fills up on a node, memory contents can be
> + * automatically migrated to another node instead of
> + * discarded at reclaim.
> + *
> + * Establish a "migration path" which will start at nodes
> + * with CPUs and will follow the priorities used to build the
> + * page allocator zonelists.
> + *
> + * The difference here is that cycles must be avoided.  If
> + * node0 migrates to node1, then neither node1, nor anything
> + * node1 migrates to can migrate to node0.
> + *
> + * This function can run simultaneously with readers of
> + * node_demotion[].  However, it can not run simultaneously
> + * with itself.  Exclusion is provided by memory hotplug events
> + * being single-threaded.
> + */
> +static void __set_migration_target_nodes(void)
> +{
> +	nodemask_t next_pass	= NODE_MASK_NONE;
> +	nodemask_t this_pass	= NODE_MASK_NONE;
> +	nodemask_t used_targets = NODE_MASK_NONE;
> +	int node;
> +
> +	/*
> +	 * Avoid any oddities like cycles that could occur
> +	 * from changes in the topology.  This will leave
> +	 * a momentary gap when migration is disabled.
> +	 */
> +	disable_all_migrate_targets();
> +
> +	/*
> +	 * Ensure that the "disable" is visible across the system.
> +	 * Readers will see either a combination of before+disable
> +	 * state or disable+after.  They will never see before and
> +	 * after state together.
> +	 *
> +	 * The before+after state together might have cycles and
> +	 * could cause readers to do things like loop until this
> +	 * function finishes.  This ensures they can only see a
> +	 * single "bad" read and would, for instance, only loop
> +	 * once.
> +	 */
> +	smp_wmb();
> +
> +	/*
> +	 * Allocations go close to CPUs, first.  Assume that
> +	 * the migration path starts at the nodes with CPUs.
> +	 */
> +	next_pass = node_states[N_CPU];

Is there a plan of allowing user to change where the migration
path starts? Or maybe one step further providing an interface
to allow user to specify the demotion path. Something like
/sys/devices/system/node/node*/node_demotion.


> +again:
> +	this_pass = next_pass;
> +	next_pass = NODE_MASK_NONE;
> +	/*
> +	 * To avoid cycles in the migration "graph", ensure
> +	 * that migration sources are not future targets by
> +	 * setting them in 'used_targets'.  Do this only
> +	 * once per pass so that multiple source nodes can
> +	 * share a target node.
> +	 *
> +	 * 'used_targets' will become unavailable in future
> +	 * passes.  This limits some opportunities for
> +	 * multiple source nodes to share a destination.
> +	 */
> +	nodes_or(used_targets, used_targets, this_pass);
> +	for_each_node_mask(node, this_pass) {
> +		int target_node = establish_migrate_target(node, &used_targets);
> +
> +		if (target_node == NUMA_NO_NODE)
> +			continue;
> +
> +		/* Visit targets from this pass in the next pass: */
> +		node_set(target_node, next_pass);
> +	}
> +	/* Is another pass necessary? */
> +	if (!nodes_empty(next_pass))
> +		goto again;
> +}
> +
> +/*
> + * For callers that do not hold get_online_mems() already.
> + */
> +__maybe_unused // <- temporay to prevent warnings during bisects
> +static void set_migration_target_nodes(void)
> +{
> +	get_online_mems();
> +	__set_migration_target_nodes();
> +	put_online_mems();
> +}
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d1f5de1c1283..e033ae2e8bce 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5973,7 +5973,7 @@ static int node_load[MAX_NUMNODES];
>   *
>   * Return: node id of the found node or %NUMA_NO_NODE if no node is found.
>   */
> -static int find_next_best_node(int node, nodemask_t *used_node_mask)
> +int find_next_best_node(int node, nodemask_t *used_node_mask)
>  {
>  	int n, val;
>  	int min_val = INT_MAX;
> -- 
> 2.30.2


—
Best Regards,
Yan, Zi
Huang, Ying June 19, 2021, 8:18 a.m. UTC | #2
Zi Yan <ziy@nvidia.com> writes:

> On 18 Jun 2021, at 2:15, Huang Ying wrote:
>
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> When memory fills up on a node, memory contents can be
>> automatically migrated to another node.  The biggest problems are
>> knowing when to migrate and to where the migration should be
>> targeted.
>>
>> The most straightforward way to generate the "to where" list would
>> be to follow the page allocator fallback lists.  Those lists
>> already tell us if memory is full where to look next.  It would
>> also be logical to move memory in that order.
>>
>> But, the allocator fallback lists have a fatal flaw: most nodes
>> appear in all the lists.  This would potentially lead to migration
>> cycles (A->B, B->A, A->B, ...).
>>
>> Instead of using the allocator fallback lists directly, keep a
>> separate node migration ordering.  But, reuse the same data used
>> to generate page allocator fallback in the first place:
>> find_next_best_node().
>>
>> This means that the firmware data used to populate node distances
>> essentially dictates the ordering for now.  It should also be
>> architecture-neutral since all NUMA architectures have a working
>> find_next_best_node().
>>
>> The protocol for node_demotion[] access and writing is not
>> standard.  It has no specific locking and is intended to be read
>> locklessly.  Readers must take care to avoid observing changes
>> that appear incoherent.  This was done so that node_demotion[]
>> locking has no chance of becoming a bottleneck on large systems
>> with lots of CPUs in direct reclaim.
>>
>> This code is unused for now.  It will be called later in the
>> series.
>>
>> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Reviewed-by: Yang Shi <shy828301@gmail.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Wei Xu <weixugc@google.com>
>> Cc: David Rientjes <rientjes@google.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: osalvador <osalvador@suse.de>
>>
>> --
>>
>> Changes from 20200122:
>>  * Add big node_demotion[] comment
>> Changes from 20210302:
>>  * Fix typo in node_demotion[] comment
>> ---
>>  mm/internal.h   |   5 ++
>>  mm/migrate.c    | 175 +++++++++++++++++++++++++++++++++++++++++++++++-
>>  mm/page_alloc.c |   2 +-
>>  3 files changed, 180 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 2f1182948aa6..0344cd78e170 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -522,12 +522,17 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
>>
>>  #ifdef CONFIG_NUMA
>>  extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
>> +extern int find_next_best_node(int node, nodemask_t *used_node_mask);
>>  #else
>>  static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
>>  				unsigned int order)
>>  {
>>  	return NODE_RECLAIM_NOSCAN;
>>  }
>> +static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
>> +{
>> +	return NUMA_NO_NODE;
>> +}
>>  #endif
>>
>>  extern int hwpoison_filter(struct page *p);
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 6cab668132f9..111f8565f75d 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1136,6 +1136,44 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>>  	return rc;
>>  }
>>
>> +
>> +/*
>> + * node_demotion[] example:
>> + *
>> + * Consider a system with two sockets.  Each socket has
>> + * three classes of memory attached: fast, medium and slow.
>> + * Each memory class is placed in its own NUMA node.  The
>> + * CPUs are placed in the node with the "fast" memory.  The
>> + * 6 NUMA nodes (0-5) might be split among the sockets like
>> + * this:
>> + *
>> + *	Socket A: 0, 1, 2
>> + *	Socket B: 3, 4, 5
>> + *
>> + * When Node 0 fills up, its memory should be migrated to
>> + * Node 1.  When Node 1 fills up, it should be migrated to
>> + * Node 2.  The migration path start on the nodes with the
>> + * processors (since allocations default to this node) and
>> + * fast memory, progress through medium and end with the
>> + * slow memory:
>> + *
>> + *	0 -> 1 -> 2 -> stop
>> + *	3 -> 4 -> 5 -> stop
>> + *
>> + * This is represented in the node_demotion[] like this:
>> + *
>> + *	{  1, // Node 0 migrates to 1
>> + *	   2, // Node 1 migrates to 2
>> + *	  -1, // Node 2 does not migrate
>> + *	   4, // Node 3 migrates to 4
>> + *	   5, // Node 4 migrates to 5
>> + *	  -1} // Node 5 does not migrate
>> + */
>> +
>> +/*
>> + * Writes to this array occur without locking.  READ_ONCE()
>> + * is recommended for readers to ensure consistent reads.
>> + */
>>  static int node_demotion[MAX_NUMNODES] __read_mostly =
>>  	{[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
>>
>> @@ -1150,7 +1188,13 @@ static int node_demotion[MAX_NUMNODES] __read_mostly =
>>   */
>>  int next_demotion_node(int node)
>>  {
>> -	return node_demotion[node];
>> +	/*
>> +	 * node_demotion[] is updated without excluding
>> +	 * this function from running.  READ_ONCE() avoids
>> +	 * reading multiple, inconsistent 'node' values
>> +	 * during an update.
>> +	 */
>> +	return READ_ONCE(node_demotion[node]);
>>  }
>
> Is it necessary to have two separate patches to add node_demotion and
> next_demotion_node() then modify it immediately? Maybe merge Patch 1 into 2?
>
> Hmm, I just checked Patch 3 and it changes node_demotion again and uses RCU.
> I guess it might be much simpler to just introduce node_demotion with RCU
> in this patch and Patch 3 only takes care of hotplug events.

Hi, Dave,

What do you think about this?

>>
>>  /*
>> @@ -3144,3 +3188,132 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
>>  }
>>  EXPORT_SYMBOL(migrate_vma_finalize);
>>  #endif /* CONFIG_DEVICE_PRIVATE */
>> +
>> +/* Disable reclaim-based migration. */
>> +static void disable_all_migrate_targets(void)
>> +{
>> +	int node;
>> +
>> +	for_each_online_node(node)
>> +		node_demotion[node] = NUMA_NO_NODE;
>> +}
>> +
>> +/*
>> + * Find an automatic demotion target for 'node'.
>> + * Failing here is OK.  It might just indicate
>> + * being at the end of a chain.
>> + */
>> +static int establish_migrate_target(int node, nodemask_t *used)
>> +{
>> +	int migration_target;
>> +
>> +	/*
>> +	 * Can not set a migration target on a
>> +	 * node with it already set.
>> +	 *
>> +	 * No need for READ_ONCE() here since this
>> +	 * in the write path for node_demotion[].
>> +	 * This should be the only thread writing.
>> +	 */
>> +	if (node_demotion[node] != NUMA_NO_NODE)
>> +		return NUMA_NO_NODE;
>> +
>> +	migration_target = find_next_best_node(node, used);
>> +	if (migration_target == NUMA_NO_NODE)
>> +		return NUMA_NO_NODE;
>> +
>> +	node_demotion[node] = migration_target;
>> +
>> +	return migration_target;
>> +}
>> +
>> +/*
>> + * When memory fills up on a node, memory contents can be
>> + * automatically migrated to another node instead of
>> + * discarded at reclaim.
>> + *
>> + * Establish a "migration path" which will start at nodes
>> + * with CPUs and will follow the priorities used to build the
>> + * page allocator zonelists.
>> + *
>> + * The difference here is that cycles must be avoided.  If
>> + * node0 migrates to node1, then neither node1, nor anything
>> + * node1 migrates to can migrate to node0.
>> + *
>> + * This function can run simultaneously with readers of
>> + * node_demotion[].  However, it can not run simultaneously
>> + * with itself.  Exclusion is provided by memory hotplug events
>> + * being single-threaded.
>> + */
>> +static void __set_migration_target_nodes(void)
>> +{
>> +	nodemask_t next_pass	= NODE_MASK_NONE;
>> +	nodemask_t this_pass	= NODE_MASK_NONE;
>> +	nodemask_t used_targets = NODE_MASK_NONE;
>> +	int node;
>> +
>> +	/*
>> +	 * Avoid any oddities like cycles that could occur
>> +	 * from changes in the topology.  This will leave
>> +	 * a momentary gap when migration is disabled.
>> +	 */
>> +	disable_all_migrate_targets();
>> +
>> +	/*
>> +	 * Ensure that the "disable" is visible across the system.
>> +	 * Readers will see either a combination of before+disable
>> +	 * state or disable+after.  They will never see before and
>> +	 * after state together.
>> +	 *
>> +	 * The before+after state together might have cycles and
>> +	 * could cause readers to do things like loop until this
>> +	 * function finishes.  This ensures they can only see a
>> +	 * single "bad" read and would, for instance, only loop
>> +	 * once.
>> +	 */
>> +	smp_wmb();
>> +
>> +	/*
>> +	 * Allocations go close to CPUs, first.  Assume that
>> +	 * the migration path starts at the nodes with CPUs.
>> +	 */
>> +	next_pass = node_states[N_CPU];
>
> Is there a plan of allowing user to change where the migration
> path starts? Or maybe one step further providing an interface
> to allow user to specify the demotion path. Something like
> /sys/devices/system/node/node*/node_demotion.

I don't think that's necessary at least for now.  Do you know any real
world use case for this?

Best Regards,
Huang, Ying

[snip]
Zi Yan June 21, 2021, 2:50 p.m. UTC | #3
On 19 Jun 2021, at 4:18, Huang, Ying wrote:

> Zi Yan <ziy@nvidia.com> writes:
>
>> On 18 Jun 2021, at 2:15, Huang Ying wrote:
>>
>>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>>
>>> When memory fills up on a node, memory contents can be
>>> automatically migrated to another node.  The biggest problems are
>>> knowing when to migrate and to where the migration should be
>>> targeted.
>>>
>>> The most straightforward way to generate the "to where" list would
>>> be to follow the page allocator fallback lists.  Those lists
>>> already tell us if memory is full where to look next.  It would
>>> also be logical to move memory in that order.
>>>
>>> But, the allocator fallback lists have a fatal flaw: most nodes
>>> appear in all the lists.  This would potentially lead to migration
>>> cycles (A->B, B->A, A->B, ...).
>>>
>>> Instead of using the allocator fallback lists directly, keep a
>>> separate node migration ordering.  But, reuse the same data used
>>> to generate page allocator fallback in the first place:
>>> find_next_best_node().
>>>
>>> This means that the firmware data used to populate node distances
>>> essentially dictates the ordering for now.  It should also be
>>> architecture-neutral since all NUMA architectures have a working
>>> find_next_best_node().
>>>
>>> The protocol for node_demotion[] access and writing is not
>>> standard.  It has no specific locking and is intended to be read
>>> locklessly.  Readers must take care to avoid observing changes
>>> that appear incoherent.  This was done so that node_demotion[]
>>> locking has no chance of becoming a bottleneck on large systems
>>> with lots of CPUs in direct reclaim.
>>>
>>> This code is unused for now.  It will be called later in the
>>> series.
>>>
>>> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
>>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>>> Reviewed-by: Yang Shi <shy828301@gmail.com>
>>> Cc: Michal Hocko <mhocko@suse.com>
>>> Cc: Wei Xu <weixugc@google.com>
>>> Cc: David Rientjes <rientjes@google.com>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: osalvador <osalvador@suse.de>
>>>
>>> --
>>>
>>> Changes from 20200122:
>>>  * Add big node_demotion[] comment
>>> Changes from 20210302:
>>>  * Fix typo in node_demotion[] comment
>>> ---
>>>  mm/internal.h   |   5 ++
>>>  mm/migrate.c    | 175 +++++++++++++++++++++++++++++++++++++++++++++++-
>>>  mm/page_alloc.c |   2 +-
>>>  3 files changed, 180 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index 2f1182948aa6..0344cd78e170 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -522,12 +522,17 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
>>>
>>>  #ifdef CONFIG_NUMA
>>>  extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
>>> +extern int find_next_best_node(int node, nodemask_t *used_node_mask);
>>>  #else
>>>  static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
>>>  				unsigned int order)
>>>  {
>>>  	return NODE_RECLAIM_NOSCAN;
>>>  }
>>> +static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
>>> +{
>>> +	return NUMA_NO_NODE;
>>> +}
>>>  #endif
>>>
>>>  extern int hwpoison_filter(struct page *p);
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index 6cab668132f9..111f8565f75d 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1136,6 +1136,44 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>>>  	return rc;
>>>  }
>>>
>>> +
>>> +/*
>>> + * node_demotion[] example:
>>> + *
>>> + * Consider a system with two sockets.  Each socket has
>>> + * three classes of memory attached: fast, medium and slow.
>>> + * Each memory class is placed in its own NUMA node.  The
>>> + * CPUs are placed in the node with the "fast" memory.  The
>>> + * 6 NUMA nodes (0-5) might be split among the sockets like
>>> + * this:
>>> + *
>>> + *	Socket A: 0, 1, 2
>>> + *	Socket B: 3, 4, 5
>>> + *
>>> + * When Node 0 fills up, its memory should be migrated to
>>> + * Node 1.  When Node 1 fills up, it should be migrated to
>>> + * Node 2.  The migration path start on the nodes with the
>>> + * processors (since allocations default to this node) and
>>> + * fast memory, progress through medium and end with the
>>> + * slow memory:
>>> + *
>>> + *	0 -> 1 -> 2 -> stop
>>> + *	3 -> 4 -> 5 -> stop
>>> + *
>>> + * This is represented in the node_demotion[] like this:
>>> + *
>>> + *	{  1, // Node 0 migrates to 1
>>> + *	   2, // Node 1 migrates to 2
>>> + *	  -1, // Node 2 does not migrate
>>> + *	   4, // Node 3 migrates to 4
>>> + *	   5, // Node 4 migrates to 5
>>> + *	  -1} // Node 5 does not migrate
>>> + */
>>> +
>>> +/*
>>> + * Writes to this array occur without locking.  READ_ONCE()
>>> + * is recommended for readers to ensure consistent reads.
>>> + */
>>>  static int node_demotion[MAX_NUMNODES] __read_mostly =
>>>  	{[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
>>>
>>> @@ -1150,7 +1188,13 @@ static int node_demotion[MAX_NUMNODES] __read_mostly =
>>>   */
>>>  int next_demotion_node(int node)
>>>  {
>>> -	return node_demotion[node];
>>> +	/*
>>> +	 * node_demotion[] is updated without excluding
>>> +	 * this function from running.  READ_ONCE() avoids
>>> +	 * reading multiple, inconsistent 'node' values
>>> +	 * during an update.
>>> +	 */
>>> +	return READ_ONCE(node_demotion[node]);
>>>  }
>>
>> Is it necessary to have two separate patches to add node_demotion and
>> next_demotion_node() then modify it immediately? Maybe merge Patch 1 into 2?
>>
>> Hmm, I just checked Patch 3 and it changes node_demotion again and uses RCU.
>> I guess it might be much simpler to just introduce node_demotion with RCU
>> in this patch and Patch 3 only takes care of hotplug events.
>
> Hi, Dave,
>
> What do you think about this?
>
>>>
>>>  /*
>>> @@ -3144,3 +3188,132 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
>>>  }
>>>  EXPORT_SYMBOL(migrate_vma_finalize);
>>>  #endif /* CONFIG_DEVICE_PRIVATE */
>>> +
>>> +/* Disable reclaim-based migration. */
>>> +static void disable_all_migrate_targets(void)
>>> +{
>>> +	int node;
>>> +
>>> +	for_each_online_node(node)
>>> +		node_demotion[node] = NUMA_NO_NODE;
>>> +}
>>> +
>>> +/*
>>> + * Find an automatic demotion target for 'node'.
>>> + * Failing here is OK.  It might just indicate
>>> + * being at the end of a chain.
>>> + */
>>> +static int establish_migrate_target(int node, nodemask_t *used)
>>> +{
>>> +	int migration_target;
>>> +
>>> +	/*
>>> +	 * Can not set a migration target on a
>>> +	 * node with it already set.
>>> +	 *
>>> +	 * No need for READ_ONCE() here since this
>>> +	 * in the write path for node_demotion[].
>>> +	 * This should be the only thread writing.
>>> +	 */
>>> +	if (node_demotion[node] != NUMA_NO_NODE)
>>> +		return NUMA_NO_NODE;
>>> +
>>> +	migration_target = find_next_best_node(node, used);
>>> +	if (migration_target == NUMA_NO_NODE)
>>> +		return NUMA_NO_NODE;
>>> +
>>> +	node_demotion[node] = migration_target;
>>> +
>>> +	return migration_target;
>>> +}
>>> +
>>> +/*
>>> + * When memory fills up on a node, memory contents can be
>>> + * automatically migrated to another node instead of
>>> + * discarded at reclaim.
>>> + *
>>> + * Establish a "migration path" which will start at nodes
>>> + * with CPUs and will follow the priorities used to build the
>>> + * page allocator zonelists.
>>> + *
>>> + * The difference here is that cycles must be avoided.  If
>>> + * node0 migrates to node1, then neither node1, nor anything
>>> + * node1 migrates to can migrate to node0.
>>> + *
>>> + * This function can run simultaneously with readers of
>>> + * node_demotion[].  However, it can not run simultaneously
>>> + * with itself.  Exclusion is provided by memory hotplug events
>>> + * being single-threaded.
>>> + */
>>> +static void __set_migration_target_nodes(void)
>>> +{
>>> +	nodemask_t next_pass	= NODE_MASK_NONE;
>>> +	nodemask_t this_pass	= NODE_MASK_NONE;
>>> +	nodemask_t used_targets = NODE_MASK_NONE;
>>> +	int node;
>>> +
>>> +	/*
>>> +	 * Avoid any oddities like cycles that could occur
>>> +	 * from changes in the topology.  This will leave
>>> +	 * a momentary gap when migration is disabled.
>>> +	 */
>>> +	disable_all_migrate_targets();
>>> +
>>> +	/*
>>> +	 * Ensure that the "disable" is visible across the system.
>>> +	 * Readers will see either a combination of before+disable
>>> +	 * state or disable+after.  They will never see before and
>>> +	 * after state together.
>>> +	 *
>>> +	 * The before+after state together might have cycles and
>>> +	 * could cause readers to do things like loop until this
>>> +	 * function finishes.  This ensures they can only see a
>>> +	 * single "bad" read and would, for instance, only loop
>>> +	 * once.
>>> +	 */
>>> +	smp_wmb();
>>> +
>>> +	/*
>>> +	 * Allocations go close to CPUs, first.  Assume that
>>> +	 * the migration path starts at the nodes with CPUs.
>>> +	 */
>>> +	next_pass = node_states[N_CPU];
>>
>> Is there a plan of allowing user to change where the migration
>> path starts? Or maybe one step further providing an interface
>> to allow user to specify the demotion path. Something like
>> /sys/devices/system/node/node*/node_demotion.
>
> I don't think that's necessary at least for now.  Do you know any real
> world use case for this?

In our P9+volta system, GPU memory is exposed as a NUMA node.
For the GPU workloads with data size greater than GPU memory size,
it will be very helpful to allow pages in GPU memory to be migrated/demoted
to CPU memory. With your current assumption, GPU memory -> CPU memory
demotion seems not possible, right? This should also apply to any
system with a device memory exposed as a NUMA node and workloads running
on the device and using CPU memory as a lower tier memory than the device
memory.


—
Best Regards,
Yan, Zi
Yang Shi June 21, 2021, 7:51 p.m. UTC | #4
On Sat, Jun 19, 2021 at 1:19 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Zi Yan <ziy@nvidia.com> writes:
>
> > On 18 Jun 2021, at 2:15, Huang Ying wrote:
> >
> >> From: Dave Hansen <dave.hansen@linux.intel.com>
> >>
> >> When memory fills up on a node, memory contents can be
> >> automatically migrated to another node.  The biggest problems are
> >> knowing when to migrate and to where the migration should be
> >> targeted.
> >>
> >> The most straightforward way to generate the "to where" list would
> >> be to follow the page allocator fallback lists.  Those lists
> >> already tell us if memory is full where to look next.  It would
> >> also be logical to move memory in that order.
> >>
> >> But, the allocator fallback lists have a fatal flaw: most nodes
> >> appear in all the lists.  This would potentially lead to migration
> >> cycles (A->B, B->A, A->B, ...).
> >>
> >> Instead of using the allocator fallback lists directly, keep a
> >> separate node migration ordering.  But, reuse the same data used
> >> to generate page allocator fallback in the first place:
> >> find_next_best_node().
> >>
> >> This means that the firmware data used to populate node distances
> >> essentially dictates the ordering for now.  It should also be
> >> architecture-neutral since all NUMA architectures have a working
> >> find_next_best_node().
> >>
> >> The protocol for node_demotion[] access and writing is not
> >> standard.  It has no specific locking and is intended to be read
> >> locklessly.  Readers must take care to avoid observing changes
> >> that appear incoherent.  This was done so that node_demotion[]
> >> locking has no chance of becoming a bottleneck on large systems
> >> with lots of CPUs in direct reclaim.
> >>
> >> This code is unused for now.  It will be called later in the
> >> series.
> >>
> >> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> >> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> >> Reviewed-by: Yang Shi <shy828301@gmail.com>
> >> Cc: Michal Hocko <mhocko@suse.com>
> >> Cc: Wei Xu <weixugc@google.com>
> >> Cc: David Rientjes <rientjes@google.com>
> >> Cc: Dan Williams <dan.j.williams@intel.com>
> >> Cc: David Hildenbrand <david@redhat.com>
> >> Cc: osalvador <osalvador@suse.de>
> >>
> >> --
> >>
> >> Changes from 20200122:
> >>  * Add big node_demotion[] comment
> >> Changes from 20210302:
> >>  * Fix typo in node_demotion[] comment
> >> ---
> >>  mm/internal.h   |   5 ++
> >>  mm/migrate.c    | 175 +++++++++++++++++++++++++++++++++++++++++++++++-
> >>  mm/page_alloc.c |   2 +-
> >>  3 files changed, 180 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/mm/internal.h b/mm/internal.h
> >> index 2f1182948aa6..0344cd78e170 100644
> >> --- a/mm/internal.h
> >> +++ b/mm/internal.h
> >> @@ -522,12 +522,17 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
> >>
> >>  #ifdef CONFIG_NUMA
> >>  extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
> >> +extern int find_next_best_node(int node, nodemask_t *used_node_mask);
> >>  #else
> >>  static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
> >>                              unsigned int order)
> >>  {
> >>      return NODE_RECLAIM_NOSCAN;
> >>  }
> >> +static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
> >> +{
> >> +    return NUMA_NO_NODE;
> >> +}
> >>  #endif
> >>
> >>  extern int hwpoison_filter(struct page *p);
> >> diff --git a/mm/migrate.c b/mm/migrate.c
> >> index 6cab668132f9..111f8565f75d 100644
> >> --- a/mm/migrate.c
> >> +++ b/mm/migrate.c
> >> @@ -1136,6 +1136,44 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> >>      return rc;
> >>  }
> >>
> >> +
> >> +/*
> >> + * node_demotion[] example:
> >> + *
> >> + * Consider a system with two sockets.  Each socket has
> >> + * three classes of memory attached: fast, medium and slow.
> >> + * Each memory class is placed in its own NUMA node.  The
> >> + * CPUs are placed in the node with the "fast" memory.  The
> >> + * 6 NUMA nodes (0-5) might be split among the sockets like
> >> + * this:
> >> + *
> >> + *  Socket A: 0, 1, 2
> >> + *  Socket B: 3, 4, 5
> >> + *
> >> + * When Node 0 fills up, its memory should be migrated to
> >> + * Node 1.  When Node 1 fills up, it should be migrated to
> >> + * Node 2.  The migration path start on the nodes with the
> >> + * processors (since allocations default to this node) and
> >> + * fast memory, progress through medium and end with the
> >> + * slow memory:
> >> + *
> >> + *  0 -> 1 -> 2 -> stop
> >> + *  3 -> 4 -> 5 -> stop
> >> + *
> >> + * This is represented in the node_demotion[] like this:
> >> + *
> >> + *  {  1, // Node 0 migrates to 1
> >> + *     2, // Node 1 migrates to 2
> >> + *    -1, // Node 2 does not migrate
> >> + *     4, // Node 3 migrates to 4
> >> + *     5, // Node 4 migrates to 5
> >> + *    -1} // Node 5 does not migrate
> >> + */
> >> +
> >> +/*
> >> + * Writes to this array occur without locking.  READ_ONCE()
> >> + * is recommended for readers to ensure consistent reads.
> >> + */
> >>  static int node_demotion[MAX_NUMNODES] __read_mostly =
> >>      {[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
> >>
> >> @@ -1150,7 +1188,13 @@ static int node_demotion[MAX_NUMNODES] __read_mostly =
> >>   */
> >>  int next_demotion_node(int node)
> >>  {
> >> -    return node_demotion[node];
> >> +    /*
> >> +     * node_demotion[] is updated without excluding
> >> +     * this function from running.  READ_ONCE() avoids
> >> +     * reading multiple, inconsistent 'node' values
> >> +     * during an update.
> >> +     */
> >> +    return READ_ONCE(node_demotion[node]);
> >>  }
> >
> > Is it necessary to have two separate patches to add node_demotion and
> > next_demotion_node() then modify it immediately? Maybe merge Patch 1 into 2?
> >
> > Hmm, I just checked Patch 3 and it changes node_demotion again and uses RCU.
> > I guess it might be much simpler to just introduce node_demotion with RCU
> > in this patch and Patch 3 only takes care of hotplug events.
>
> Hi, Dave,
>
> What do you think about this?

Squashing patch #1 and #2 had been mentioned in the previous review
and it seems Dave agreed.
https://lore.kernel.org/linux-mm/4573cb9a-31ca-3b3d-96bc-5d94876b9709@intel.com/

>
> >>
> >>  /*
> >> @@ -3144,3 +3188,132 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
> >>  }
> >>  EXPORT_SYMBOL(migrate_vma_finalize);
> >>  #endif /* CONFIG_DEVICE_PRIVATE */
> >> +
> >> +/* Disable reclaim-based migration. */
> >> +static void disable_all_migrate_targets(void)
> >> +{
> >> +    int node;
> >> +
> >> +    for_each_online_node(node)
> >> +            node_demotion[node] = NUMA_NO_NODE;
> >> +}
> >> +
> >> +/*
> >> + * Find an automatic demotion target for 'node'.
> >> + * Failing here is OK.  It might just indicate
> >> + * being at the end of a chain.
> >> + */
> >> +static int establish_migrate_target(int node, nodemask_t *used)
> >> +{
> >> +    int migration_target;
> >> +
> >> +    /*
> >> +     * Can not set a migration target on a
> >> +     * node with it already set.
> >> +     *
> >> +     * No need for READ_ONCE() here since this
> >> +     * in the write path for node_demotion[].
> >> +     * This should be the only thread writing.
> >> +     */
> >> +    if (node_demotion[node] != NUMA_NO_NODE)
> >> +            return NUMA_NO_NODE;
> >> +
> >> +    migration_target = find_next_best_node(node, used);
> >> +    if (migration_target == NUMA_NO_NODE)
> >> +            return NUMA_NO_NODE;
> >> +
> >> +    node_demotion[node] = migration_target;
> >> +
> >> +    return migration_target;
> >> +}
> >> +
> >> +/*
> >> + * When memory fills up on a node, memory contents can be
> >> + * automatically migrated to another node instead of
> >> + * discarded at reclaim.
> >> + *
> >> + * Establish a "migration path" which will start at nodes
> >> + * with CPUs and will follow the priorities used to build the
> >> + * page allocator zonelists.
> >> + *
> >> + * The difference here is that cycles must be avoided.  If
> >> + * node0 migrates to node1, then neither node1, nor anything
> >> + * node1 migrates to can migrate to node0.
> >> + *
> >> + * This function can run simultaneously with readers of
> >> + * node_demotion[].  However, it can not run simultaneously
> >> + * with itself.  Exclusion is provided by memory hotplug events
> >> + * being single-threaded.
> >> + */
> >> +static void __set_migration_target_nodes(void)
> >> +{
> >> +    nodemask_t next_pass    = NODE_MASK_NONE;
> >> +    nodemask_t this_pass    = NODE_MASK_NONE;
> >> +    nodemask_t used_targets = NODE_MASK_NONE;
> >> +    int node;
> >> +
> >> +    /*
> >> +     * Avoid any oddities like cycles that could occur
> >> +     * from changes in the topology.  This will leave
> >> +     * a momentary gap when migration is disabled.
> >> +     */
> >> +    disable_all_migrate_targets();
> >> +
> >> +    /*
> >> +     * Ensure that the "disable" is visible across the system.
> >> +     * Readers will see either a combination of before+disable
> >> +     * state or disable+after.  They will never see before and
> >> +     * after state together.
> >> +     *
> >> +     * The before+after state together might have cycles and
> >> +     * could cause readers to do things like loop until this
> >> +     * function finishes.  This ensures they can only see a
> >> +     * single "bad" read and would, for instance, only loop
> >> +     * once.
> >> +     */
> >> +    smp_wmb();
> >> +
> >> +    /*
> >> +     * Allocations go close to CPUs, first.  Assume that
> >> +     * the migration path starts at the nodes with CPUs.
> >> +     */
> >> +    next_pass = node_states[N_CPU];
> >
> > Is there a plan of allowing user to change where the migration
> > path starts? Or maybe one step further providing an interface
> > to allow user to specify the demotion path. Something like
> > /sys/devices/system/node/node*/node_demotion.
>
> I don't think that's necessary at least for now.  Do you know any real
> world use case for this?
>
> Best Regards,
> Huang, Ying
>
> [snip]
Dave Hansen June 21, 2021, 7:53 p.m. UTC | #5
On 6/19/21 1:18 AM, Huang, Ying wrote:
>>>  int next_demotion_node(int node)
>>>  {
>>> -	return node_demotion[node];
>>> +	/*
>>> +	 * node_demotion[] is updated without excluding
>>> +	 * this function from running.  READ_ONCE() avoids
>>> +	 * reading multiple, inconsistent 'node' values
>>> +	 * during an update.
>>> +	 */
>>> +	return READ_ONCE(node_demotion[node]);
>>>  }
>> Is it necessary to have two separate patches to add node_demotion and
>> next_demotion_node() then modify it immediately? Maybe merge Patch 1 into 2?
>>
>> Hmm, I just checked Patch 3 and it changes node_demotion again and uses RCU.
>> I guess it might be much simpler to just introduce node_demotion with RCU
>> in this patch and Patch 3 only takes care of hotplug events.
> Hi, Dave,
> 
> What do you think about this?
> 

Squashing them seems like a good idea to me.
Huang, Ying June 22, 2021, 12:54 a.m. UTC | #6
Dave Hansen <dave.hansen@intel.com> writes:

> On 6/19/21 1:18 AM, Huang, Ying wrote:
>>>>  int next_demotion_node(int node)
>>>>  {
>>>> -	return node_demotion[node];
>>>> +	/*
>>>> +	 * node_demotion[] is updated without excluding
>>>> +	 * this function from running.  READ_ONCE() avoids
>>>> +	 * reading multiple, inconsistent 'node' values
>>>> +	 * during an update.
>>>> +	 */
>>>> +	return READ_ONCE(node_demotion[node]);
>>>>  }
>>> Is it necessary to have two separate patches to add node_demotion and
>>> next_demotion_node() then modify it immediately? Maybe merge Patch 1 into 2?
>>>
>>> Hmm, I just checked Patch 3 and it changes node_demotion again and uses RCU.
>>> I guess it might be much simpler to just introduce node_demotion with RCU
>>> in this patch and Patch 3 only takes care of hotplug events.
>> Hi, Dave,
>> 
>> What do you think about this?
>> 
>
> Squashing them seems like a good idea to me.

Sure.  Will do that.  How about move RCU from 3/10 to the squashed one?

Best Regards,
Huang, Ying
Huang, Ying June 22, 2021, 12:55 a.m. UTC | #7
Yang Shi <shy828301@gmail.com> writes:

> On Sat, Jun 19, 2021 at 1:19 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Zi Yan <ziy@nvidia.com> writes:
>>
>> > On 18 Jun 2021, at 2:15, Huang Ying wrote:
>> >
>> >> From: Dave Hansen <dave.hansen@linux.intel.com>
>> >>
>> >> When memory fills up on a node, memory contents can be
>> >> automatically migrated to another node.  The biggest problems are
>> >> knowing when to migrate and to where the migration should be
>> >> targeted.
>> >>
>> >> The most straightforward way to generate the "to where" list would
>> >> be to follow the page allocator fallback lists.  Those lists
>> >> already tell us if memory is full where to look next.  It would
>> >> also be logical to move memory in that order.
>> >>
>> >> But, the allocator fallback lists have a fatal flaw: most nodes
>> >> appear in all the lists.  This would potentially lead to migration
>> >> cycles (A->B, B->A, A->B, ...).
>> >>
>> >> Instead of using the allocator fallback lists directly, keep a
>> >> separate node migration ordering.  But, reuse the same data used
>> >> to generate page allocator fallback in the first place:
>> >> find_next_best_node().
>> >>
>> >> This means that the firmware data used to populate node distances
>> >> essentially dictates the ordering for now.  It should also be
>> >> architecture-neutral since all NUMA architectures have a working
>> >> find_next_best_node().
>> >>
>> >> The protocol for node_demotion[] access and writing is not
>> >> standard.  It has no specific locking and is intended to be read
>> >> locklessly.  Readers must take care to avoid observing changes
>> >> that appear incoherent.  This was done so that node_demotion[]
>> >> locking has no chance of becoming a bottleneck on large systems
>> >> with lots of CPUs in direct reclaim.
>> >>
>> >> This code is unused for now.  It will be called later in the
>> >> series.
>> >>
>> >> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
>> >> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> >> Reviewed-by: Yang Shi <shy828301@gmail.com>
>> >> Cc: Michal Hocko <mhocko@suse.com>
>> >> Cc: Wei Xu <weixugc@google.com>
>> >> Cc: David Rientjes <rientjes@google.com>
>> >> Cc: Dan Williams <dan.j.williams@intel.com>
>> >> Cc: David Hildenbrand <david@redhat.com>
>> >> Cc: osalvador <osalvador@suse.de>
>> >>
>> >> --
>> >>
>> >> Changes from 20200122:
>> >>  * Add big node_demotion[] comment
>> >> Changes from 20210302:
>> >>  * Fix typo in node_demotion[] comment
>> >> ---
>> >>  mm/internal.h   |   5 ++
>> >>  mm/migrate.c    | 175 +++++++++++++++++++++++++++++++++++++++++++++++-
>> >>  mm/page_alloc.c |   2 +-
>> >>  3 files changed, 180 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/mm/internal.h b/mm/internal.h
>> >> index 2f1182948aa6..0344cd78e170 100644
>> >> --- a/mm/internal.h
>> >> +++ b/mm/internal.h
>> >> @@ -522,12 +522,17 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
>> >>
>> >>  #ifdef CONFIG_NUMA
>> >>  extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
>> >> +extern int find_next_best_node(int node, nodemask_t *used_node_mask);
>> >>  #else
>> >>  static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
>> >>                              unsigned int order)
>> >>  {
>> >>      return NODE_RECLAIM_NOSCAN;
>> >>  }
>> >> +static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
>> >> +{
>> >> +    return NUMA_NO_NODE;
>> >> +}
>> >>  #endif
>> >>
>> >>  extern int hwpoison_filter(struct page *p);
>> >> diff --git a/mm/migrate.c b/mm/migrate.c
>> >> index 6cab668132f9..111f8565f75d 100644
>> >> --- a/mm/migrate.c
>> >> +++ b/mm/migrate.c
>> >> @@ -1136,6 +1136,44 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>> >>      return rc;
>> >>  }
>> >>
>> >> +
>> >> +/*
>> >> + * node_demotion[] example:
>> >> + *
>> >> + * Consider a system with two sockets.  Each socket has
>> >> + * three classes of memory attached: fast, medium and slow.
>> >> + * Each memory class is placed in its own NUMA node.  The
>> >> + * CPUs are placed in the node with the "fast" memory.  The
>> >> + * 6 NUMA nodes (0-5) might be split among the sockets like
>> >> + * this:
>> >> + *
>> >> + *  Socket A: 0, 1, 2
>> >> + *  Socket B: 3, 4, 5
>> >> + *
>> >> + * When Node 0 fills up, its memory should be migrated to
>> >> + * Node 1.  When Node 1 fills up, it should be migrated to
>> >> + * Node 2.  The migration path start on the nodes with the
>> >> + * processors (since allocations default to this node) and
>> >> + * fast memory, progress through medium and end with the
>> >> + * slow memory:
>> >> + *
>> >> + *  0 -> 1 -> 2 -> stop
>> >> + *  3 -> 4 -> 5 -> stop
>> >> + *
>> >> + * This is represented in the node_demotion[] like this:
>> >> + *
>> >> + *  {  1, // Node 0 migrates to 1
>> >> + *     2, // Node 1 migrates to 2
>> >> + *    -1, // Node 2 does not migrate
>> >> + *     4, // Node 3 migrates to 4
>> >> + *     5, // Node 4 migrates to 5
>> >> + *    -1} // Node 5 does not migrate
>> >> + */
>> >> +
>> >> +/*
>> >> + * Writes to this array occur without locking.  READ_ONCE()
>> >> + * is recommended for readers to ensure consistent reads.
>> >> + */
>> >>  static int node_demotion[MAX_NUMNODES] __read_mostly =
>> >>      {[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
>> >>
>> >> @@ -1150,7 +1188,13 @@ static int node_demotion[MAX_NUMNODES] __read_mostly =
>> >>   */
>> >>  int next_demotion_node(int node)
>> >>  {
>> >> -    return node_demotion[node];
>> >> +    /*
>> >> +     * node_demotion[] is updated without excluding
>> >> +     * this function from running.  READ_ONCE() avoids
>> >> +     * reading multiple, inconsistent 'node' values
>> >> +     * during an update.
>> >> +     */
>> >> +    return READ_ONCE(node_demotion[node]);
>> >>  }
>> >
>> > Is it necessary to have two separate patches to add node_demotion and
>> > next_demotion_node() then modify it immediately? Maybe merge Patch 1 into 2?
>> >
>> > Hmm, I just checked Patch 3 and it changes node_demotion again and uses RCU.
>> > I guess it might be much simpler to just introduce node_demotion with RCU
>> > in this patch and Patch 3 only takes care of hotplug events.
>>
>> Hi, Dave,
>>
>> What do you think about this?
>
> Squashing patch #1 and #2 had been mentioned in the previous review
> and it seems Dave agreed.
> https://lore.kernel.org/linux-mm/4573cb9a-31ca-3b3d-96bc-5d94876b9709@intel.com/

Thanks a lot for your information!

Best Regards,
Huang, Ying

[snip]
Huang, Ying June 22, 2021, 1:14 a.m. UTC | #8
Zi Yan <ziy@nvidia.com> writes:

> On 19 Jun 2021, at 4:18, Huang, Ying wrote:
>
>> Zi Yan <ziy@nvidia.com> writes:
>>
>>> On 18 Jun 2021, at 2:15, Huang Ying wrote:

[snip]

>>>> +/*
>>>> + * When memory fills up on a node, memory contents can be
>>>> + * automatically migrated to another node instead of
>>>> + * discarded at reclaim.
>>>> + *
>>>> + * Establish a "migration path" which will start at nodes
>>>> + * with CPUs and will follow the priorities used to build the
>>>> + * page allocator zonelists.
>>>> + *
>>>> + * The difference here is that cycles must be avoided.  If
>>>> + * node0 migrates to node1, then neither node1, nor anything
>>>> + * node1 migrates to can migrate to node0.
>>>> + *
>>>> + * This function can run simultaneously with readers of
>>>> + * node_demotion[].  However, it can not run simultaneously
>>>> + * with itself.  Exclusion is provided by memory hotplug events
>>>> + * being single-threaded.
>>>> + */
>>>> +static void __set_migration_target_nodes(void)
>>>> +{
>>>> +	nodemask_t next_pass	= NODE_MASK_NONE;
>>>> +	nodemask_t this_pass	= NODE_MASK_NONE;
>>>> +	nodemask_t used_targets = NODE_MASK_NONE;
>>>> +	int node;
>>>> +
>>>> +	/*
>>>> +	 * Avoid any oddities like cycles that could occur
>>>> +	 * from changes in the topology.  This will leave
>>>> +	 * a momentary gap when migration is disabled.
>>>> +	 */
>>>> +	disable_all_migrate_targets();
>>>> +
>>>> +	/*
>>>> +	 * Ensure that the "disable" is visible across the system.
>>>> +	 * Readers will see either a combination of before+disable
>>>> +	 * state or disable+after.  They will never see before and
>>>> +	 * after state together.
>>>> +	 *
>>>> +	 * The before+after state together might have cycles and
>>>> +	 * could cause readers to do things like loop until this
>>>> +	 * function finishes.  This ensures they can only see a
>>>> +	 * single "bad" read and would, for instance, only loop
>>>> +	 * once.
>>>> +	 */
>>>> +	smp_wmb();
>>>> +
>>>> +	/*
>>>> +	 * Allocations go close to CPUs, first.  Assume that
>>>> +	 * the migration path starts at the nodes with CPUs.
>>>> +	 */
>>>> +	next_pass = node_states[N_CPU];
>>>
>>> Is there a plan of allowing user to change where the migration
>>> path starts? Or maybe one step further providing an interface
>>> to allow user to specify the demotion path. Something like
>>> /sys/devices/system/node/node*/node_demotion.
>>
>> I don't think that's necessary at least for now.  Do you know any real
>> world use case for this?
>
> In our P9+volta system, GPU memory is exposed as a NUMA node.
> For the GPU workloads with data size greater than GPU memory size,
> it will be very helpful to allow pages in GPU memory to be migrated/demoted
> to CPU memory. With your current assumption, GPU memory -> CPU memory
> demotion seems not possible, right? This should also apply to any
> system with a device memory exposed as a NUMA node and workloads running
> on the device and using CPU memory as a lower tier memory than the device
> memory.

Thanks a lot for your use case!  It appears that the demotion path
specified by users is one possible way to satisfy your requirement.  And
I think it's possible to enable that on top of this patchset.  But we
still have no specific plan to work on that at least for now.

Best Regards,
Huang, Ying
Dave Hansen June 22, 2021, 12:06 p.m. UTC | #9
Yan, your reply came through in HTML.  It doesn't bother me too much,
but you'll find your replies dropped by LKML and other mailing lists
if you do this.

On 6/21/21 7:50 AM, Zi Yan wrote:
> Is there a plan of allowing user to change where the migration path
> starts? Or maybe one step further providing an interface to allow
> user to specify the demotion path. Something like
> /sys/devices/system/node/node*/node_demotion.

We actually had this in an earlier series.  I pulled it out because we
don't really *need* this ABI at the moment.  But, I totally agree that
it would be handy for many things, including any non-obvious topology
where the built-in ordering isn't optimal.

> I don't think that's necessary at least for now. Do you know any
> real world use case for this?
>
> In our P9+volta system, GPU memory is exposed as a NUMA node. For
> the GPU workloads with data size greater than GPU memory size, it
> will be very helpful to allow pages in GPU memory to be
> migrated/demoted to CPU memory. With your current assumption, GPU
> memory -> CPU memory demotion seems not possible, right? This
> should also apply to any system with a device memory exposed as a
> NUMA node and workloads running on the device and using CPU memory
> as a lower tier memory than the device memory.

Yes, with the current ordering, CPU memory would be demoted to the
GPU, not the other way around.  The right way to fix this (on ACPI
platforms at least) is probably to use the HMAT table and build the
demotion based on any memory targets rather than just CPUs.

That would be a great future enhancement to all of this.  But, because
not all systems have HMATs, we also need something more basic, which
is what is in this series.
Dave Hansen June 22, 2021, 12:13 p.m. UTC | #10
On 6/21/21 6:14 PM, Huang, Ying wrote:
>> In our P9+volta system, GPU memory is exposed as a NUMA node.
>> For the GPU workloads with data size greater than GPU memory size,
>> it will be very helpful to allow pages in GPU memory to be migrated/demoted
>> to CPU memory. With your current assumption, GPU memory -> CPU memory
>> demotion seems not possible, right? This should also apply to any
>> system with a device memory exposed as a NUMA node and workloads running
>> on the device and using CPU memory as a lower tier memory than the device
>> memory.
> Thanks a lot for your use case!  It appears that the demotion path
> specified by users is one possible way to satisfy your requirement.  And
> I think it's possible to enable that on top of this patchset.  But we
> still have no specific plan to work on that at least for now.

In other words, patches to make adapt this to your use case would be
most welcome!
Zi Yan June 22, 2021, 12:48 p.m. UTC | #11
On 22 Jun 2021, at 8:06, Dave Hansen wrote:

> Yan, your reply came through in HTML.  It doesn't bother me too much,
> but you'll find your replies dropped by LKML and other mailing lists
> if you do this.

Apologies. I used the wrong text mode. Thanks for letting me know.

>
> On 6/21/21 7:50 AM, Zi Yan wrote:
>> Is there a plan of allowing user to change where the migration path
>> starts? Or maybe one step further providing an interface to allow
>> user to specify the demotion path. Something like
>> /sys/devices/system/node/node*/node_demotion.
>
> We actually had this in an earlier series.  I pulled it out because we
> don't really *need* this ABI at the moment.  But, I totally agree that
> it would be handy for many things, including any non-obvious topology
> where the built-in ordering isn't optimal.
>
>> I don't think that's necessary at least for now. Do you know any
>> real world use case for this?
>>
>> In our P9+volta system, GPU memory is exposed as a NUMA node. For
>> the GPU workloads with data size greater than GPU memory size, it
>> will be very helpful to allow pages in GPU memory to be
>> migrated/demoted to CPU memory. With your current assumption, GPU
>> memory -> CPU memory demotion seems not possible, right? This
>> should also apply to any system with a device memory exposed as a
>> NUMA node and workloads running on the device and using CPU memory
>> as a lower tier memory than the device memory.
>
> Yes, with the current ordering, CPU memory would be demoted to the
> GPU, not the other way around.  The right way to fix this (on ACPI
> platforms at least) is probably to use the HMAT table and build the
> demotion based on any memory targets rather than just CPUs.
>
> That would be a great future enhancement to all of this.  But, because
> not all systems have HMATs, we also need something more basic, which
> is what is in this series.

This information is very helpful. I agree that reading HMAT table is
the right way. I will look into it. Thanks!


—
Best Regards,
Yan, Zi
diff mbox series

Patch

diff --git a/mm/internal.h b/mm/internal.h
index 2f1182948aa6..0344cd78e170 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -522,12 +522,17 @@  static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
 
 #ifdef CONFIG_NUMA
 extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
+extern int find_next_best_node(int node, nodemask_t *used_node_mask);
 #else
 static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
 				unsigned int order)
 {
 	return NODE_RECLAIM_NOSCAN;
 }
+static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
+{
+	return NUMA_NO_NODE;
+}
 #endif
 
 extern int hwpoison_filter(struct page *p);
diff --git a/mm/migrate.c b/mm/migrate.c
index 6cab668132f9..111f8565f75d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1136,6 +1136,44 @@  static int __unmap_and_move(struct page *page, struct page *newpage,
 	return rc;
 }
 
+
+/*
+ * node_demotion[] example:
+ *
+ * Consider a system with two sockets.  Each socket has
+ * three classes of memory attached: fast, medium and slow.
+ * Each memory class is placed in its own NUMA node.  The
+ * CPUs are placed in the node with the "fast" memory.  The
+ * 6 NUMA nodes (0-5) might be split among the sockets like
+ * this:
+ *
+ *	Socket A: 0, 1, 2
+ *	Socket B: 3, 4, 5
+ *
+ * When Node 0 fills up, its memory should be migrated to
+ * Node 1.  When Node 1 fills up, it should be migrated to
+ * Node 2.  The migration path start on the nodes with the
+ * processors (since allocations default to this node) and
+ * fast memory, progress through medium and end with the
+ * slow memory:
+ *
+ *	0 -> 1 -> 2 -> stop
+ *	3 -> 4 -> 5 -> stop
+ *
+ * This is represented in the node_demotion[] like this:
+ *
+ *	{  1, // Node 0 migrates to 1
+ *	   2, // Node 1 migrates to 2
+ *	  -1, // Node 2 does not migrate
+ *	   4, // Node 3 migrates to 4
+ *	   5, // Node 4 migrates to 5
+ *	  -1} // Node 5 does not migrate
+ */
+
+/*
+ * Writes to this array occur without locking.  READ_ONCE()
+ * is recommended for readers to ensure consistent reads.
+ */
 static int node_demotion[MAX_NUMNODES] __read_mostly =
 	{[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
 
@@ -1150,7 +1188,13 @@  static int node_demotion[MAX_NUMNODES] __read_mostly =
  */
 int next_demotion_node(int node)
 {
-	return node_demotion[node];
+	/*
+	 * node_demotion[] is updated without excluding
+	 * this function from running.  READ_ONCE() avoids
+	 * reading multiple, inconsistent 'node' values
+	 * during an update.
+	 */
+	return READ_ONCE(node_demotion[node]);
 }
 
 /*
@@ -3144,3 +3188,132 @@  void migrate_vma_finalize(struct migrate_vma *migrate)
 }
 EXPORT_SYMBOL(migrate_vma_finalize);
 #endif /* CONFIG_DEVICE_PRIVATE */
+
+/* Disable reclaim-based migration. */
+static void disable_all_migrate_targets(void)
+{
+	int node;
+
+	for_each_online_node(node)
+		node_demotion[node] = NUMA_NO_NODE;
+}
+
+/*
+ * Find an automatic demotion target for 'node'.
+ * Failing here is OK.  It might just indicate
+ * being at the end of a chain.
+ */
+static int establish_migrate_target(int node, nodemask_t *used)
+{
+	int migration_target;
+
+	/*
+	 * Can not set a migration target on a
+	 * node with it already set.
+	 *
+	 * No need for READ_ONCE() here since this
+	 * in the write path for node_demotion[].
+	 * This should be the only thread writing.
+	 */
+	if (node_demotion[node] != NUMA_NO_NODE)
+		return NUMA_NO_NODE;
+
+	migration_target = find_next_best_node(node, used);
+	if (migration_target == NUMA_NO_NODE)
+		return NUMA_NO_NODE;
+
+	node_demotion[node] = migration_target;
+
+	return migration_target;
+}
+
+/*
+ * When memory fills up on a node, memory contents can be
+ * automatically migrated to another node instead of
+ * discarded at reclaim.
+ *
+ * Establish a "migration path" which will start at nodes
+ * with CPUs and will follow the priorities used to build the
+ * page allocator zonelists.
+ *
+ * The difference here is that cycles must be avoided.  If
+ * node0 migrates to node1, then neither node1, nor anything
+ * node1 migrates to can migrate to node0.
+ *
+ * This function can run simultaneously with readers of
+ * node_demotion[].  However, it can not run simultaneously
+ * with itself.  Exclusion is provided by memory hotplug events
+ * being single-threaded.
+ */
+static void __set_migration_target_nodes(void)
+{
+	nodemask_t next_pass	= NODE_MASK_NONE;
+	nodemask_t this_pass	= NODE_MASK_NONE;
+	nodemask_t used_targets = NODE_MASK_NONE;
+	int node;
+
+	/*
+	 * Avoid any oddities like cycles that could occur
+	 * from changes in the topology.  This will leave
+	 * a momentary gap when migration is disabled.
+	 */
+	disable_all_migrate_targets();
+
+	/*
+	 * Ensure that the "disable" is visible across the system.
+	 * Readers will see either a combination of before+disable
+	 * state or disable+after.  They will never see before and
+	 * after state together.
+	 *
+	 * The before+after state together might have cycles and
+	 * could cause readers to do things like loop until this
+	 * function finishes.  This ensures they can only see a
+	 * single "bad" read and would, for instance, only loop
+	 * once.
+	 */
+	smp_wmb();
+
+	/*
+	 * Allocations go close to CPUs, first.  Assume that
+	 * the migration path starts at the nodes with CPUs.
+	 */
+	next_pass = node_states[N_CPU];
+again:
+	this_pass = next_pass;
+	next_pass = NODE_MASK_NONE;
+	/*
+	 * To avoid cycles in the migration "graph", ensure
+	 * that migration sources are not future targets by
+	 * setting them in 'used_targets'.  Do this only
+	 * once per pass so that multiple source nodes can
+	 * share a target node.
+	 *
+	 * 'used_targets' will become unavailable in future
+	 * passes.  This limits some opportunities for
+	 * multiple source nodes to share a destination.
+	 */
+	nodes_or(used_targets, used_targets, this_pass);
+	for_each_node_mask(node, this_pass) {
+		int target_node = establish_migrate_target(node, &used_targets);
+
+		if (target_node == NUMA_NO_NODE)
+			continue;
+
+		/* Visit targets from this pass in the next pass: */
+		node_set(target_node, next_pass);
+	}
+	/* Is another pass necessary? */
+	if (!nodes_empty(next_pass))
+		goto again;
+}
+
+/*
+ * For callers that do not hold get_online_mems() already.
+ */
+__maybe_unused // <- temporay to prevent warnings during bisects
+static void set_migration_target_nodes(void)
+{
+	get_online_mems();
+	__set_migration_target_nodes();
+	put_online_mems();
+}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d1f5de1c1283..e033ae2e8bce 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5973,7 +5973,7 @@  static int node_load[MAX_NUMNODES];
  *
  * Return: node id of the found node or %NUMA_NO_NODE if no node is found.
  */
-static int find_next_best_node(int node, nodemask_t *used_node_mask)
+int find_next_best_node(int node, nodemask_t *used_node_mask)
 {
 	int n, val;
 	int min_val = INT_MAX;