
mm: migrate: set demotion targets differently

Message ID 20220329115222.8923-1-jvgediya@linux.ibm.com (mailing list archive)
State New
Series mm: migrate: set demotion targets differently

Commit Message

Jagdish Gediya March 29, 2022, 11:52 a.m. UTC
The current implementation for identifying demotion
targets limits some of the opportunities to share
demotion targets between multiple source nodes.

Implement logic that detects cycles in the demotion
graph so that every possible demotion path can be
used. Don't share a single 'used targets' mask across
all the nodes; instead, build the used targets from
scratch for each individual node, based on which
nodes already use this node as a demotion target.
This helps to share demotion targets without missing
any possible demotion path.
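
In condensed form, the new per-node loop looks like this
(target selection elided; see the full hunk in the patch
for the exact code):

	for_each_online_node(node) {
		int tmp_node;

		best_distance = -1;
		used_targets = NODE_MASK_NONE;

		/* never demote to self or to a CPU node */
		node_set(node, used_targets);
		nodes_or(used_targets, used_targets, node_states[N_CPU]);

		/*
		 * Exclude every node that (directly or transitively)
		 * already demotes into this node; this is what breaks
		 * cycles in the demotion graph.
		 */
		for_each_node_mask(tmp_node, src_nodes[node])
			nodes_or(used_targets, used_targets,
				 src_nodes[tmp_node]);
		nodes_or(src_nodes[node], src_nodes[node], used_targets);

		/*
		 * Then pick the best-distance targets exactly as before,
		 * recording "node demotes into target" in
		 * src_nodes[target_node] for the later iterations.
		 */
		...
	}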

e.g. with the below NUMA topology, where node 0 & 1 are
cpu + dram nodes, node 2 & 3 are equally slow memory
only nodes, and node 4 is the slowest memory only node,

available: 5 nodes (0-4)
node 0 cpus: 0 1
node 0 size: n MB
node 0 free: n MB
node 1 cpus: 2 3
node 1 size: n MB
node 1 free: n MB
node 2 cpus:
node 2 size: n MB
node 2 free: n MB
node 3 cpus:
node 3 size: n MB
node 3 free: n MB
node 4 cpus:
node 4 size: n MB
node 4 free: n MB
node distances:
node   0   1   2   3   4
  0:  10  20  40  40  80
  1:  20  10  40  40  80
  2:  40  40  10  40  80
  3:  40  40  40  10  80
  4:  80  80  80  80  10

The existing implementation gives below demotion targets,

node    demotion_target
 0              3, 2
 1              4
 2              X
 3              X
 4              X

With this patch applied, below are the demotion targets,

node    demotion_target
 0              3, 2
 1              3, 2
 2              3
 3              4
 4              X
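
Walking the new per-node logic over this topology (nodes
visited in ascending node-id order, as for_each_online_node()
does), the per-node used_targets masks and picks are:

node 0: used_targets = {0,1}        -> targets 3, 2 (distance 40)
node 1: used_targets = {0,1}        -> targets 3, 2
node 2: used_targets = {0,1,2}      -> target 3 (40 beats 80 to node 4)
node 3: used_targets = {0,1,2,3}    -> target 4
node 4: used_targets = {0,1,2,3,4}  -> no target

so nodes 2 and 3 still get demotion targets instead of being
left with only swap.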

e.g. with the below NUMA topology, where node 0, 1 & 2 are
cpu + dram nodes and node 3 is a slow memory node,

available: 4 nodes (0-3)
node 0 cpus: 0 1
node 0 size: n MB
node 0 free: n MB
node 1 cpus: 2 3
node 1 size: n MB
node 1 free: n MB
node 2 cpus: 4 5
node 2 size: n MB
node 2 free: n MB
node 3 cpus:
node 3 size: n MB
node 3 free: n MB
node distances:
node   0   1   2   3
  0:  10  20  20  40
  1:  20  10  20  40
  2:  20  20  10  40
  3:  40  40  40  10

The existing implementation gives below demotion targets,

node    demotion_target
 0              3
 1              X
 2              X
 3              X

With this patch applied, below are the demotion targets,

node    demotion_target
 0              3
 1              3
 2              3
 3              X

with the below NUMA topology, where node 0 & 2 are cpu + dram
nodes and node 1 & 3 are slow memory nodes,

available: 4 nodes (0-3)
node 0 cpus: 0 1
node 0 size: n MB
node 0 free: n MB
node 1 cpus:
node 1 size: n MB
node 1 free: n MB
node 2 cpus: 2 3
node 2 size: n MB
node 2 free: n MB
node 3 cpus:
node 3 size: n MB
node 3 free: n MB
node distances:
node   0   1   2   3
  0:  10  40  20  80
  1:  40  10  80  80
  2:  20  80  10  40
  3:  80  80  40  10

The existing implementation gives below demotion targets,

node    demotion_target
 0              1
 1              X
 2              3
 3              X

With this patch applied, below are the demotion targets,

node    demotion_target
 0              1
 1              3
 2              3
 3              X

As can be seen above, node 3 can be a demotion target
for node 1, but the existing implementation doesn't
configure it that way. It is better to move pages from
node 1 to node 3 than from node 1 out to swap.

Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/migrate.c | 75 ++++++++++++++++++++++++++++------------------------
 1 file changed, 41 insertions(+), 34 deletions(-)

Comments

Baolin Wang March 29, 2022, 12:26 p.m. UTC | #1
Hi Jagdish,

On 3/29/2022 7:52 PM, Jagdish Gediya wrote:
> The current implementation to identify the demotion
> targets limits some of the opportunities to share
> the demotion targets between multiple source nodes.
> 
> Implement a logic to identify the loop in the demotion
> targets such that all the possibilities of demotion can
> be utilized. Don't share the used targets between all
> the nodes, instead create the used targets from scratch
> for each individual node based on for what all node this
> node is a demotion target. This helps to share the demotion
> targets without missing any possible way of demotion.
> 
> e.g. with below NUMA topology, where node 0 & 1 are
> cpu + dram nodes, node 2 & 3 are equally slower memory
> only nodes, and node 4 is slowest memory only node,
> 
> available: 5 nodes (0-4)
> node 0 cpus: 0 1
> node 0 size: n MB
> node 0 free: n MB
> node 1 cpus: 2 3
> node 1 size: n MB
> node 1 free: n MB
> node 2 cpus:
> node 2 size: n MB
> node 2 free: n MB
> node 3 cpus:
> node 3 size: n MB
> node 3 free: n MB
> node 4 cpus:
> node 4 size: n MB
> node 4 free: n MB
> node distances:
> node   0   1   2   3   4
>    0:  10  20  40  40  80
>    1:  20  10  40  40  80
>    2:  40  40  10  40  80
>    3:  40  40  40  10  80
>    4:  80  80  80  80  10
> 
> The existing implementation gives below demotion targets,
> 
> node    demotion_target
>   0              3, 2
>   1              4
>   2              X
>   3              X
>   4		X
> 
> With this patch applied, below are the demotion targets,
> 
> node    demotion_target
>   0              3, 2
>   1              3, 2
>   2              3
>   3              4
>   4		X

Node 2 and node 3 are both slow memory and have the same distance, so 
why should node 2 demote cold memory to node 3? They should both have 
the same demotion target, node 4, which is the slowest memory node, right?

> 
> e.g. with below NUMA topology, where node 0, 1 & 2 are
> cpu + dram nodes and node 3 is slow memory node,
> 
> available: 4 nodes (0-3)
> node 0 cpus: 0 1
> node 0 size: n MB
> node 0 free: n MB
> node 1 cpus: 2 3
> node 1 size: n MB
> node 1 free: n MB
> node 2 cpus: 4 5
> node 2 size: n MB
> node 2 free: n MB
> node 3 cpus:
> node 3 size: n MB
> node 3 free: n MB
> node distances:
> node   0   1   2   3
>    0:  10  20  20  40
>    1:  20  10  20  40
>    2:  20  20  10  40
>    3:  40  40  40  10
> 
> The existing implementation gives below demotion targets,
> 
> node    demotion_target
>   0              3
>   1              X
>   2              X
>   3              X
> 
> With this patch applied, below are the demotion targets,
> 
> node    demotion_target
>   0              3
>   1              3
>   2              3
>   3              X

Sounds reasonable.

> 
> with below NUMA topology, where node 0 & 2 are cpu + dram
> nodes and node 1 & 3 are slow memory nodes,
> 
> available: 4 nodes (0-3)
> node 0 cpus: 0 1
> node 0 size: n MB
> node 0 free: n MB
> node 1 cpus:
> node 1 size: n MB
> node 1 free: n MB
> node 2 cpus: 2 3
> node 2 size: n MB
> node 2 free: n MB
> node 3 cpus:
> node 3 size: n MB
> node 3 free: n MB
> node distances:
> node   0   1   2   3
>    0:  10  40  20  80
>    1:  40  10  80  80
>    2:  20  80  10  40
>    3:  80  80  40  10
> 
> The existing implementation gives below demotion targets,
> 
> node    demotion_target
>   0              3
>   1              X
>   2              3
>   3              X

If I understand correctly, this is not true. The demotion route should 
be as below with the existing implementation:
node 0 ---> node 1
node 1 ---> X
node 2 ---> node 3
node 3 ---> X

> 
> With this patch applied, below are the demotion targets,
> 
> node    demotion_target
>   0              1
>   1              3
>   2              3
>   3              X
> 
> As it can be seen above, node 3 can be demotion target for node
> 1 but existing implementation doesn't configure it that way. It
> is better to move pages from node 1 to node 3 instead of moving
> it from node 1 to swap.

Which means node 3 is the slowest memory node?

> 
> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>   mm/migrate.c | 75 ++++++++++++++++++++++++++++------------------------
>   1 file changed, 41 insertions(+), 34 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 3d60823afd2d..7ec8d934e706 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2381,10 +2381,13 @@ static int establish_migrate_target(int node, nodemask_t *used,
>    */
>   static void __set_migration_target_nodes(void)
>   {
> -	nodemask_t next_pass	= NODE_MASK_NONE;
> -	nodemask_t this_pass	= NODE_MASK_NONE;
>   	nodemask_t used_targets = NODE_MASK_NONE;
>   	int node, best_distance;
> +	nodemask_t *src_nodes;
> +
> +	src_nodes = kcalloc(nr_node_ids, sizeof(nodemask_t), GFP_KERNEL);
> +	if (!src_nodes)
> +		return;
>   
>   	/*
>   	 * Avoid any oddities like cycles that could occur
> @@ -2393,29 +2396,39 @@ static void __set_migration_target_nodes(void)
>   	 */
>   	disable_all_migrate_targets();
>   
> -	/*
> -	 * Allocations go close to CPUs, first.  Assume that
> -	 * the migration path starts at the nodes with CPUs.
> -	 */
> -	next_pass = node_states[N_CPU];
> -again:
> -	this_pass = next_pass;
> -	next_pass = NODE_MASK_NONE;
> -	/*
> -	 * To avoid cycles in the migration "graph", ensure
> -	 * that migration sources are not future targets by
> -	 * setting them in 'used_targets'.  Do this only
> -	 * once per pass so that multiple source nodes can
> -	 * share a target node.
> -	 *
> -	 * 'used_targets' will become unavailable in future
> -	 * passes.  This limits some opportunities for
> -	 * multiple source nodes to share a destination.
> -	 */
> -	nodes_or(used_targets, used_targets, this_pass);
> +	for_each_online_node(node) {
> +		int tmp_node;
>   
> -	for_each_node_mask(node, this_pass) {
>   		best_distance = -1;
> +		used_targets = NODE_MASK_NONE;
> +
> +		/*
> +		 * Avoid adding same node as the demotion target.
> +		 */
> +		node_set(node, used_targets);
> +
> +		/*
> +		 * Add CPU NUMA nodes to the used target list so that it
> +		 * won't be considered a demotion target.
> +		 */
> +		nodes_or(used_targets, used_targets, node_states[N_CPU]);
> +
> +		/*
> +		 * Add all nodes that has appeared as source node of demotion
> +		 * for this target node.
> +		 *
> +		 * To avoid cycles in the migration "graph", ensure
> +		 * that migration sources are not future targets by
> +		 * setting them in 'used_targets'.
> +		 */
> +		for_each_node_mask(tmp_node, src_nodes[node])
> +			nodes_or(used_targets, used_targets, src_nodes[tmp_node]);
> +
> +		/*
> +		 * Now update the demotion src nodes with other nodes in graph
> +		 * which got computed above.
> +		 */
> +		nodes_or(src_nodes[node], src_nodes[node], used_targets);
>   
>   		/*
>   		 * Try to set up the migration path for the node, and the target
> @@ -2434,20 +2447,14 @@ static void __set_migration_target_nodes(void)
>   				best_distance = node_distance(node, target_node);
>   
>   			/*
> -			 * Visit targets from this pass in the next pass.
> -			 * Eventually, every node will have been part of
> -			 * a pass, and will become set in 'used_targets'.
> +			 * Add this node in the src_nodes list so that we can
> +			 * detect the looping.
>   			 */
> -			node_set(target_node, next_pass);
> +			node_set(node, src_nodes[target_node]);
>   		} while (1);
>   	}
> -	/*
> -	 * 'next_pass' contains nodes which became migration
> -	 * targets in this pass.  Make additional passes until
> -	 * no more migrations targets are available.
> -	 */
> -	if (!nodes_empty(next_pass))
> -		goto again;
> +
> +	kfree(src_nodes);
>   }
>   
>   /*
Jagdish Gediya March 29, 2022, 2:04 p.m. UTC | #2
On Tue, Mar 29, 2022 at 08:26:05PM +0800, Baolin Wang wrote:
Hi Baolin,
> Hi Jagdish,
> 
> On 3/29/2022 7:52 PM, Jagdish Gediya wrote:
> > The current implementation to identify the demotion
> > targets limits some of the opportunities to share
> > the demotion targets between multiple source nodes.
> > 
> > Implement a logic to identify the loop in the demotion
> > targets such that all the possibilities of demotion can
> > be utilized. Don't share the used targets between all
> > the nodes, instead create the used targets from scratch
> > for each individual node based on for what all node this
> > node is a demotion target. This helps to share the demotion
> > targets without missing any possible way of demotion.
> > 
> > e.g. with below NUMA topology, where node 0 & 1 are
> > cpu + dram nodes, node 2 & 3 are equally slower memory
> > only nodes, and node 4 is slowest memory only node,
> > 
> > available: 5 nodes (0-4)
> > node 0 cpus: 0 1
> > node 0 size: n MB
> > node 0 free: n MB
> > node 1 cpus: 2 3
> > node 1 size: n MB
> > node 1 free: n MB
> > node 2 cpus:
> > node 2 size: n MB
> > node 2 free: n MB
> > node 3 cpus:
> > node 3 size: n MB
> > node 3 free: n MB
> > node 4 cpus:
> > node 4 size: n MB
> > node 4 free: n MB
> > node distances:
> > node   0   1   2   3   4
> >    0:  10  20  40  40  80
> >    1:  20  10  40  40  80
> >    2:  40  40  10  40  80
> >    3:  40  40  40  10  80
> >    4:  80  80  80  80  10
> > 
> > The existing implementation gives below demotion targets,
> > 
> > node    demotion_target
> >   0              3, 2
> >   1              4
> >   2              X
> >   3              X
> >   4		X
> > 
> > With this patch applied, below are the demotion targets,
> > 
> > node    demotion_target
> >   0              3, 2
> >   1              3, 2
> >   2              3
> >   3              4
> >   4		X
> 
> Node 2 and node 3 both are slow memory and have same distance, why node 2
> should demote cold memory to node 3? They should have the same target
> demotion node 4, which is the slowest memory node, right?
> 
The current demotion target finding algorithm works based on best distance; since the distance between node 2 & 3 is 40 and the distance between node 2 & 4 is 80, node 2 demotes to node 3.
> > 
> > e.g. with below NUMA topology, where node 0, 1 & 2 are
> > cpu + dram nodes and node 3 is slow memory node,
> > 
> > available: 4 nodes (0-3)
> > node 0 cpus: 0 1
> > node 0 size: n MB
> > node 0 free: n MB
> > node 1 cpus: 2 3
> > node 1 size: n MB
> > node 1 free: n MB
> > node 2 cpus: 4 5
> > node 2 size: n MB
> > node 2 free: n MB
> > node 3 cpus:
> > node 3 size: n MB
> > node 3 free: n MB
> > node distances:
> > node   0   1   2   3
> >    0:  10  20  20  40
> >    1:  20  10  20  40
> >    2:  20  20  10  40
> >    3:  40  40  40  10
> > 
> > The existing implementation gives below demotion targets,
> > 
> > node    demotion_target
> >   0              3
> >   1              X
> >   2              X
> >   3              X
> > 
> > With this patch applied, below are the demotion targets,
> > 
> > node    demotion_target
> >   0              3
> >   1              3
> >   2              3
> >   3              X
> 
> Sounds reasonable.
> 
> > 
> > with below NUMA topology, where node 0 & 2 are cpu + dram
> > nodes and node 1 & 3 are slow memory nodes,
> > 
> > available: 4 nodes (0-3)
> > node 0 cpus: 0 1
> > node 0 size: n MB
> > node 0 free: n MB
> > node 1 cpus:
> > node 1 size: n MB
> > node 1 free: n MB
> > node 2 cpus: 2 3
> > node 2 size: n MB
> > node 2 free: n MB
> > node 3 cpus:
> > node 3 size: n MB
> > node 3 free: n MB
> > node distances:
> > node   0   1   2   3
> >    0:  10  40  20  80
> >    1:  40  10  80  80
> >    2:  20  80  10  40
> >    3:  80  80  40  10
> > 
> > The existing implementation gives below demotion targets,
> > 
> > node    demotion_target
> >   0              3
> >   1              X
> >   2              3
> >   3              X
> 
> If I understand correctly, this is not true. The demotion route should be as
> below with existing implementation:
> node 0 ---> node 1
> node 1 ---> X
> node 2 ---> node 3
> node 3 ---> X
> 
It's a typo, it should be 0 -> 1. Will correct it in v2.
> > 
> > With this patch applied, below are the demotion targets,
> > 
> > node    demotion_target
> >   0              1
> >   1              3
> >   2              3
> >   3              X
> > 
> > As it can be seen above, node 3 can be demotion target for node
> > 1 but existing implementation doesn't configure it that way. It
> > is better to move pages from node 1 to node 3 instead of moving
> > it from node 1 to swap.
> 
> Which means node 3 is the slowest memory node?
>
Node 1 and 3 are equally slow, but 1 is near node 0 and 3 is near node 2. Basically, you can think of node 1 as a slow memory logical node near node 0 and node 3 as a slow memory logical node near node 2.
> > 
> > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > ---
> >   mm/migrate.c | 75 ++++++++++++++++++++++++++++------------------------
> >   1 file changed, 41 insertions(+), 34 deletions(-)
> > 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 3d60823afd2d..7ec8d934e706 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -2381,10 +2381,13 @@ static int establish_migrate_target(int node, nodemask_t *used,
> >    */
> >   static void __set_migration_target_nodes(void)
> >   {
> > -	nodemask_t next_pass	= NODE_MASK_NONE;
> > -	nodemask_t this_pass	= NODE_MASK_NONE;
> >   	nodemask_t used_targets = NODE_MASK_NONE;
> >   	int node, best_distance;
> > +	nodemask_t *src_nodes;
> > +
> > +	src_nodes = kcalloc(nr_node_ids, sizeof(nodemask_t), GFP_KERNEL);
> > +	if (!src_nodes)
> > +		return;
> >   	/*
> >   	 * Avoid any oddities like cycles that could occur
> > @@ -2393,29 +2396,39 @@ static void __set_migration_target_nodes(void)
> >   	 */
> >   	disable_all_migrate_targets();
> > -	/*
> > -	 * Allocations go close to CPUs, first.  Assume that
> > -	 * the migration path starts at the nodes with CPUs.
> > -	 */
> > -	next_pass = node_states[N_CPU];
> > -again:
> > -	this_pass = next_pass;
> > -	next_pass = NODE_MASK_NONE;
> > -	/*
> > -	 * To avoid cycles in the migration "graph", ensure
> > -	 * that migration sources are not future targets by
> > -	 * setting them in 'used_targets'.  Do this only
> > -	 * once per pass so that multiple source nodes can
> > -	 * share a target node.
> > -	 *
> > -	 * 'used_targets' will become unavailable in future
> > -	 * passes.  This limits some opportunities for
> > -	 * multiple source nodes to share a destination.
> > -	 */
> > -	nodes_or(used_targets, used_targets, this_pass);
> > +	for_each_online_node(node) {
> > +		int tmp_node;
> > -	for_each_node_mask(node, this_pass) {
> >   		best_distance = -1;
> > +		used_targets = NODE_MASK_NONE;
> > +
> > +		/*
> > +		 * Avoid adding same node as the demotion target.
> > +		 */
> > +		node_set(node, used_targets);
> > +
> > +		/*
> > +		 * Add CPU NUMA nodes to the used target list so that it
> > +		 * won't be considered a demotion target.
> > +		 */
> > +		nodes_or(used_targets, used_targets, node_states[N_CPU]);
> > +
> > +		/*
> > +		 * Add all nodes that has appeared as source node of demotion
> > +		 * for this target node.
> > +		 *
> > +		 * To avoid cycles in the migration "graph", ensure
> > +		 * that migration sources are not future targets by
> > +		 * setting them in 'used_targets'.
> > +		 */
> > +		for_each_node_mask(tmp_node, src_nodes[node])
> > +			nodes_or(used_targets, used_targets, src_nodes[tmp_node]);
> > +
> > +		/*
> > +		 * Now update the demotion src nodes with other nodes in graph
> > +		 * which got computed above.
> > +		 */
> > +		nodes_or(src_nodes[node], src_nodes[node], used_targets);
> >   		/*
> >   		 * Try to set up the migration path for the node, and the target
> > @@ -2434,20 +2447,14 @@ static void __set_migration_target_nodes(void)
> >   				best_distance = node_distance(node, target_node);
> >   			/*
> > -			 * Visit targets from this pass in the next pass.
> > -			 * Eventually, every node will have been part of
> > -			 * a pass, and will become set in 'used_targets'.
> > +			 * Add this node in the src_nodes list so that we can
> > +			 * detect the looping.
> >   			 */
> > -			node_set(target_node, next_pass);
> > +			node_set(node, src_nodes[target_node]);
> >   		} while (1);
> >   	}
> > -	/*
> > -	 * 'next_pass' contains nodes which became migration
> > -	 * targets in this pass.  Make additional passes until
> > -	 * no more migrations targets are available.
> > -	 */
> > -	if (!nodes_empty(next_pass))
> > -		goto again;
> > +
> > +	kfree(src_nodes);
> >   }
> >   /*
Dave Hansen March 29, 2022, 2:31 p.m. UTC | #3
On 3/29/22 04:52, Jagdish Gediya wrote:
> The current implementation to identify the demotion
> targets limits some of the opportunities to share
> the demotion targets between multiple source nodes.

This changelog is a bit unsatisfying.  It basically says: the current
code isn't working, throw some more code at the problem.

I'd love to see some more information about *why* the current code
doesn't work.  Is it purely a bug or was it mis-designed?

I actually wrote it intending for it to handle cases like you describe
while not leaving lots of nodes without demotion targets.
Jagdish Gediya March 29, 2022, 4:46 p.m. UTC | #4
On Tue, Mar 29, 2022 at 07:31:12AM -0700, Dave Hansen wrote:
> On 3/29/22 04:52, Jagdish Gediya wrote:
> > The current implementation to identify the demotion
> > targets limits some of the opportunities to share
> > the demotion targets between multiple source nodes.
> 
> This changelog is a bit unsatisfying.  It basically says: the current
> code isn't working, throw some more code at the problem.

The current code works, but it doesn't utilize all the possibilities
for demotion. One of the comments in the current code says that
'used_targets will become unavailable in future passes. This limits
some opportunities for multiple source nodes to share a destination.'
This patch removes that limitation. The limitation comes from a single
used_targets nodemask being shared between all the nodes. This patch
instead prepares used_targets from scratch for each individual node
and then avoids loops based on which demotion targets are not possible
for that particular node, using the data we accumulate in the newly
defined src_nodes array.

For example, with below numa topology where node 0,1 & 2 are cpu + dram
node and node 3 is slow memory node,
node   0   1   2   3
  0:  10  20  20  40
  1:  20  10  20  40
  2:  20  20  10  40
  3:  40  40  40  10

once node 3 is used as a target for node 0, it cannot be shared with
nodes 1 & 2, because in the current code node 3 is already set in
used_targets when we go on to find demotion targets for 1 & 2.
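
Roughly, the existing code flows like this for that topology:

  pass 1: this_pass = {0,1,2} (the N_CPU nodes), used_targets = {0,1,2}
          node 0: picks node 3, and node 3 gets set in used_targets
          node 1: nothing left outside used_targets -> no target
          node 2: nothing left outside used_targets -> no target
  pass 2: this_pass = {3}, used_targets = {0,1,2,3}
          node 3: no target

so only the first CPU node visited gets to use the slow memory node.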

> I'd love to see some more information about *why* the current code
> doesn't work.  Is it purely a bug or was it mis-designed?

I think the behaviour is intended, because as per the comment in the
current code, it was known that there are limits on sharing a node as
a demotion target.

> I actually wrote it intending for it to handle cases like you describe
> while not leaving lots of nodes without demotion targets.

If you see the examples I have given in the commit message, the
current code misses many opportunities for demotion, so the current
way of avoiding loops is not the best, as explained above. That is
where this patch is required.
Dave Hansen March 29, 2022, 10:40 p.m. UTC | #5
On 3/29/22 09:46, Jagdish Gediya wrote:
>> I'd love to see some more information about *why* the current code
>> doesn't work.  Is it purely a bug or was it mis-designed?
> I think the design seems to be intended because as per the comment
> in the current code, it was known that there are limits to the node
> sharing as a demotion target.

Ahh, that makes sense.

You've got some slow memory that's not really associated closely enough
with an initiator node to be used exclusively for demotions from that
node.  The original code was really intended for demotions when the
system is made up of relatively identical building blocks, like one CPU
with two kinds of memory attached.

I think what you're doing here makes sense.  The only thing I'll say is
that, at some point, we need to stop messing with the demotion order
that the kernel comes up with.  It's not going to be able to handle
*everything* perfectly.  We're probably not at the point that we should
stop enhancing the kernel-generated demotion order, but let's at least
keep in mind that we should stop eventually.
Baolin Wang March 30, 2022, 6:37 a.m. UTC | #6
On 3/29/2022 10:04 PM, Jagdish Gediya wrote:
> On Tue, Mar 29, 2022 at 08:26:05PM +0800, Baolin Wang wrote:
> Hi Baolin,
>> Hi Jagdish,
>>
>> On 3/29/2022 7:52 PM, Jagdish Gediya wrote:
>>> The current implementation to identify the demotion
>>> targets limits some of the opportunities to share
>>> the demotion targets between multiple source nodes.
>>>
>>> Implement a logic to identify the loop in the demotion
>>> targets such that all the possibilities of demotion can
>>> be utilized. Don't share the used targets between all
>>> the nodes, instead create the used targets from scratch
>>> for each individual node based on for what all node this
>>> node is a demotion target. This helps to share the demotion
>>> targets without missing any possible way of demotion.
>>>
>>> e.g. with below NUMA topology, where node 0 & 1 are
>>> cpu + dram nodes, node 2 & 3 are equally slower memory
>>> only nodes, and node 4 is slowest memory only node,
>>>
>>> available: 5 nodes (0-4)
>>> node 0 cpus: 0 1
>>> node 0 size: n MB
>>> node 0 free: n MB
>>> node 1 cpus: 2 3
>>> node 1 size: n MB
>>> node 1 free: n MB
>>> node 2 cpus:
>>> node 2 size: n MB
>>> node 2 free: n MB
>>> node 3 cpus:
>>> node 3 size: n MB
>>> node 3 free: n MB
>>> node 4 cpus:
>>> node 4 size: n MB
>>> node 4 free: n MB
>>> node distances:
>>> node   0   1   2   3   4
>>>     0:  10  20  40  40  80
>>>     1:  20  10  40  40  80
>>>     2:  40  40  10  40  80
>>>     3:  40  40  40  10  80
>>>     4:  80  80  80  80  10
>>>
>>> The existing implementation gives below demotion targets,
>>>
>>> node    demotion_target
>>>    0              3, 2
>>>    1              4
>>>    2              X
>>>    3              X
>>>    4		X
>>>
>>> With this patch applied, below are the demotion targets,
>>>
>>> node    demotion_target
>>>    0              3, 2
>>>    1              3, 2
>>>    2              3
>>>    3              4
>>>    4		X
>>
>> Node 2 and node 3 both are slow memory and have same distance, why node 2
>> should demote cold memory to node 3? They should have the same target
>> demotion node 4, which is the slowest memory node, right?
>>
> Current demotion target finding algorithm works based on best distance, as distance between node 2 & 3 is 40 and distance between node 2 & 4 is 80, node 2 demotes to node 3.

If node 2 can demote to node 3, that means node 3's memory is colder 
than node 2's, right? The access time of node 3 should be larger than 
that of node 2, so that we can demote colder memory from node 2 to node 3.

But node 2 and node 3 are the same memory type and have the same 
distance, so the access times of node 2 and node 3 should be the same 
too; why add so much page migration between node 2 and node 3? I'm 
still not sure about the benefits.

Huang Ying and Dave, what do you think about these demotion targets?

>>>
>>> e.g. with below NUMA topology, where node 0, 1 & 2 are
>>> cpu + dram nodes and node 3 is slow memory node,
>>>
>>> available: 4 nodes (0-3)
>>> node 0 cpus: 0 1
>>> node 0 size: n MB
>>> node 0 free: n MB
>>> node 1 cpus: 2 3
>>> node 1 size: n MB
>>> node 1 free: n MB
>>> node 2 cpus: 4 5
>>> node 2 size: n MB
>>> node 2 free: n MB
>>> node 3 cpus:
>>> node 3 size: n MB
>>> node 3 free: n MB
>>> node distances:
>>> node   0   1   2   3
>>>     0:  10  20  20  40
>>>     1:  20  10  20  40
>>>     2:  20  20  10  40
>>>     3:  40  40  40  10
>>>
>>> The existing implementation gives below demotion targets,
>>>
>>> node    demotion_target
>>>    0              3
>>>    1              X
>>>    2              X
>>>    3              X
>>>
>>> With this patch applied, below are the demotion targets,
>>>
>>> node    demotion_target
>>>    0              3
>>>    1              3
>>>    2              3
>>>    3              X
>>
>> Sounds reasonable.
>>
>>>
>>> with below NUMA topology, where node 0 & 2 are cpu + dram
>>> nodes and node 1 & 3 are slow memory nodes,
>>>
>>> available: 4 nodes (0-3)
>>> node 0 cpus: 0 1
>>> node 0 size: n MB
>>> node 0 free: n MB
>>> node 1 cpus:
>>> node 1 size: n MB
>>> node 1 free: n MB
>>> node 2 cpus: 2 3
>>> node 2 size: n MB
>>> node 2 free: n MB
>>> node 3 cpus:
>>> node 3 size: n MB
>>> node 3 free: n MB
>>> node distances:
>>> node   0   1   2   3
>>>     0:  10  40  20  80
>>>     1:  40  10  80  80
>>>     2:  20  80  10  40
>>>     3:  80  80  40  10
>>>
>>> The existing implementation gives below demotion targets,
>>>
>>> node    demotion_target
>>>    0              3
>>>    1              X
>>>    2              3
>>>    3              X
>>
>> If I understand correctly, this is not true. The demotion route should be as
>> below with existing implementation:
>> node 0 ---> node 1
>> node 1 ---> X
>> node 2 ---> node 3
>> node 3 ---> X
>>
> Its typo, It should be 0 -> 1, Will correct it in v2.
>>>
>>> With this patch applied, below are the demotion targets,
>>>
>>> node    demotion_target
>>>    0              1
>>>    1              3
>>>    2              3
>>>    3              X
>>>
>>> As it can be seen above, node 3 can be demotion target for node
>>> 1 but existing implementation doesn't configure it that way. It
>>> is better to move pages from node 1 to node 3 instead of moving
>>> it from node 1 to swap.
>>
>> Which means node 3 is the slowest memory node?
>>
> Node 1 and 3 are equally slower but 1 is near to 0 and 3 is near to 2. Basically you can think of it like node 1 is slow memory logical node near to node 0 and node 3 is slow memory logical node near to node 2.

OK.
Huang, Ying March 30, 2022, 6:46 a.m. UTC | #7
Hi, Jagdish,

Jagdish Gediya <jvgediya@linux.ibm.com> writes:

> The current implementation to identify the demotion
> targets limits some of the opportunities to share
> the demotion targets between multiple source nodes.

Yes.  It sounds reasonable to share demotion targets among multiple
source nodes.

One question: are the example machines below real hardware now or in
the near future?  Or do you just think they are possible?

And, before going into the implementation details, I think that we can
discuss the perfect demotion order first.

> Implement a logic to identify the loop in the demotion
> targets such that all the possibilities of demotion can
> be utilized. Don't share the used targets between all
> the nodes, instead create the used targets from scratch
> for each individual node based on for what all node this
> node is a demotion target. This helps to share the demotion
> targets without missing any possible way of demotion.
>
> e.g. with below NUMA topology, where node 0 & 1 are
> cpu + dram nodes, node 2 & 3 are equally slower memory
> only nodes, and node 4 is slowest memory only node,
>
> available: 5 nodes (0-4)
> node 0 cpus: 0 1
> node 0 size: n MB
> node 0 free: n MB
> node 1 cpus: 2 3
> node 1 size: n MB
> node 1 free: n MB
> node 2 cpus:
> node 2 size: n MB
> node 2 free: n MB
> node 3 cpus:
> node 3 size: n MB
> node 3 free: n MB
> node 4 cpus:
> node 4 size: n MB
> node 4 free: n MB
> node distances:
> node   0   1   2   3   4
>   0:  10  20  40  40  80
>   1:  20  10  40  40  80
>   2:  40  40  10  40  80
>   3:  40  40  40  10  80
>   4:  80  80  80  80  10
>
> The existing implementation gives below demotion targets,
>
> node    demotion_target
>  0              3, 2
>  1              4
>  2              X
>  3              X
>  4		X
>
> With this patch applied, below are the demotion targets,
>
> node    demotion_target
>  0              3, 2
>  1              3, 2
>  2              3
>  3              4
>  4		X

For such a machine, I think the perfect demotion order is,

node    demotion_target
 0              2, 3
 1              2, 3
 2              4
 3              4
 4              X

> e.g. with below NUMA topology, where node 0, 1 & 2 are
> cpu + dram nodes and node 3 is slow memory node,
>
> available: 4 nodes (0-3)
> node 0 cpus: 0 1
> node 0 size: n MB
> node 0 free: n MB
> node 1 cpus: 2 3
> node 1 size: n MB
> node 1 free: n MB
> node 2 cpus: 4 5
> node 2 size: n MB
> node 2 free: n MB
> node 3 cpus:
> node 3 size: n MB
> node 3 free: n MB
> node distances:
> node   0   1   2   3
>   0:  10  20  20  40
>   1:  20  10  20  40
>   2:  20  20  10  40
>   3:  40  40  40  10
>
> The existing implementation gives below demotion targets,
>
> node    demotion_target
>  0              3
>  1              X
>  2              X
>  3              X
>
> With this patch applied, below are the demotion targets,
>
> node    demotion_target
>  0              3
>  1              3
>  2              3
>  3              X

I think this is perfect already.

> with below NUMA topology, where node 0 & 2 are cpu + dram
> nodes and node 1 & 3 are slow memory nodes,
>
> available: 4 nodes (0-3)
> node 0 cpus: 0 1
> node 0 size: n MB
> node 0 free: n MB
> node 1 cpus:
> node 1 size: n MB
> node 1 free: n MB
> node 2 cpus: 2 3
> node 2 size: n MB
> node 2 free: n MB
> node 3 cpus:
> node 3 size: n MB
> node 3 free: n MB
> node distances:
> node   0   1   2   3
>   0:  10  40  20  80
>   1:  40  10  80  80
>   2:  20  80  10  40
>   3:  80  80  40  10
>
> The existing implementation gives below demotion targets,
>
> node    demotion_target
>  0              3
>  1              X
>  2              3
>  3              X

It should be as below, as you said in another email in the thread.

node    demotion_target
 0              1
 1              X
 2              3
 3              X

> With this patch applied, below are the demotion targets,
>
> node    demotion_target
>  0              1
>  1              3
>  2              3
>  3              X

The original demotion order looks better to me.  Nodes 1 and 3 are at the
same level from the perspective of the whole system.

Another example, node 0 & 2 are cpu + dram nodes and node 1 are slow
memory node near node 0,

available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: n MB
node 0 free: n MB
node 1 cpus:
node 1 size: n MB
node 1 free: n MB
node 2 cpus: 2 3
node 2 size: n MB
node 2 free: n MB
node distances:
node   0   1   2
  0:  10  40  20
  1:  40  10  80
  2:  20  80  10


Demotion order 1:

node    demotion_target
 0              1
 1              X
 2              X

Demotion order 2:

node    demotion_target
 0              1
 1              X
 2              1

Demotion order 2 looks better.  But I think that demotion order 1 makes
some sense too (like node reclaim mode).

It seems that,

If a demotion target has the same distance to several current demotion
sources, the demotion target should be shared among those demotion
sources.

And as Dave pointed out, we may eventually need a mechanism to override
the default demotion order generated automatically.  So we can just use
some simple mechanism that makes sense in most cases in the kernel
automatically, and leave the best demotion order to some user
customization mechanism.

> As it can be seen above, node 3 can be demotion target for node
> 1 but existing implementation doesn't configure it that way. It
> is better to move pages from node 1 to node 3 instead of moving
> it from node 1 to swap.
>
> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

Best Regards,
Huang, Ying

[snip]
Huang, Ying March 30, 2022, 6:54 a.m. UTC | #8
Baolin Wang <baolin.wang@linux.alibaba.com> writes:

> On 3/29/2022 10:04 PM, Jagdish Gediya wrote:
>> On Tue, Mar 29, 2022 at 08:26:05PM +0800, Baolin Wang wrote:
>> Hi Baolin,
>>> Hi Jagdish,
>>>
>>> On 3/29/2022 7:52 PM, Jagdish Gediya wrote:
>>>> The current implementation to identify the demotion
>>>> targets limits some of the opportunities to share
>>>> the demotion targets between multiple source nodes.
>>>>
>>>> Implement a logic to identify the loop in the demotion
>>>> targets such that all the possibilities of demotion can
>>>> be utilized. Don't share the used targets between all
>>>> the nodes, instead create the used targets from scratch
>>>> for each individual node based on for what all node this
>>>> node is a demotion target. This helps to share the demotion
>>>> targets without missing any possible way of demotion.
>>>>
>>>> e.g. with below NUMA topology, where node 0 & 1 are
>>>> cpu + dram nodes, node 2 & 3 are equally slower memory
>>>> only nodes, and node 4 is slowest memory only node,
>>>>
>>>> available: 5 nodes (0-4)
>>>> node 0 cpus: 0 1
>>>> node 0 size: n MB
>>>> node 0 free: n MB
>>>> node 1 cpus: 2 3
>>>> node 1 size: n MB
>>>> node 1 free: n MB
>>>> node 2 cpus:
>>>> node 2 size: n MB
>>>> node 2 free: n MB
>>>> node 3 cpus:
>>>> node 3 size: n MB
>>>> node 3 free: n MB
>>>> node 4 cpus:
>>>> node 4 size: n MB
>>>> node 4 free: n MB
>>>> node distances:
>>>> node   0   1   2   3   4
>>>>     0:  10  20  40  40  80
>>>>     1:  20  10  40  40  80
>>>>     2:  40  40  10  40  80
>>>>     3:  40  40  40  10  80
>>>>     4:  80  80  80  80  10
>>>>
>>>> The existing implementation gives below demotion targets,
>>>>
>>>> node    demotion_target
>>>>    0              3, 2
>>>>    1              4
>>>>    2              X
>>>>    3              X
>>>>    4		X
>>>>
>>>> With this patch applied, below are the demotion targets,
>>>>
>>>> node    demotion_target
>>>>    0              3, 2
>>>>    1              3, 2
>>>>    2              3
>>>>    3              4
>>>>    4		X
>>>
>>> Node 2 and node 3 both are slow memory and have same distance, why node 2
>>> should demote cold memory to node 3? They should have the same target
>>> demotion node 4, which is the slowest memory node, right?
>>>
>> Current demotion target finding algorithm works based on best distance, as distance between node 2 & 3 is 40 and distance between node 2 & 4 is 80, node 2 demotes to node 3.
>
> If node 2 can demote to node 3, which means node 3's memory is colder
> than node 2, right? The accessing time of node 3 should be larger than 
> node 2, then we can demote colder memory to node 3 from node 2.
>
> But node 2 and node 3 are same memory type and have same distance, the
> accessing time of node 2 and node 3 should be same too, so why add so 
> many page migration between node 2 and node 3? I'm still not sure the
> benefits.
>
> Huang Ying and Dave, how do you think about this demotion targets?

Yes.  I think the demotion target of 2 should be 4, as in my other
email in this thread.  Demoting from 2 to 3 makes no sense.

Best Regards,
Huang, Ying

[snip]
Jagdish Gediya March 30, 2022, 4:36 p.m. UTC | #9
Hi Huang,

On Wed, Mar 30, 2022 at 02:46:51PM +0800, Huang, Ying wrote:
> Hi, Jagdish,
> 
> Jagdish Gediya <jvgediya@linux.ibm.com> writes:
> 
> > The current implementation to identify the demotion
> > targets limits some of the opportunities to share
> > the demotion targets between multiple source nodes.
> 
> Yes.  It sounds reasonable to share demotion targets among multiple
> source nodes.
> 
> One question, are example machines below are real hardware now or in
> near future?  Or you just think they are possible?

They are not real hardware right now; they are future possibilities.

> And, before going into the implementation details, I think that we can
> discuss the perfect demotion order firstly.
> 
> > Implement a logic to identify the loop in the demotion
> > targets such that all the possibilities of demotion can
> > be utilized. Don't share the used targets between all
> > the nodes, instead create the used targets from scratch
> > for each individual node based on for what all node this
> > node is a demotion target. This helps to share the demotion
> > targets without missing any possible way of demotion.
> >
> > e.g. with below NUMA topology, where node 0 & 1 are
> > cpu + dram nodes, node 2 & 3 are equally slower memory
> > only nodes, and node 4 is slowest memory only node,
> >
> > available: 5 nodes (0-4)
> > node 0 cpus: 0 1
> > node 0 size: n MB
> > node 0 free: n MB
> > node 1 cpus: 2 3
> > node 1 size: n MB
> > node 1 free: n MB
> > node 2 cpus:
> > node 2 size: n MB
> > node 2 free: n MB
> > node 3 cpus:
> > node 3 size: n MB
> > node 3 free: n MB
> > node 4 cpus:
> > node 4 size: n MB
> > node 4 free: n MB
> > node distances:
> > node   0   1   2   3   4
> >   0:  10  20  40  40  80
> >   1:  20  10  40  40  80
> >   2:  40  40  10  40  80
> >   3:  40  40  40  10  80
> >   4:  80  80  80  80  10
> >
> > The existing implementation gives below demotion targets,
> >
> > node    demotion_target
> >  0              3, 2
> >  1              4
> >  2              X
> >  3              X
> >  4		X
> >
> > With this patch applied, below are the demotion targets,
> >
> > node    demotion_target
> >  0              3, 2
> >  1              3, 2
> >  2              3
> >  3              4
> >  4		X
> 
> For such machine, I think the perfect demotion order is,
> 
> node    demotion_target
>  0              2, 3
>  1              2, 3
>  2              4
>  3              4
>  4              X

The current implementation works based on the best distance algorithm,
and this patch doesn't change that, so based on the distances, the
demotion list is what I have mentioned. I understand 4 is a better
target for 2, but as per the mentioned NUMA distances and the current
algorithm, it doesn't get configured like that in the kernel.

> > e.g. with below NUMA topology, where node 0, 1 & 2 are
> > cpu + dram nodes and node 3 is slow memory node,
> >
> > available: 4 nodes (0-3)
> > node 0 cpus: 0 1
> > node 0 size: n MB
> > node 0 free: n MB
> > node 1 cpus: 2 3
> > node 1 size: n MB
> > node 1 free: n MB
> > node 2 cpus: 4 5
> > node 2 size: n MB
> > node 2 free: n MB
> > node 3 cpus:
> > node 3 size: n MB
> > node 3 free: n MB
> > node distances:
> > node   0   1   2   3
> >   0:  10  20  20  40
> >   1:  20  10  20  40
> >   2:  20  20  10  40
> >   3:  40  40  40  10
> >
> > The existing implementation gives below demotion targets,
> >
> > node    demotion_target
> >  0              3
> >  1              X
> >  2              X
> >  3              X
> >
> > With this patch applied, below are the demotion targets,
> >
> > node    demotion_target
> >  0              3
> >  1              3
> >  2              3
> >  3              X
> 
> I think this is perfect already.
> 
> > with below NUMA topology, where node 0 & 2 are cpu + dram
> > nodes and node 1 & 3 are slow memory nodes,
> >
> > available: 4 nodes (0-3)
> > node 0 cpus: 0 1
> > node 0 size: n MB
> > node 0 free: n MB
> > node 1 cpus:
> > node 1 size: n MB
> > node 1 free: n MB
> > node 2 cpus: 2 3
> > node 2 size: n MB
> > node 2 free: n MB
> > node 3 cpus:
> > node 3 size: n MB
> > node 3 free: n MB
> > node distances:
> > node   0   1   2   3
> >   0:  10  40  20  80
> >   1:  40  10  80  80
> >   2:  20  80  10  40
> >   3:  80  80  40  10
> >
> > The existing implementation gives below demotion targets,
> >
> > node    demotion_target
> >  0              3
> >  1              X
> >  2              3
> >  3              X
> 
> Should be as below as you said in another email of the thread.
> 
> node    demotion_target
>  0              1
>  1              X
>  2              3
>  3              X
> 
> > With this patch applied, below are the demotion targets,
> >
> > node    demotion_target
> >  0              1
> >  1              3
> >  2              3
> >  3              X
> 
> The original demotion order looks better for me.  1 and 3 are at the
> same level from the perspective of the whole system.
> 
> Another example, node 0 & 2 are cpu + dram nodes and node 1 are slow
> memory node near node 0,
> 
> available: 3 nodes (0-2)
> node 0 cpus: 0 1
> node 0 size: n MB
> node 0 free: n MB
> node 1 cpus:
> node 1 size: n MB
> node 1 free: n MB
> node 2 cpus: 2 3
> node 2 size: n MB
> node 2 free: n MB
> node distances:
> node   0   1   2
>   0:  10  40  20
>   1:  40  10  80
>   2:  20  80  10
> 
> 
> Demotion order 1:
> 
> node    demotion_target
>  0              1
>  1              X
>  2              X
> 
> Demotion order 2:
> 
> node    demotion_target
>  0              1
>  1              X
>  2              1
> 
> Demotion order 2 looks better.  But I think that demotion order 1 makes
> some sense too (like node reclaim mode).
> 
> It seems that,
> 
> If a demotion target has same distance to several current demotion
> sources, the demotion target should be shared among the demotion
> sources.

Yes, and that is where this patch is useful.

> And as Dave pointed out, we may eventually need a mechanism to override
> the default demotion order generated automatically.  So we can just use
> some simple mechanism that makes sense in most cases in kernel
> automatically.  And leave the best demotion for users to some
> customization mechanism.

Yes, we need a mechanism to override the default demotion list prepared
by the current implementation. PowerVM can have CPU-less DRAM nodes
as well, which are in fact not the right targets for demotion because
they are fast memory. We need to distinguish between memory tiers
so that slow memory can be utilized for demotion even when there
are fast-memory-only NUMA nodes.

I think we may see implementations in the future to override the default
behavior, e.g. when systems have both fast-only and slow-only memory
nodes; in that case it will make sense to demote to a slow-memory-only
node even if it is far away. But this patch is about bringing the
current implementation in line with its intended design, 'best distance
based demotion targets'.

> > As it can be seen above, node 3 can be demotion target for node
> > 1 but existing implementation doesn't configure it that way. It
> > is better to move pages from node 1 to node 3 instead of moving
> > it from node 1 to swap.
> >
> > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> 
> Best Regards,
> Huang, Ying
> 
> [snip]
> 
Best Regards,
Jagdish
Aneesh Kumar K.V March 30, 2022, 5:17 p.m. UTC | #10
"Huang, Ying" <ying.huang@intel.com> writes:

> Hi, Jagdish,
>
> Jagdish Gediya <jvgediya@linux.ibm.com> writes:
>

...

>> e.g. with below NUMA topology, where node 0 & 1 are
>> cpu + dram nodes, node 2 & 3 are equally slower memory
>> only nodes, and node 4 is slowest memory only node,
>>
>> available: 5 nodes (0-4)
>> node 0 cpus: 0 1
>> node 0 size: n MB
>> node 0 free: n MB
>> node 1 cpus: 2 3
>> node 1 size: n MB
>> node 1 free: n MB
>> node 2 cpus:
>> node 2 size: n MB
>> node 2 free: n MB
>> node 3 cpus:
>> node 3 size: n MB
>> node 3 free: n MB
>> node 4 cpus:
>> node 4 size: n MB
>> node 4 free: n MB
>> node distances:
>> node   0   1   2   3   4
>>   0:  10  20  40  40  80
>>   1:  20  10  40  40  80
>>   2:  40  40  10  40  80
>>   3:  40  40  40  10  80
>>   4:  80  80  80  80  10
>>
>> The existing implementation gives below demotion targets,
>>
>> node    demotion_target
>>  0              3, 2
>>  1              4
>>  2              X
>>  3              X
>>  4		X
>>
>> With this patch applied, below are the demotion targets,
>>
>> node    demotion_target
>>  0              3, 2
>>  1              3, 2
>>  2              3
>>  3              4
>>  4		X
>
> For such machine, I think the perfect demotion order is,
>
> node    demotion_target
>  0              2, 3
>  1              2, 3
>  2              4
>  3              4
>  4              X

I guess "equally slow nodes" is a confusing definition here. If the
system consists of two 1GB equally slow memory devices and the firmware
doesn't want to differentiate between them, the firmware can present a
single NUMA node with 2GB capacity, right? The fact that we are finding
two NUMA nodes is a hint that there is some difference between these
two memory devices. This is also captured by the fact that the distance
between 2 and 3 is 40 and not 10.

For that specific topology, where the distance between 2 and 3 is 40
and between 2 and 4 is 80, the demotion target derived by the new code
is better, right?

...


-aneesh
Huang, Ying March 31, 2022, 12:27 a.m. UTC | #11
Jagdish Gediya <jvgediya@linux.ibm.com> writes:

> Hi Huang,
>
> On Wed, Mar 30, 2022 at 02:46:51PM +0800, Huang, Ying wrote:
>> Hi, Jagdish,
>> 
>> Jagdish Gediya <jvgediya@linux.ibm.com> writes:
>> 
>> > The current implementation to identify the demotion
>> > targets limits some of the opportunities to share
>> > the demotion targets between multiple source nodes.
>> 
>> Yes.  It sounds reasonable to share demotion targets among multiple
>> source nodes.
>> 
>> One question, are example machines below are real hardware now or in
>> near future?  Or you just think they are possible?
>
> They are not real hardware right now, they are the future possibilities.
>
>> And, before going into the implementation details, I think that we can
>> discuss the perfect demotion order firstly.
>> 
>> > Implement a logic to identify the loop in the demotion
>> > targets such that all the possibilities of demotion can
>> > be utilized. Don't share the used targets between all
>> > the nodes, instead create the used targets from scratch
>> > for each individual node based on for what all node this
>> > node is a demotion target. This helps to share the demotion
>> > targets without missing any possible way of demotion.
>> >
>> > e.g. with below NUMA topology, where node 0 & 1 are
>> > cpu + dram nodes, node 2 & 3 are equally slower memory
>> > only nodes, and node 4 is slowest memory only node,
>> >
>> > available: 5 nodes (0-4)
>> > node 0 cpus: 0 1
>> > node 0 size: n MB
>> > node 0 free: n MB
>> > node 1 cpus: 2 3
>> > node 1 size: n MB
>> > node 1 free: n MB
>> > node 2 cpus:
>> > node 2 size: n MB
>> > node 2 free: n MB
>> > node 3 cpus:
>> > node 3 size: n MB
>> > node 3 free: n MB
>> > node 4 cpus:
>> > node 4 size: n MB
>> > node 4 free: n MB
>> > node distances:
>> > node   0   1   2   3   4
>> >   0:  10  20  40  40  80
>> >   1:  20  10  40  40  80
>> >   2:  40  40  10  40  80
>> >   3:  40  40  40  10  80
>> >   4:  80  80  80  80  10
>> >
>> > The existing implementation gives below demotion targets,
>> >
>> > node    demotion_target
>> >  0              3, 2
>> >  1              4
>> >  2              X
>> >  3              X
>> >  4		X
>> >
>> > With this patch applied, below are the demotion targets,
>> >
>> > node    demotion_target
>> >  0              3, 2
>> >  1              3, 2
>> >  2              3
>> >  3              4
>> >  4		X
>> 
>> For such machine, I think the perfect demotion order is,
>> 
>> node    demotion_target
>>  0              2, 3
>>  1              2, 3
>>  2              4
>>  3              4
>>  4              X
>
> Current implementation works based on the best distance algorithm,
> this patch doesn't change it either, so based on the distance, the
> demotion list is what I have mentioned. I understand 4 is better
> target for 2 but as per the mentioned numa distances and current
> algorithm, it doesn't get configured like that in the kernel.

If we agree on the perfect demotion order, can we revise the current
algorithm a little to make it happen?

Is the following solution possible?

1st round:
  sources: 0, 1
  exclude targets: 0, 1
  result: 0 - > 2,3; 1 -> 2,3

2nd round:
  sources: 2, 3
  exclude targets: 0, 1, 2, 3
  result: 2 -> 4; 3 -> 4

3rd round:
  sources: 4
  exclude targets: 0, 1, 2, 3, 4
  result: 4 -> X

That is, in each round, we exclude the sources and targets in the
previous rounds as demotion targets.  Compared with the original
algorithm, we don't exclude targets used in the current round,
but we will exclude them in the next round.
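
A rough sketch of that scheme on top of the existing helpers
(illustrative only, not a tested patch):

	nodemask_t this_pass = node_states[N_CPU];
	nodemask_t excluded = NODE_MASK_NONE;
	int node;

	while (!nodes_empty(this_pass)) {
		nodemask_t next_pass = NODE_MASK_NONE;

		/* sources (and targets of earlier rounds) are excluded */
		nodes_or(excluded, excluded, this_pass);

		for_each_node_mask(node, this_pass) {
			/*
			 * Per-source copy, so a target picked for one
			 * source in this round stays available for the
			 * other sources of the same round.
			 */
			nodemask_t used = excluded;
			int target, best_distance = -1;

			do {
				target = establish_migrate_target(node, &used,
								  best_distance);
				if (target == NUMA_NO_NODE)
					break;
				if (best_distance == -1)
					best_distance = node_distance(node, target);
				node_set(target, next_pass);
			} while (1);
		}
		this_pass = next_pass;
	}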

If this doesn't work, can you think about whether it is possible to find
some way?

>> > e.g. with below NUMA topology, where node 0, 1 & 2 are
>> > cpu + dram nodes and node 3 is slow memory node,
>> >
>> > available: 4 nodes (0-3)
>> > node 0 cpus: 0 1
>> > node 0 size: n MB
>> > node 0 free: n MB
>> > node 1 cpus: 2 3
>> > node 1 size: n MB
>> > node 1 free: n MB
>> > node 2 cpus: 4 5
>> > node 2 size: n MB
>> > node 2 free: n MB
>> > node 3 cpus:
>> > node 3 size: n MB
>> > node 3 free: n MB
>> > node distances:
>> > node   0   1   2   3
>> >   0:  10  20  20  40
>> >   1:  20  10  20  40
>> >   2:  20  20  10  40
>> >   3:  40  40  40  10
>> >
>> > The existing implementation gives below demotion targets,
>> >
>> > node    demotion_target
>> >  0              3
>> >  1              X
>> >  2              X
>> >  3              X
>> >
>> > With this patch applied, below are the demotion targets,
>> >
>> > node    demotion_target
>> >  0              3
>> >  1              3
>> >  2              3
>> >  3              X
>> 
>> I think this is perfect already.
>> 
>> > with below NUMA topology, where node 0 & 2 are cpu + dram
>> > nodes and node 1 & 3 are slow memory nodes,
>> >
>> > available: 4 nodes (0-3)
>> > node 0 cpus: 0 1
>> > node 0 size: n MB
>> > node 0 free: n MB
>> > node 1 cpus:
>> > node 1 size: n MB
>> > node 1 free: n MB
>> > node 2 cpus: 2 3
>> > node 2 size: n MB
>> > node 2 free: n MB
>> > node 3 cpus:
>> > node 3 size: n MB
>> > node 3 free: n MB
>> > node distances:
>> > node   0   1   2   3
>> >   0:  10  40  20  80
>> >   1:  40  10  80  80
>> >   2:  20  80  10  40
>> >   3:  80  80  40  10
>> >
>> > The existing implementation gives below demotion targets,
>> >
>> > node    demotion_target
>> >  0              3
>> >  1              X
>> >  2              3
>> >  3              X
>> 
>> Should be as below as you said in another email of the thread.
>> 
>> node    demotion_target
>>  0              1
>>  1              X
>>  2              3
>>  3              X
>> 
>> > With this patch applied, below are the demotion targets,
>> >
>> > node    demotion_target
>> >  0              1
>> >  1              3
>> >  2              3
>> >  3              X
>> 
>> The original demotion order looks better for me.  1 and 3 are at the
>> same level from the perspective of the whole system.
>> 
>> Another example, node 0 & 2 are cpu + dram nodes and node 1 are slow
>> memory node near node 0,
>> 
>> available: 3 nodes (0-2)
>> node 0 cpus: 0 1
>> node 0 size: n MB
>> node 0 free: n MB
>> node 1 cpus:
>> node 1 size: n MB
>> node 1 free: n MB
>> node 2 cpus: 2 3
>> node 2 size: n MB
>> node 2 free: n MB
>> node distances:
>> node   0   1   2
>>   0:  10  40  20
>>   1:  40  10  80
>>   2:  20  80  10
>> 
>> 
>> Demotion order 1:
>> 
>> node    demotion_target
>>  0              1
>>  1              X
>>  2              X
>> 
>> Demotion order 2:
>> 
>> node    demotion_target
>>  0              1
>>  1              X
>>  2              1
>> 
>> Demotion order 2 looks better.  But I think that demotion order 1 makes
>> some sense too (like node reclaim mode).
>> 
>> It seems that,
>> 
>> If a demotion target has same distance to several current demotion
>> sources, the demotion target should be shared among the demotion
>> sources.
>
> Yes, and that is where this patch is useful.
>
>> And as Dave pointed out, we may eventually need a mechanism to override
>> the default demotion order generated automatically.  So we can just use
>> some simple mechanism that makes sense in most cases in kernel
>> automatically.  And leave the best demotion for users to some
>> customization mechanism.
>
> Yes, We need a mechanism to override the default demotion list prepared
> by the current implementation. PowerVM can have a cpu less dram node
> as well, which infact are not the right target for demotion because
> it is the fast memory. We need to distinguish between memory tiers
> so that slow memory can be utilized for the demotion even when there
> are fast memory only numa nodes.
>
> I think we may see implementations in future to override the default
> behavior e.g. when systems have both fast only and slow only memory
> nodes and in that case it will make sense to demote to slow memory
> only node even if it is far, but this patch is to put the current
> implementation to its intended design 'best distance based demotion
> targets'.

If this is a real problem for you already, how about implementing it
after this work has been done?

Best Regards,
Huang, Ying

>> > As it can be seen above, node 3 can be demotion target for node
>> > 1 but existing implementation doesn't configure it that way. It
>> > is better to move pages from node 1 to node 3 instead of moving
>> > it from node 1 to swap.
>> >
>> > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> 
>> Best Regards,
>> Huang, Ying
>> 
>> [snip]
>> 
> Best Regards,
> Jagdish
Huang, Ying March 31, 2022, 12:32 a.m. UTC | #12
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Hi, Jagdish,
>>
>> Jagdish Gediya <jvgediya@linux.ibm.com> writes:
>>
>
> ...
>
>>> e.g. with below NUMA topology, where node 0 & 1 are
>>> cpu + dram nodes, node 2 & 3 are equally slower memory
>>> only nodes, and node 4 is slowest memory only node,
>>>
>>> available: 5 nodes (0-4)
>>> node 0 cpus: 0 1
>>> node 0 size: n MB
>>> node 0 free: n MB
>>> node 1 cpus: 2 3
>>> node 1 size: n MB
>>> node 1 free: n MB
>>> node 2 cpus:
>>> node 2 size: n MB
>>> node 2 free: n MB
>>> node 3 cpus:
>>> node 3 size: n MB
>>> node 3 free: n MB
>>> node 4 cpus:
>>> node 4 size: n MB
>>> node 4 free: n MB
>>> node distances:
>>> node   0   1   2   3   4
>>>   0:  10  20  40  40  80
>>>   1:  20  10  40  40  80
>>>   2:  40  40  10  40  80
>>>   3:  40  40  40  10  80
>>>   4:  80  80  80  80  10
>>>
>>> The existing implementation gives below demotion targets,
>>>
>>> node    demotion_target
>>>  0              3, 2
>>>  1              4
>>>  2              X
>>>  3              X
>>>  4		X
>>>
>>> With this patch applied, below are the demotion targets,
>>>
>>> node    demotion_target
>>>  0              3, 2
>>>  1              3, 2
>>>  2              3
>>>  3              4
>>>  4		X
>>
>> For such machine, I think the perfect demotion order is,
>>
>> node    demotion_target
>>  0              2, 3
>>  1              2, 3
>>  2              4
>>  3              4
>>  4              X
>
> I guess the "equally slow nodes" is a confusing definition here. Now if the
> system consists of 2 1GB equally slow memory and the firmware doesn't want to
> differentiate between them, firmware can present a single NUMA node
> with 2GB capacity? The fact that we are finding two NUMA nodes is a hint
> that there is some difference between these two memory devices. This is
> also captured by the fact that the distance between 2 and 3 is 40 and not 10.

Do you have more information about this?

Best Regards,
Huang, Ying

> For that specific topology where the distance between 2 and 3 is 40 and 2
> and 4 is 80, the demotion target derived by the new code is better
> right? 
>
> ...
>
>
> -aneesh
Aneesh Kumar K.V March 31, 2022, 6:45 a.m. UTC | #13
"Huang, Ying" <ying.huang@intel.com> writes:

> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>
>> "Huang, Ying" <ying.huang@intel.com> writes:
>>
>>> Hi, Jagdish,
>>>
>>> Jagdish Gediya <jvgediya@linux.ibm.com> writes:
>>>
>>
>> ...
>>
>>>> e.g. with below NUMA topology, where node 0 & 1 are
>>>> cpu + dram nodes, node 2 & 3 are equally slower memory
>>>> only nodes, and node 4 is slowest memory only node,
>>>>
>>>> available: 5 nodes (0-4)
>>>> node 0 cpus: 0 1
>>>> node 0 size: n MB
>>>> node 0 free: n MB
>>>> node 1 cpus: 2 3
>>>> node 1 size: n MB
>>>> node 1 free: n MB
>>>> node 2 cpus:
>>>> node 2 size: n MB
>>>> node 2 free: n MB
>>>> node 3 cpus:
>>>> node 3 size: n MB
>>>> node 3 free: n MB
>>>> node 4 cpus:
>>>> node 4 size: n MB
>>>> node 4 free: n MB
>>>> node distances:
>>>> node   0   1   2   3   4
>>>>   0:  10  20  40  40  80
>>>>   1:  20  10  40  40  80
>>>>   2:  40  40  10  40  80
>>>>   3:  40  40  40  10  80
>>>>   4:  80  80  80  80  10
>>>>
>>>> The existing implementation gives below demotion targets,
>>>>
>>>> node    demotion_target
>>>>  0              3, 2
>>>>  1              4
>>>>  2              X
>>>>  3              X
>>>>  4		X
>>>>
>>>> With this patch applied, below are the demotion targets,
>>>>
>>>> node    demotion_target
>>>>  0              3, 2
>>>>  1              3, 2
>>>>  2              3
>>>>  3              4
>>>>  4		X
>>>
>>> For such machine, I think the perfect demotion order is,
>>>
>>> node    demotion_target
>>>  0              2, 3
>>>  1              2, 3
>>>  2              4
>>>  3              4
>>>  4              X
>>
>> I guess the "equally slow nodes" is a confusing definition here. Now if the
>> system consists of 2 1GB equally slow memory and the firmware doesn't want to
>> differentiate between them, firmware can present a single NUMA node
>> with 2GB capacity? The fact that we are finding two NUMA nodes is a hint
>> that there is some difference between these two memory devices. This is
>> also captured by the fact that the distance between 2 and 3 is 40 and not 10.
>
> Do you have more information about this?

Not sure I follow the question there. I was asking whether firmware
shouldn't present a single NUMA node if two memory devices are of the
same type. How will Optane present such a config? Both DIMMs will have
the same proximity domain value, and hence dax kmem will add them to the
same NUMA node?

If you are suggesting that firmware doesn't do that, then I agree with you
that a demotion target like the below is good. 

 node    demotion_target
  0              2, 3
  1              2, 3
  2              4
  3              4
  4              X

We can also achieve that with a simple change as below.

@@ -3120,7 +3120,7 @@ static void __set_migration_target_nodes(void)
 {
 	nodemask_t next_pass	= NODE_MASK_NONE;
 	nodemask_t this_pass	= NODE_MASK_NONE;
-	nodemask_t used_targets = NODE_MASK_NONE;
+	nodemask_t this_pass_used_targets = NODE_MASK_NONE;
 	int node, best_distance;
 
 	/*
@@ -3141,17 +3141,20 @@ static void __set_migration_target_nodes(void)
 	/*
 	 * To avoid cycles in the migration "graph", ensure
 	 * that migration sources are not future targets by
-	 * setting them in 'used_targets'.  Do this only
+	 * setting them in 'this_pass_used_targets'.  Do this only
 	 * once per pass so that multiple source nodes can
 	 * share a target node.
 	 *
-	 * 'used_targets' will become unavailable in future
+	 * 'this_pass_used_targets' will become unavailable in future
 	 * passes.  This limits some opportunities for
 	 * multiple source nodes to share a destination.
 	 */
-	nodes_or(used_targets, used_targets, this_pass);
+	nodes_or(this_pass_used_targets, this_pass_used_targets, this_pass);
 
 	for_each_node_mask(node, this_pass) {
+
+		nodemask_t used_targets = this_pass_used_targets;
+
 		best_distance = -1;
 
 		/*
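
To make the effect of that per-node copy concrete, below is a minimal
user-space sketch (illustration only, not kernel code; the distance matrix
and the next_best_node() helper are simplified stand-ins for
node_distance() and find_next_best_node()). It replays the pass-based
algorithm with a private copy of 'used_targets' per source node on the
5-node topology from the changelog and prints the "0,1 -> 2,3; 2,3 -> 4"
order discussed above:

/* demotion_passes.c - user-space illustration only.
 * Build and run: cc -o demotion_passes demotion_passes.c && ./demotion_passes
 */
#include <stdio.h>

#define NR_NODES 5

static const int dist[NR_NODES][NR_NODES] = {
    { 10, 20, 40, 40, 80 },
    { 20, 10, 40, 40, 80 },
    { 40, 40, 10, 40, 80 },
    { 40, 40, 40, 10, 80 },
    { 80, 80, 80, 80, 10 },
};
static const unsigned cpu_nodes = 0x3;      /* nodes 0 and 1 have CPUs */

/* Nearest node to 'node' not yet in 'used'; marks it used, roughly what
 * find_next_best_node() does. Returns -1 when nothing is left. */
static int next_best_node(int node, unsigned *used)
{
    int best = -1, i;

    for (i = 0; i < NR_NODES; i++) {
        if (*used & (1u << i))
            continue;
        if (best == -1 || dist[node][i] < dist[node][best])
            best = i;
    }
    if (best != -1)
        *used |= 1u << best;
    return best;
}

int main(void)
{
    unsigned next_pass = cpu_nodes;         /* start from the CPU nodes */
    unsigned pass_used = 0;
    unsigned targets[NR_NODES] = { 0 };
    int node, target, i;

    while (next_pass) {
        unsigned this_pass = next_pass;

        next_pass = 0;
        pass_used |= this_pass;             /* sources never become targets */

        for (node = 0; node < NR_NODES; node++) {
            unsigned used = pass_used;      /* per-node copy, as suggested */
            int best_distance = -1;

            if (!(this_pass & (1u << node)))
                continue;

            while ((target = next_best_node(node, &used)) != -1) {
                if (best_distance == -1)
                    best_distance = dist[node][target];
                else if (dist[node][target] > best_distance)
                    break;
                targets[node] |= 1u << target;
                next_pass |= 1u << target;
            }
        }
    }

    for (node = 0; node < NR_NODES; node++) {
        printf("node %d ->", node);
        for (i = 0; i < NR_NODES; i++)
            if (targets[node] & (1u << i))
                printf(" %d", i);
        printf("\n");   /* prints 0,1 -> 2 3; 2,3 -> 4; 4 -> nothing */
    }
    return 0;
}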
Huang, Ying March 31, 2022, 7:23 a.m. UTC | #14
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>
>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>
>>>> Hi, Jagdish,
>>>>
>>>> Jagdish Gediya <jvgediya@linux.ibm.com> writes:
>>>>
>>>
>>> ...
>>>
>>>>> e.g. with below NUMA topology, where node 0 & 1 are
>>>>> cpu + dram nodes, node 2 & 3 are equally slower memory
>>>>> only nodes, and node 4 is slowest memory only node,
>>>>>
>>>>> available: 5 nodes (0-4)
>>>>> node 0 cpus: 0 1
>>>>> node 0 size: n MB
>>>>> node 0 free: n MB
>>>>> node 1 cpus: 2 3
>>>>> node 1 size: n MB
>>>>> node 1 free: n MB
>>>>> node 2 cpus:
>>>>> node 2 size: n MB
>>>>> node 2 free: n MB
>>>>> node 3 cpus:
>>>>> node 3 size: n MB
>>>>> node 3 free: n MB
>>>>> node 4 cpus:
>>>>> node 4 size: n MB
>>>>> node 4 free: n MB
>>>>> node distances:
>>>>> node   0   1   2   3   4
>>>>>   0:  10  20  40  40  80
>>>>>   1:  20  10  40  40  80
>>>>>   2:  40  40  10  40  80
>>>>>   3:  40  40  40  10  80
>>>>>   4:  80  80  80  80  10
>>>>>
>>>>> The existing implementation gives below demotion targets,
>>>>>
>>>>> node    demotion_target
>>>>>  0              3, 2
>>>>>  1              4
>>>>>  2              X
>>>>>  3              X
>>>>>  4		X
>>>>>
>>>>> With this patch applied, below are the demotion targets,
>>>>>
>>>>> node    demotion_target
>>>>>  0              3, 2
>>>>>  1              3, 2
>>>>>  2              3
>>>>>  3              4
>>>>>  4		X
>>>>
>>>> For such machine, I think the perfect demotion order is,
>>>>
>>>> node    demotion_target
>>>>  0              2, 3
>>>>  1              2, 3
>>>>  2              4
>>>>  3              4
>>>>  4              X
>>>
>>> I guess the "equally slow nodes" is a confusing definition here. Now if the
>>> system consists of 2 1GB equally slow memory and the firmware doesn't want to
>>> differentiate between them, firmware can present a single NUMA node
>>> with 2GB capacity? The fact that we are finding two NUMA nodes is a hint
>>> that there is some difference between these two memory devices. This is
>>> also captured by the fact that the distance between 2 and 3 is 40 and not 10.
>>
>> Do you have more information about this?
>
> Not sure I follow the question there. I was checking shouldn't firmware
> do a single NUMA node if two memory devices are of the same type? How will
> optane present such a config? Both the DIMMs will have the same
> proximity domain value and hence dax kmem will add them to the same NUMA
> node?

Sorry for the confusion.  I just wanted to check whether you have more
information about the machine configuration above.  The machines I have
at hand don't have a NUMA topology as complex as the one in the patch
description.

> If you are suggesting that firmware doesn't do that, then I agree with you
> that a demotion target like the below is good. 
>
>  node    demotion_target
>   0              2, 3
>   1              2, 3
>   2              4
>   3              4
>   4              X
>
> We can also achieve that with a smiple change as below.

Glad to see the demotion order can be implemented in a simple way.

My concern is: is it necessary to do this?  If there are real
machines with such a NUMA topology, then I think it's good to add the
support.  But if not, why make the code unnecessarily complex?

I don't have these kinds of machines; do you have them now, or will you in
the future?

> @@ -3120,7 +3120,7 @@ static void __set_migration_target_nodes(void)
>  {
>  	nodemask_t next_pass	= NODE_MASK_NONE;
>  	nodemask_t this_pass	= NODE_MASK_NONE;
> -	nodemask_t used_targets = NODE_MASK_NONE;
> +	nodemask_t this_pass_used_targets = NODE_MASK_NONE;
>  	int node, best_distance;
>  
>  	/*
> @@ -3141,17 +3141,20 @@ static void __set_migration_target_nodes(void)
>  	/*
>  	 * To avoid cycles in the migration "graph", ensure
>  	 * that migration sources are not future targets by
> -	 * setting them in 'used_targets'.  Do this only
> +	 * setting them in 'this_pass_used_targets'.  Do this only
>  	 * once per pass so that multiple source nodes can
>  	 * share a target node.
>  	 *
> -	 * 'used_targets' will become unavailable in future
> +	 * 'this_pass_used_targets' will become unavailable in future
>  	 * passes.  This limits some opportunities for
>  	 * multiple source nodes to share a destination.
>  	 */
> -	nodes_or(used_targets, used_targets, this_pass);
> +	nodes_or(this_pass_used_targets, this_pass_used_targets, this_pass);
>  
>  	for_each_node_mask(node, this_pass) {
> +
> +		nodemask_t used_targets = this_pass_used_targets;
> +
>  		best_distance = -1;
>  
>  		/*

Best Regards,
Huang, Ying
Aneesh Kumar K.V March 31, 2022, 8:27 a.m. UTC | #15
"Huang, Ying" <ying.huang@intel.com> writes:

> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>
>> "Huang, Ying" <ying.huang@intel.com> writes:
>>
>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>
>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>
>>>>> Hi, Jagdish,
>>>>>
>>>>> Jagdish Gediya <jvgediya@linux.ibm.com> writes:
>>>>>
>>>>
>>>> ...
>>>>
>>>>>> e.g. with below NUMA topology, where node 0 & 1 are
>>>>>> cpu + dram nodes, node 2 & 3 are equally slower memory
>>>>>> only nodes, and node 4 is slowest memory only node,
>>>>>>
>>>>>> available: 5 nodes (0-4)
>>>>>> node 0 cpus: 0 1
>>>>>> node 0 size: n MB
>>>>>> node 0 free: n MB
>>>>>> node 1 cpus: 2 3
>>>>>> node 1 size: n MB
>>>>>> node 1 free: n MB
>>>>>> node 2 cpus:
>>>>>> node 2 size: n MB
>>>>>> node 2 free: n MB
>>>>>> node 3 cpus:
>>>>>> node 3 size: n MB
>>>>>> node 3 free: n MB
>>>>>> node 4 cpus:
>>>>>> node 4 size: n MB
>>>>>> node 4 free: n MB
>>>>>> node distances:
>>>>>> node   0   1   2   3   4
>>>>>>   0:  10  20  40  40  80
>>>>>>   1:  20  10  40  40  80
>>>>>>   2:  40  40  10  40  80
>>>>>>   3:  40  40  40  10  80
>>>>>>   4:  80  80  80  80  10
>>>>>>
>>>>>> The existing implementation gives below demotion targets,
>>>>>>
>>>>>> node    demotion_target
>>>>>>  0              3, 2
>>>>>>  1              4
>>>>>>  2              X
>>>>>>  3              X
>>>>>>  4		X
>>>>>>
>>>>>> With this patch applied, below are the demotion targets,
>>>>>>
>>>>>> node    demotion_target
>>>>>>  0              3, 2
>>>>>>  1              3, 2
>>>>>>  2              3
>>>>>>  3              4
>>>>>>  4		X
>>>>>
>>>>> For such machine, I think the perfect demotion order is,
>>>>>
>>>>> node    demotion_target
>>>>>  0              2, 3
>>>>>  1              2, 3
>>>>>  2              4
>>>>>  3              4
>>>>>  4              X
>>>>
>>>> I guess the "equally slow nodes" is a confusing definition here. Now if the
>>>> system consists of 2 1GB equally slow memory and the firmware doesn't want to
>>>> differentiate between them, firmware can present a single NUMA node
>>>> with 2GB capacity? The fact that we are finding two NUMA nodes is a hint
>>>> that there is some difference between these two memory devices. This is
>>>> also captured by the fact that the distance between 2 and 3 is 40 and not 10.
>>>
>>> Do you have more information about this?
>>
>> Not sure I follow the question there. I was checking shouldn't firmware
>> do a single NUMA node if two memory devices are of the same type? How will
>> optane present such a config? Both the DIMMs will have the same
>> proximity domain value and hence dax kmem will add them to the same NUMA
>> node?
>
> Sorry for confusing.  I just wanted to check whether you have more
> information about the machine configuration above.  The machines in my
> hand have no complex NUMA topology as in the patch description.


Even with simple topologies like below

available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: 4046 MB
node 0 free: 3478 MB
node 1 cpus: 2 3
node 1 size: 4090 MB
node 1 free: 3430 MB
node 2 cpus:
node 2 size: 4074 MB
node 2 free: 4037 MB
node distances:
node   0   1   2 
  0:  10  20  40 
  1:  20  10  40 
  2:  40  40  10 

With the current code we get demotion targets assigned as below

[    0.337307] Demotion nodes for Node 0: 2
[    0.337351] Demotion nodes for Node 1: 
[    0.337380] Demotion nodes for Node 2: 

I guess we should fix that to be below?

[    0.344554] Demotion nodes for Node 0: 2
[    0.344605] Demotion nodes for Node 1: 2
[    0.344638] Demotion nodes for Node 2: 


Most of the tests we are doing use QEMU to simulate this. We
started looking at this to avoid using demotion completely when slow
memory is not present, i.e. we should have a way other than
node_states[N_MEMORY] to identify demotion targets. Virtualized platforms
can have configs with memory-only NUMA nodes backed by DRAM, and we don't
want to consider those as demotion targets.

While we are at it, can you let us know how the topology will look on a
system with two Optane DIMMs? Do both appear with the same
target_node?

>
>> If you are suggesting that firmware doesn't do that, then I agree with you
>> that a demotion target like the below is good. 
>>
>>  node    demotion_target
>>   0              2, 3
>>   1              2, 3
>>   2              4
>>   3              4
>>   4              X
>>
>> We can also achieve that with a smiple change as below.
>
> Glad to see the demotion order can be implemented in a simple way.
>
> My concern is that is it necessary to do this?  If there are real
> machines with the NUMA topology, then I think it's good to add the
> support.  But if not, why do we make the code complex unnecessarily?
>
> I don't have these kind of machines, do you have and will have?
>


Based on the above, we still need to get the simpler fix merged, right?


>> @@ -3120,7 +3120,7 @@ static void __set_migration_target_nodes(void)
>>  {
>>  	nodemask_t next_pass	= NODE_MASK_NONE;
>>  	nodemask_t this_pass	= NODE_MASK_NONE;
>> -	nodemask_t used_targets = NODE_MASK_NONE;
>> +	nodemask_t this_pass_used_targets = NODE_MASK_NONE;
>>  	int node, best_distance;
>>  
>>  	/*
>> @@ -3141,17 +3141,20 @@ static void __set_migration_target_nodes(void)
>>  	/*
>>  	 * To avoid cycles in the migration "graph", ensure
>>  	 * that migration sources are not future targets by
>> -	 * setting them in 'used_targets'.  Do this only
>> +	 * setting them in 'this_pass_used_targets'.  Do this only
>>  	 * once per pass so that multiple source nodes can
>>  	 * share a target node.
>>  	 *
>> -	 * 'used_targets' will become unavailable in future
>> +	 * 'this_pass_used_targets' will become unavailable in future
>>  	 * passes.  This limits some opportunities for
>>  	 * multiple source nodes to share a destination.
>>  	 */
>> -	nodes_or(used_targets, used_targets, this_pass);
>> +	nodes_or(this_pass_used_targets, this_pass_used_targets, this_pass);
>>  
>>  	for_each_node_mask(node, this_pass) {
>> +
>> +		nodemask_t used_targets = this_pass_used_targets;
>> +
>>  		best_distance = -1;
>>  
>>  		/*
>
> Best Regards,
> Huang, Ying
Huang, Ying March 31, 2022, 8:58 a.m. UTC | #16
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>
>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>
>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>
>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>
>>>>>> Hi, Jagdish,
>>>>>>
>>>>>> Jagdish Gediya <jvgediya@linux.ibm.com> writes:
>>>>>>
>>>>>
>>>>> ...
>>>>>
>>>>>>> e.g. with below NUMA topology, where node 0 & 1 are
>>>>>>> cpu + dram nodes, node 2 & 3 are equally slower memory
>>>>>>> only nodes, and node 4 is slowest memory only node,
>>>>>>>
>>>>>>> available: 5 nodes (0-4)
>>>>>>> node 0 cpus: 0 1
>>>>>>> node 0 size: n MB
>>>>>>> node 0 free: n MB
>>>>>>> node 1 cpus: 2 3
>>>>>>> node 1 size: n MB
>>>>>>> node 1 free: n MB
>>>>>>> node 2 cpus:
>>>>>>> node 2 size: n MB
>>>>>>> node 2 free: n MB
>>>>>>> node 3 cpus:
>>>>>>> node 3 size: n MB
>>>>>>> node 3 free: n MB
>>>>>>> node 4 cpus:
>>>>>>> node 4 size: n MB
>>>>>>> node 4 free: n MB
>>>>>>> node distances:
>>>>>>> node   0   1   2   3   4
>>>>>>>   0:  10  20  40  40  80
>>>>>>>   1:  20  10  40  40  80
>>>>>>>   2:  40  40  10  40  80
>>>>>>>   3:  40  40  40  10  80
>>>>>>>   4:  80  80  80  80  10
>>>>>>>
>>>>>>> The existing implementation gives below demotion targets,
>>>>>>>
>>>>>>> node    demotion_target
>>>>>>>  0              3, 2
>>>>>>>  1              4
>>>>>>>  2              X
>>>>>>>  3              X
>>>>>>>  4		X
>>>>>>>
>>>>>>> With this patch applied, below are the demotion targets,
>>>>>>>
>>>>>>> node    demotion_target
>>>>>>>  0              3, 2
>>>>>>>  1              3, 2
>>>>>>>  2              3
>>>>>>>  3              4
>>>>>>>  4		X
>>>>>>
>>>>>> For such machine, I think the perfect demotion order is,
>>>>>>
>>>>>> node    demotion_target
>>>>>>  0              2, 3
>>>>>>  1              2, 3
>>>>>>  2              4
>>>>>>  3              4
>>>>>>  4              X
>>>>>
>>>>> I guess the "equally slow nodes" is a confusing definition here. Now if the
>>>>> system consists of 2 1GB equally slow memory and the firmware doesn't want to
>>>>> differentiate between them, firmware can present a single NUMA node
>>>>> with 2GB capacity? The fact that we are finding two NUMA nodes is a hint
>>>>> that there is some difference between these two memory devices. This is
>>>>> also captured by the fact that the distance between 2 and 3 is 40 and not 10.
>>>>
>>>> Do you have more information about this?
>>>
>>> Not sure I follow the question there. I was checking shouldn't firmware
>>> do a single NUMA node if two memory devices are of the same type? How will
>>> optane present such a config? Both the DIMMs will have the same
>>> proximity domain value and hence dax kmem will add them to the same NUMA
>>> node?
>>
>> Sorry for confusing.  I just wanted to check whether you have more
>> information about the machine configuration above.  The machines in my
>> hand have no complex NUMA topology as in the patch description.
>
>
> Even with simple topologies like below
>
> available: 3 nodes (0-2)
> node 0 cpus: 0 1
> node 0 size: 4046 MB
> node 0 free: 3478 MB
> node 1 cpus: 2 3
> node 1 size: 4090 MB
> node 1 free: 3430 MB
> node 2 cpus:
> node 2 size: 4074 MB
> node 2 free: 4037 MB
> node distances:
> node   0   1   2 
>   0:  10  20  40 
>   1:  20  10  40 
>   2:  40  40  10 
>
> With current code we get demotion targets assigned as below
>
> [    0.337307] Demotion nodes for Node 0: 2
> [    0.337351] Demotion nodes for Node 1: 
> [    0.337380] Demotion nodes for Node 2: 
>
> I guess we should fix that to be below?
>
> [    0.344554] Demotion nodes for Node 0: 2
> [    0.344605] Demotion nodes for Node 1: 2
> [    0.344638] Demotion nodes for Node 2: 

If the cross-socket link has enough bandwidth to accommodate the PMEM
throughput, the new one is better.  If it hasn't, the old one may be
better.  So, I think we need some kind of user space overridden support
here.  Right?

> Most of the tests we are doing are using Qemu to simulate this. We
> started looking at this to avoid using demotion completely when slow
> memory is not present. ie, we should have a different way to identify
> demotion targets other than node_states[N_MEMORY]. Virtualized platforms
> can have configs with memory only NUMA nodes with DRAM and we don't
> want to consider those as demotion targets. 

Even if demotion targets are set for some node, demotion will not
work until it is enabled via sysfs
(/sys/kernel/mm/numa/demotion_enabled).  So for a system without slow
memory, just don't enable demotion.
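
For completeness, a trivial user-space sketch of turning that knob on
(equivalent to writing "1" to the sysfs file quoted above; needs root):

/* enable_demotion.c - minimal sketch, not part of the kernel tree. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/kernel/mm/numa/demotion_enabled", "w");

    if (!f) {
        perror("demotion_enabled");
        return 1;
    }
    fputs("1\n", f);            /* write "0" to disable demotion again */
    return fclose(f) ? 1 : 0;
}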

> While we are at it can you let us know how topology will look on a
> system with two optane DIMMs? Do both appear with the same
> target_node? 

In my test system, multiple Optane DIMMs in one socket are
represented as one NUMA node.

I remember Baolin has a different configuration.

Hi, Baolin, can you provide some information about this?

>>
>>> If you are suggesting that firmware doesn't do that, then I agree with you
>>> that a demotion target like the below is good. 
>>>
>>>  node    demotion_target
>>>   0              2, 3
>>>   1              2, 3
>>>   2              4
>>>   3              4
>>>   4              X
>>>
>>> We can also achieve that with a smiple change as below.
>>
>> Glad to see the demotion order can be implemented in a simple way.
>>
>> My concern is that is it necessary to do this?  If there are real
>> machines with the NUMA topology, then I think it's good to add the
>> support.  But if not, why do we make the code complex unnecessarily?
>>
>> I don't have these kind of machines, do you have and will have?
>>
>
>
> Based on the above, we still need to get the simpler fix merged right?

Or user override support?

Best Regards,
Huang, Ying

[snip]
Baolin Wang March 31, 2022, 9:33 a.m. UTC | #17
On 3/31/2022 4:58 PM, Huang, Ying wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> 
>> "Huang, Ying" <ying.huang@intel.com> writes:
>>
>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>
>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>
>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>>
>>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>>
>>>>>>> Hi, Jagdish,
>>>>>>>
>>>>>>> Jagdish Gediya <jvgediya@linux.ibm.com> writes:
>>>>>>>
>>>>>>
>>>>>> ...
>>>>>>
>>>>>>>> e.g. with below NUMA topology, where node 0 & 1 are
>>>>>>>> cpu + dram nodes, node 2 & 3 are equally slower memory
>>>>>>>> only nodes, and node 4 is slowest memory only node,
>>>>>>>>
>>>>>>>> available: 5 nodes (0-4)
>>>>>>>> node 0 cpus: 0 1
>>>>>>>> node 0 size: n MB
>>>>>>>> node 0 free: n MB
>>>>>>>> node 1 cpus: 2 3
>>>>>>>> node 1 size: n MB
>>>>>>>> node 1 free: n MB
>>>>>>>> node 2 cpus:
>>>>>>>> node 2 size: n MB
>>>>>>>> node 2 free: n MB
>>>>>>>> node 3 cpus:
>>>>>>>> node 3 size: n MB
>>>>>>>> node 3 free: n MB
>>>>>>>> node 4 cpus:
>>>>>>>> node 4 size: n MB
>>>>>>>> node 4 free: n MB
>>>>>>>> node distances:
>>>>>>>> node   0   1   2   3   4
>>>>>>>>    0:  10  20  40  40  80
>>>>>>>>    1:  20  10  40  40  80
>>>>>>>>    2:  40  40  10  40  80
>>>>>>>>    3:  40  40  40  10  80
>>>>>>>>    4:  80  80  80  80  10
>>>>>>>>
>>>>>>>> The existing implementation gives below demotion targets,
>>>>>>>>
>>>>>>>> node    demotion_target
>>>>>>>>   0              3, 2
>>>>>>>>   1              4
>>>>>>>>   2              X
>>>>>>>>   3              X
>>>>>>>>   4		X
>>>>>>>>
>>>>>>>> With this patch applied, below are the demotion targets,
>>>>>>>>
>>>>>>>> node    demotion_target
>>>>>>>>   0              3, 2
>>>>>>>>   1              3, 2
>>>>>>>>   2              3
>>>>>>>>   3              4
>>>>>>>>   4		X
>>>>>>>
>>>>>>> For such machine, I think the perfect demotion order is,
>>>>>>>
>>>>>>> node    demotion_target
>>>>>>>   0              2, 3
>>>>>>>   1              2, 3
>>>>>>>   2              4
>>>>>>>   3              4
>>>>>>>   4              X
>>>>>>
>>>>>> I guess the "equally slow nodes" is a confusing definition here. Now if the
>>>>>> system consists of 2 1GB equally slow memory and the firmware doesn't want to
>>>>>> differentiate between them, firmware can present a single NUMA node
>>>>>> with 2GB capacity? The fact that we are finding two NUMA nodes is a hint
>>>>>> that there is some difference between these two memory devices. This is
>>>>>> also captured by the fact that the distance between 2 and 3 is 40 and not 10.
>>>>>
>>>>> Do you have more information about this?
>>>>
>>>> Not sure I follow the question there. I was checking shouldn't firmware
>>>> do a single NUMA node if two memory devices are of the same type? How will
>>>> optane present such a config? Both the DIMMs will have the same
>>>> proximity domain value and hence dax kmem will add them to the same NUMA
>>>> node?
>>>
>>> Sorry for confusing.  I just wanted to check whether you have more
>>> information about the machine configuration above.  The machines in my
>>> hand have no complex NUMA topology as in the patch description.
>>
>>
>> Even with simple topologies like below
>>
>> available: 3 nodes (0-2)
>> node 0 cpus: 0 1
>> node 0 size: 4046 MB
>> node 0 free: 3478 MB
>> node 1 cpus: 2 3
>> node 1 size: 4090 MB
>> node 1 free: 3430 MB
>> node 2 cpus:
>> node 2 size: 4074 MB
>> node 2 free: 4037 MB
>> node distances:
>> node   0   1   2
>>    0:  10  20  40
>>    1:  20  10  40
>>    2:  40  40  10
>>
>> With current code we get demotion targets assigned as below
>>
>> [    0.337307] Demotion nodes for Node 0: 2
>> [    0.337351] Demotion nodes for Node 1:
>> [    0.337380] Demotion nodes for Node 2:
>>
>> I guess we should fix that to be below?
>>
>> [    0.344554] Demotion nodes for Node 0: 2
>> [    0.344605] Demotion nodes for Node 1: 2
>> [    0.344638] Demotion nodes for Node 2:
> 
> If the cross-socket link has enough bandwidth to accommodate the PMEM
> throughput, the new one is better.  If it hasn't, the old one may be
> better.  So, I think we need some kind of user space overridden support
> here.  Right?
> 
>> Most of the tests we are doing are using Qemu to simulate this. We
>> started looking at this to avoid using demotion completely when slow
>> memory is not present. ie, we should have a different way to identify
>> demotion targets other than node_states[N_MEMORY]. Virtualized platforms
>> can have configs with memory only NUMA nodes with DRAM and we don't
>> want to consider those as demotion targets.
> 
> Even if the demotion targets are set for some node, the demotion will
> not work before enabling demotion via sysfs
> (/sys/kernel/mm/numa/demotion_enabled).  So for system without slow
> memory, just don't enable demotion.
> 
>> While we are at it can you let us know how topology will look on a
>> system with two optane DIMMs? Do both appear with the same
>> target_node?
> 
> In my test system, multiple optane DIMMs in one socket will be
> represented as one NUMA node.
> 
> I remember Baolin has different configuration.
> 
> Hi, Baolin,  Can you provide some information about this?

Sure. We have real machines with 2 Optane DIMMs, and they are
represented as 2 NUMA nodes. So we want to support multiple demotion
target nodes.

available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 62153 MB
node 0 free: 447 MB
node 1 cpus:
node 1 size: 126969 MB
node 1 free: 84099 MB
node 2 cpus:
node 2 size: 127006 MB
node 2 free: 126925 MB
node distances:
node   0   1   2
   0:  10  20  20
   1:  20  10  20
   2:  20  20  10
Jonathan Cameron March 31, 2022, 11:17 a.m. UTC | #18
On Wed, 30 Mar 2022 22:06:52 +0530
Jagdish Gediya <jvgediya@linux.ibm.com> wrote:

> Hi Huang,
> 
> On Wed, Mar 30, 2022 at 02:46:51PM +0800, Huang, Ying wrote:
> > Hi, Jagdish,
> > 
> > Jagdish Gediya <jvgediya@linux.ibm.com> writes:
> >   
> > > The current implementation to identify the demotion
> > > targets limits some of the opportunities to share
> > > the demotion targets between multiple source nodes.  
> > 
> > Yes.  It sounds reasonable to share demotion targets among multiple
> > source nodes.
> > 
> > One question, are example machines below are real hardware now or in
> > near future?  Or you just think they are possible?  
> 
> They are not real hardware right now, they are the future possibilities.

I'll strengthen that a bit to say they are highly likely to turn up
fairly soon.  Often they will be the result of specific interleaving
decisions and might not come from SRAT (e.g. might be added later as a
result of CXL discovery), but the principle will be the same.  Also, in
some cases the firmware will have done the CXL setup, so it will be via SRAT.

Example 1:
> > e.g. with below NUMA topology, where node 0 & 1 are
> > cpu + dram nodes, node 2 & 3 are equally slower memory
> > only nodes, and node 4 is slowest memory only node,
> >

A couple of near-term examples of systems that will look like this.

A 2-socket system, where each socket has DRAM (nodes 0, 1) plus NVDIMM
(nodes 2, 3) being used as DRAM, and also CXL-attached memory that is
interleaved across host bridges in the two sockets to get maximum
bandwidth to a large remote pool (node 4).
An alternative for node 4 is that the system is using an IO expander
bridge on the CPU interconnect (effectively a CPU-less CPU :)

Example 2:
> > e.g. with below NUMA topology, where node 0, 1 & 2 are
> > cpu + dram nodes and node 3 is slow memory node,
> >
3-CPU-socket machines are still unusual, but the 2- or 4-socket
equivalents are simplifications of example 1.

Example 3:
> > with below NUMA topology, where node 0 & 2 are cpu + dram
> > nodes and node 1 & 3 are slow memory nodes,
I believe this is what today's simple DRAM + NVDIMM (as cheap
DRAM) 2-socket servers look like.

Example 4:
> Another example, node 0 & 2 are cpu + dram nodes and node 1 are slow
> memory node near node 0,
> 
A normal system with CXL-attached DRAM below a root port (RP) in node 0.

...

Jonathan
diff mbox series

Patch

diff --git a/mm/migrate.c b/mm/migrate.c
index 3d60823afd2d..7ec8d934e706 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2381,10 +2381,13 @@  static int establish_migrate_target(int node, nodemask_t *used,
  */
 static void __set_migration_target_nodes(void)
 {
-	nodemask_t next_pass	= NODE_MASK_NONE;
-	nodemask_t this_pass	= NODE_MASK_NONE;
 	nodemask_t used_targets = NODE_MASK_NONE;
 	int node, best_distance;
+	nodemask_t *src_nodes;
+
+	src_nodes = kcalloc(nr_node_ids, sizeof(nodemask_t), GFP_KERNEL);
+	if (!src_nodes)
+		return;
 
 	/*
 	 * Avoid any oddities like cycles that could occur
@@ -2393,29 +2396,39 @@  static void __set_migration_target_nodes(void)
 	 */
 	disable_all_migrate_targets();
 
-	/*
-	 * Allocations go close to CPUs, first.  Assume that
-	 * the migration path starts at the nodes with CPUs.
-	 */
-	next_pass = node_states[N_CPU];
-again:
-	this_pass = next_pass;
-	next_pass = NODE_MASK_NONE;
-	/*
-	 * To avoid cycles in the migration "graph", ensure
-	 * that migration sources are not future targets by
-	 * setting them in 'used_targets'.  Do this only
-	 * once per pass so that multiple source nodes can
-	 * share a target node.
-	 *
-	 * 'used_targets' will become unavailable in future
-	 * passes.  This limits some opportunities for
-	 * multiple source nodes to share a destination.
-	 */
-	nodes_or(used_targets, used_targets, this_pass);
+	for_each_online_node(node) {
+		int tmp_node;
 
-	for_each_node_mask(node, this_pass) {
 		best_distance = -1;
+		used_targets = NODE_MASK_NONE;
+
+		/*
+		 * Avoid adding same node as the demotion target.
+		 */
+		node_set(node, used_targets);
+
+		/*
+		 * Add CPU NUMA nodes to the used target list so that it
+		 * won't be considered a demotion target.
+		 */
+		nodes_or(used_targets, used_targets, node_states[N_CPU]);
+
+		/*
+		 * Add all nodes that has appeared as source node of demotion
+		 * for this target node.
+		 *
+		 * To avoid cycles in the migration "graph", ensure
+		 * that migration sources are not future targets by
+		 * setting them in 'used_targets'.
+		 */
+		for_each_node_mask(tmp_node, src_nodes[node])
+			nodes_or(used_targets, used_targets, src_nodes[tmp_node]);
+
+		/*
+		 * Now update the demotion src nodes with other nodes in graph
+		 * which got computed above.
+		 */
+		nodes_or(src_nodes[node], src_nodes[node], used_targets);
 
 		/*
 		 * Try to set up the migration path for the node, and the target
@@ -2434,20 +2447,14 @@  static void __set_migration_target_nodes(void)
 				best_distance = node_distance(node, target_node);
 
 			/*
-			 * Visit targets from this pass in the next pass.
-			 * Eventually, every node will have been part of
-			 * a pass, and will become set in 'used_targets'.
+			 * Add this node in the src_nodes list so that we can
+			 * detect the looping.
 			 */
-			node_set(target_node, next_pass);
+			node_set(node, src_nodes[target_node]);
 		} while (1);
 	}
-	/*
-	 * 'next_pass' contains nodes which became migration
-	 * targets in this pass.  Make additional passes until
-	 * no more migrations targets are available.
-	 */
-	if (!nodes_empty(next_pass))
-		goto again;
+
+	kfree(src_nodes);
 }
 
 /*
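
For anyone who wants to see the cycle-avoidance bookkeeping in isolation,
here is a user-space sketch (illustration only, not kernel code) of the
per-target 'src_nodes' masks introduced by the patch, replayed on the third
example topology from the changelog (nodes 0 & 2 are CPU + DRAM, nodes 1 & 3
are slow memory). The distance matrix and the next_best_node() helper are
simplified stand-ins for node_distance() and find_next_best_node():

/* demotion_src_nodes.c - user-space illustration only.
 * For each node, 'used' starts as {node} | CPU nodes | everything that
 * already demotes (directly or transitively) into it, so no cycle can form.
 */
#include <stdio.h>

#define NR_NODES 4

static const int dist[NR_NODES][NR_NODES] = {
    { 10, 40, 20, 80 },
    { 40, 10, 80, 80 },
    { 20, 80, 10, 40 },
    { 80, 80, 40, 10 },
};
static const unsigned cpu_nodes = 0x5;      /* nodes 0 and 2 have CPUs */

static int next_best_node(int node, unsigned *used)
{
    int best = -1, i;

    for (i = 0; i < NR_NODES; i++) {
        if (*used & (1u << i))
            continue;
        if (best == -1 || dist[node][i] < dist[node][best])
            best = i;
    }
    if (best != -1)
        *used |= 1u << best;
    return best;
}

int main(void)
{
    unsigned src_nodes[NR_NODES] = { 0 };   /* who demotes into me */
    unsigned targets[NR_NODES] = { 0 };     /* where I demote to */
    int node, tmp, target, i;

    for (node = 0; node < NR_NODES; node++) {
        unsigned used = (1u << node) | cpu_nodes;
        int best_distance = -1;

        /* Sources of my sources must not become my targets. */
        for (tmp = 0; tmp < NR_NODES; tmp++)
            if (src_nodes[node] & (1u << tmp))
                used |= src_nodes[tmp];
        src_nodes[node] |= used;

        while ((target = next_best_node(node, &used)) != -1) {
            if (best_distance == -1)
                best_distance = dist[node][target];
            else if (dist[node][target] > best_distance)
                break;
            targets[node] |= 1u << target;
            src_nodes[target] |= 1u << node;    /* remember the source */
        }
    }

    for (node = 0; node < NR_NODES; node++) {
        printf("node %d ->", node);
        for (i = 0; i < NR_NODES; i++)
            if (targets[node] & (1u << i))
                printf(" %d", i);
        printf("\n");
    }
    return 0;   /* prints 0 -> 1, 1 -> 3, 2 -> 3, 3 -> nothing */
}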