Message ID | 20220329115222.8923-1-jvgediya@linux.ibm.com (mailing list archive)
---|---
State | New
Series | mm: migrate: set demotion targets differently
Hi Jagdish, On 3/29/2022 7:52 PM, Jagdish Gediya wrote: > The current implementation to identify the demotion > targets limits some of the opportunities to share > the demotion targets between multiple source nodes. > > Implement a logic to identify the loop in the demotion > targets such that all the possibilities of demotion can > be utilized. Don't share the used targets between all > the nodes, instead create the used targets from scratch > for each individual node based on for what all node this > node is a demotion target. This helps to share the demotion > targets without missing any possible way of demotion. > > e.g. with below NUMA topology, where node 0 & 1 are > cpu + dram nodes, node 2 & 3 are equally slower memory > only nodes, and node 4 is slowest memory only node, > > available: 5 nodes (0-4) > node 0 cpus: 0 1 > node 0 size: n MB > node 0 free: n MB > node 1 cpus: 2 3 > node 1 size: n MB > node 1 free: n MB > node 2 cpus: > node 2 size: n MB > node 2 free: n MB > node 3 cpus: > node 3 size: n MB > node 3 free: n MB > node 4 cpus: > node 4 size: n MB > node 4 free: n MB > node distances: > node 0 1 2 3 4 > 0: 10 20 40 40 80 > 1: 20 10 40 40 80 > 2: 40 40 10 40 80 > 3: 40 40 40 10 80 > 4: 80 80 80 80 10 > > The existing implementation gives below demotion targets, > > node demotion_target > 0 3, 2 > 1 4 > 2 X > 3 X > 4 X > > With this patch applied, below are the demotion targets, > > node demotion_target > 0 3, 2 > 1 3, 2 > 2 3 > 3 4 > 4 X Node 2 and node 3 both are slow memory and have same distance, why node 2 should demote cold memory to node 3? They should have the same target demotion node 4, which is the slowest memory node, right? > > e.g. with below NUMA topology, where node 0, 1 & 2 are > cpu + dram nodes and node 3 is slow memory node, > > available: 4 nodes (0-3) > node 0 cpus: 0 1 > node 0 size: n MB > node 0 free: n MB > node 1 cpus: 2 3 > node 1 size: n MB > node 1 free: n MB > node 2 cpus: 4 5 > node 2 size: n MB > node 2 free: n MB > node 3 cpus: > node 3 size: n MB > node 3 free: n MB > node distances: > node 0 1 2 3 > 0: 10 20 20 40 > 1: 20 10 20 40 > 2: 20 20 10 40 > 3: 40 40 40 10 > > The existing implementation gives below demotion targets, > > node demotion_target > 0 3 > 1 X > 2 X > 3 X > > With this patch applied, below are the demotion targets, > > node demotion_target > 0 3 > 1 3 > 2 3 > 3 X Sounds reasonable. > > with below NUMA topology, where node 0 & 2 are cpu + dram > nodes and node 1 & 3 are slow memory nodes, > > available: 4 nodes (0-3) > node 0 cpus: 0 1 > node 0 size: n MB > node 0 free: n MB > node 1 cpus: > node 1 size: n MB > node 1 free: n MB > node 2 cpus: 2 3 > node 2 size: n MB > node 2 free: n MB > node 3 cpus: > node 3 size: n MB > node 3 free: n MB > node distances: > node 0 1 2 3 > 0: 10 40 20 80 > 1: 40 10 80 80 > 2: 20 80 10 40 > 3: 80 80 40 10 > > The existing implementation gives below demotion targets, > > node demotion_target > 0 3 > 1 X > 2 3 > 3 X If I understand correctly, this is not true. The demotion route should be as below with existing implementation: node 0 ---> node 1 node 1 ---> X node 2 ---> node 3 node 3 ---> X > > With this patch applied, below are the demotion targets, > > node demotion_target > 0 1 > 1 3 > 2 3 > 3 X > > As it can be seen above, node 3 can be demotion target for node > 1 but existing implementation doesn't configure it that way. It > is better to move pages from node 1 to node 3 instead of moving > it from node 1 to swap. Which means node 3 is the slowest memory node? 
> > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> > --- > mm/migrate.c | 75 ++++++++++++++++++++++++++++------------------------ > 1 file changed, 41 insertions(+), 34 deletions(-) > > diff --git a/mm/migrate.c b/mm/migrate.c > index 3d60823afd2d..7ec8d934e706 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -2381,10 +2381,13 @@ static int establish_migrate_target(int node, nodemask_t *used, > */ > static void __set_migration_target_nodes(void) > { > - nodemask_t next_pass = NODE_MASK_NONE; > - nodemask_t this_pass = NODE_MASK_NONE; > nodemask_t used_targets = NODE_MASK_NONE; > int node, best_distance; > + nodemask_t *src_nodes; > + > + src_nodes = kcalloc(nr_node_ids, sizeof(nodemask_t), GFP_KERNEL); > + if (!src_nodes) > + return; > > /* > * Avoid any oddities like cycles that could occur > @@ -2393,29 +2396,39 @@ static void __set_migration_target_nodes(void) > */ > disable_all_migrate_targets(); > > - /* > - * Allocations go close to CPUs, first. Assume that > - * the migration path starts at the nodes with CPUs. > - */ > - next_pass = node_states[N_CPU]; > -again: > - this_pass = next_pass; > - next_pass = NODE_MASK_NONE; > - /* > - * To avoid cycles in the migration "graph", ensure > - * that migration sources are not future targets by > - * setting them in 'used_targets'. Do this only > - * once per pass so that multiple source nodes can > - * share a target node. > - * > - * 'used_targets' will become unavailable in future > - * passes. This limits some opportunities for > - * multiple source nodes to share a destination. > - */ > - nodes_or(used_targets, used_targets, this_pass); > + for_each_online_node(node) { > + int tmp_node; > > - for_each_node_mask(node, this_pass) { > best_distance = -1; > + used_targets = NODE_MASK_NONE; > + > + /* > + * Avoid adding same node as the demotion target. > + */ > + node_set(node, used_targets); > + > + /* > + * Add CPU NUMA nodes to the used target list so that it > + * won't be considered a demotion target. > + */ > + nodes_or(used_targets, used_targets, node_states[N_CPU]); > + > + /* > + * Add all nodes that has appeared as source node of demotion > + * for this target node. > + * > + * To avoid cycles in the migration "graph", ensure > + * that migration sources are not future targets by > + * setting them in 'used_targets'. > + */ > + for_each_node_mask(tmp_node, src_nodes[node]) > + nodes_or(used_targets, used_targets, src_nodes[tmp_node]); > + > + /* > + * Now update the demotion src nodes with other nodes in graph > + * which got computed above. > + */ > + nodes_or(src_nodes[node], src_nodes[node], used_targets); > > /* > * Try to set up the migration path for the node, and the target > @@ -2434,20 +2447,14 @@ static void __set_migration_target_nodes(void) > best_distance = node_distance(node, target_node); > > /* > - * Visit targets from this pass in the next pass. > - * Eventually, every node will have been part of > - * a pass, and will become set in 'used_targets'. > + * Add this node in the src_nodes list so that we can > + * detect the looping. > */ > - node_set(target_node, next_pass); > + node_set(node, src_nodes[target_node]); > } while (1); > } > - /* > - * 'next_pass' contains nodes which became migration > - * targets in this pass. Make additional passes until > - * no more migrations targets are available. > - */ > - if (!nodes_empty(next_pass)) > - goto again; > + > + kfree(src_nodes); > } > > /*
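For reference, the per-node used_targets construction in the patch can be illustrated with a small userspace simulation. The sketch below is only an illustration, not kernel code: it assumes the third example topology from the changelog (nodes 0 & 2 are CPU+DRAM, nodes 1 & 3 are slow memory), uses plain bitmasks in place of nodemask_t, collects targets only at the best distance, and ignores the tie-breaking preferences inside find_next_best_node(). Iterating nodes in numeric order, it should end up with 0 -> 1, 1 -> 3, 2 -> 3, 3 -> X, matching the order claimed for that topology.

/*
 * Hypothetical userspace sketch, not kernel code: it mirrors the
 * per-node 'used_targets' construction from the patch for the third
 * example topology (nodes 0 & 2 are CPU+DRAM, nodes 1 & 3 are slow
 * memory), using plain bitmasks and only best-distance targets.
 */
#include <stdio.h>

#define NR_NODES	4
#define CPU_NODES	0x5u	/* nodes 0 and 2 have CPUs */

static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 40, 20, 80 },
	{ 40, 10, 80, 80 },
	{ 20, 80, 10, 40 },
	{ 80, 80, 40, 10 },
};

int main(void)
{
	unsigned int src_nodes[NR_NODES] = { 0 };	/* who demotes into us */
	unsigned int demotion[NR_NODES] = { 0 };
	int n, t;

	for (n = 0; n < NR_NODES; n++) {
		unsigned int used = (1u << n) | CPU_NODES;
		int best = -1;

		/* nodes that (transitively) demote into n can't be targets */
		for (t = 0; t < NR_NODES; t++)
			if (src_nodes[n] & (1u << t))
				used |= src_nodes[t];
		src_nodes[n] |= used;

		/* find the best distance among the remaining nodes */
		for (t = 0; t < NR_NODES; t++)
			if (!(used & (1u << t)) &&
			    (best < 0 || dist[n][t] < dist[n][best]))
				best = t;

		/* take every remaining node at that best distance */
		for (t = 0; best >= 0 && t < NR_NODES; t++)
			if (!(used & (1u << t)) && dist[n][t] == dist[n][best]) {
				demotion[n] |= 1u << t;
				src_nodes[t] |= 1u << n;
			}
	}

	for (n = 0; n < NR_NODES; n++)
		printf("node %d -> 0x%x\n", n, demotion[n]);
	return 0;
}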
On Tue, Mar 29, 2022 at 08:26:05PM +0800, Baolin Wang wrote: Hi Baolin, > Hi Jagdish, > > On 3/29/2022 7:52 PM, Jagdish Gediya wrote: > > The current implementation to identify the demotion > > targets limits some of the opportunities to share > > the demotion targets between multiple source nodes. > > > > Implement a logic to identify the loop in the demotion > > targets such that all the possibilities of demotion can > > be utilized. Don't share the used targets between all > > the nodes, instead create the used targets from scratch > > for each individual node based on for what all node this > > node is a demotion target. This helps to share the demotion > > targets without missing any possible way of demotion. > > > > e.g. with below NUMA topology, where node 0 & 1 are > > cpu + dram nodes, node 2 & 3 are equally slower memory > > only nodes, and node 4 is slowest memory only node, > > > > available: 5 nodes (0-4) > > node 0 cpus: 0 1 > > node 0 size: n MB > > node 0 free: n MB > > node 1 cpus: 2 3 > > node 1 size: n MB > > node 1 free: n MB > > node 2 cpus: > > node 2 size: n MB > > node 2 free: n MB > > node 3 cpus: > > node 3 size: n MB > > node 3 free: n MB > > node 4 cpus: > > node 4 size: n MB > > node 4 free: n MB > > node distances: > > node 0 1 2 3 4 > > 0: 10 20 40 40 80 > > 1: 20 10 40 40 80 > > 2: 40 40 10 40 80 > > 3: 40 40 40 10 80 > > 4: 80 80 80 80 10 > > > > The existing implementation gives below demotion targets, > > > > node demotion_target > > 0 3, 2 > > 1 4 > > 2 X > > 3 X > > 4 X > > > > With this patch applied, below are the demotion targets, > > > > node demotion_target > > 0 3, 2 > > 1 3, 2 > > 2 3 > > 3 4 > > 4 X > > Node 2 and node 3 both are slow memory and have same distance, why node 2 > should demote cold memory to node 3? They should have the same target > demotion node 4, which is the slowest memory node, right? > Current demotion target finding algorithm works based on best distance, as distance between node 2 & 3 is 40 and distance between node 2 & 4 is 80, node 2 demotes to node 3. > > > > e.g. with below NUMA topology, where node 0, 1 & 2 are > > cpu + dram nodes and node 3 is slow memory node, > > > > available: 4 nodes (0-3) > > node 0 cpus: 0 1 > > node 0 size: n MB > > node 0 free: n MB > > node 1 cpus: 2 3 > > node 1 size: n MB > > node 1 free: n MB > > node 2 cpus: 4 5 > > node 2 size: n MB > > node 2 free: n MB > > node 3 cpus: > > node 3 size: n MB > > node 3 free: n MB > > node distances: > > node 0 1 2 3 > > 0: 10 20 20 40 > > 1: 20 10 20 40 > > 2: 20 20 10 40 > > 3: 40 40 40 10 > > > > The existing implementation gives below demotion targets, > > > > node demotion_target > > 0 3 > > 1 X > > 2 X > > 3 X > > > > With this patch applied, below are the demotion targets, > > > > node demotion_target > > 0 3 > > 1 3 > > 2 3 > > 3 X > > Sounds reasonable. > > > > > with below NUMA topology, where node 0 & 2 are cpu + dram > > nodes and node 1 & 3 are slow memory nodes, > > > > available: 4 nodes (0-3) > > node 0 cpus: 0 1 > > node 0 size: n MB > > node 0 free: n MB > > node 1 cpus: > > node 1 size: n MB > > node 1 free: n MB > > node 2 cpus: 2 3 > > node 2 size: n MB > > node 2 free: n MB > > node 3 cpus: > > node 3 size: n MB > > node 3 free: n MB > > node distances: > > node 0 1 2 3 > > 0: 10 40 20 80 > > 1: 40 10 80 80 > > 2: 20 80 10 40 > > 3: 80 80 40 10 > > > > The existing implementation gives below demotion targets, > > > > node demotion_target > > 0 3 > > 1 X > > 2 3 > > 3 X > > If I understand correctly, this is not true. 
The demotion route should be as > below with existing implementation: > node 0 ---> node 1 > node 1 ---> X > node 2 ---> node 3 > node 3 ---> X > Its typo, It should be 0 -> 1, Will correct it in v2. > > > > With this patch applied, below are the demotion targets, > > > > node demotion_target > > 0 1 > > 1 3 > > 2 3 > > 3 X > > > > As it can be seen above, node 3 can be demotion target for node > > 1 but existing implementation doesn't configure it that way. It > > is better to move pages from node 1 to node 3 instead of moving > > it from node 1 to swap. > > Which means node 3 is the slowest memory node? > Node 1 and 3 are equally slower but 1 is near to 0 and 3 is near to 2. Basically you can think of it like node 1 is slow memory logical node near to node 0 and node 3 is slow memory logical node near to node 2. > > > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com> > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> > > --- > > mm/migrate.c | 75 ++++++++++++++++++++++++++++------------------------ > > 1 file changed, 41 insertions(+), 34 deletions(-) > > > > diff --git a/mm/migrate.c b/mm/migrate.c > > index 3d60823afd2d..7ec8d934e706 100644 > > --- a/mm/migrate.c > > +++ b/mm/migrate.c > > @@ -2381,10 +2381,13 @@ static int establish_migrate_target(int node, nodemask_t *used, > > */ > > static void __set_migration_target_nodes(void) > > { > > - nodemask_t next_pass = NODE_MASK_NONE; > > - nodemask_t this_pass = NODE_MASK_NONE; > > nodemask_t used_targets = NODE_MASK_NONE; > > int node, best_distance; > > + nodemask_t *src_nodes; > > + > > + src_nodes = kcalloc(nr_node_ids, sizeof(nodemask_t), GFP_KERNEL); > > + if (!src_nodes) > > + return; > > /* > > * Avoid any oddities like cycles that could occur > > @@ -2393,29 +2396,39 @@ static void __set_migration_target_nodes(void) > > */ > > disable_all_migrate_targets(); > > - /* > > - * Allocations go close to CPUs, first. Assume that > > - * the migration path starts at the nodes with CPUs. > > - */ > > - next_pass = node_states[N_CPU]; > > -again: > > - this_pass = next_pass; > > - next_pass = NODE_MASK_NONE; > > - /* > > - * To avoid cycles in the migration "graph", ensure > > - * that migration sources are not future targets by > > - * setting them in 'used_targets'. Do this only > > - * once per pass so that multiple source nodes can > > - * share a target node. > > - * > > - * 'used_targets' will become unavailable in future > > - * passes. This limits some opportunities for > > - * multiple source nodes to share a destination. > > - */ > > - nodes_or(used_targets, used_targets, this_pass); > > + for_each_online_node(node) { > > + int tmp_node; > > - for_each_node_mask(node, this_pass) { > > best_distance = -1; > > + used_targets = NODE_MASK_NONE; > > + > > + /* > > + * Avoid adding same node as the demotion target. > > + */ > > + node_set(node, used_targets); > > + > > + /* > > + * Add CPU NUMA nodes to the used target list so that it > > + * won't be considered a demotion target. > > + */ > > + nodes_or(used_targets, used_targets, node_states[N_CPU]); > > + > > + /* > > + * Add all nodes that has appeared as source node of demotion > > + * for this target node. > > + * > > + * To avoid cycles in the migration "graph", ensure > > + * that migration sources are not future targets by > > + * setting them in 'used_targets'. 
> > + */ > > + for_each_node_mask(tmp_node, src_nodes[node]) > > + nodes_or(used_targets, used_targets, src_nodes[tmp_node]); > > + > > + /* > > + * Now update the demotion src nodes with other nodes in graph > > + * which got computed above. > > + */ > > + nodes_or(src_nodes[node], src_nodes[node], used_targets); > > /* > > * Try to set up the migration path for the node, and the target > > @@ -2434,20 +2447,14 @@ static void __set_migration_target_nodes(void) > > best_distance = node_distance(node, target_node); > > /* > > - * Visit targets from this pass in the next pass. > > - * Eventually, every node will have been part of > > - * a pass, and will become set in 'used_targets'. > > + * Add this node in the src_nodes list so that we can > > + * detect the looping. > > */ > > - node_set(target_node, next_pass); > > + node_set(node, src_nodes[target_node]); > > } while (1); > > } > > - /* > > - * 'next_pass' contains nodes which became migration > > - * targets in this pass. Make additional passes until > > - * no more migrations targets are available. > > - */ > > - if (!nodes_empty(next_pass)) > > - goto again; > > + > > + kfree(src_nodes); > > } > > /*
On 3/29/22 04:52, Jagdish Gediya wrote:
> The current implementation to identify the demotion
> targets limits some of the opportunities to share
> the demotion targets between multiple source nodes.

This changelog is a bit unsatisfying. It basically says: the current
code isn't working, throw some more code at the problem.

I'd love to see some more information about *why* the current code
doesn't work. Is it purely a bug or was it mis-designed? I actually
wrote it intending for it to handle cases like you describe while not
leaving lots of nodes without demotion targets.
On Tue, Mar 29, 2022 at 07:31:12AM -0700, Dave Hansen wrote:
> On 3/29/22 04:52, Jagdish Gediya wrote:
> > The current implementation to identify the demotion
> > targets limits some of the opportunities to share
> > the demotion targets between multiple source nodes.
>
> This changelog is a bit unsatisfying. It basically says: the current
> code isn't working, throw some more code at the problem.

The current code works, but it doesn't utilize all the possibilities
for demotion. One of the comments in the current code says that
'used_targets will become unavailable in future passes. This limits
some opportunities for multiple source nodes to share a destination.'
This patch removes that limitation.

The current limitation is that a single used_targets nodemask is
shared between all the nodes. This patch prepares used_targets from
scratch for each individual node, and then avoids the looping based on
which demotion targets are not possible for that particular node,
using the data accumulated in the newly defined src_nodes array.

For example, with the below NUMA topology, where nodes 0, 1 & 2 are
cpu + dram nodes and node 3 is a slow memory node,

node   0   1   2   3
  0:  10  20  20  40
  1:  20  10  20  40
  2:  20  20  10  40
  3:  40  40  40  10

once node 3 is utilized for node 0, it can not be shared for nodes 1 &
2, because in the current code used_targets will already have it set
when we look for demotion targets for 1 & 2.

> I'd love to see some more information about *why* the current code
> doesn't work. Is it purely a bug or was it mis-designed?

I think the design was intentional, because as per the comment in the
current code, it was known that there are limits to sharing a node as
a demotion target.

> I actually wrote it intending for it to handle cases like you describe
> while not leaving lots of nodes without demotion targets.

If you see the examples I have given in the commit message, the
current code misses many opportunities for demotion, so the current
way of avoiding the looping is not the best, as explained above; that
is where this patch is required.
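To make the limitation concrete, below is a minimal userspace sketch (not kernel code) of the existing pass-based selection, applied to the 4-node topology above. It assumes plain bitmasks instead of nodemask_t, picks only targets at the best distance, and ignores the tie-breaking preferences in find_next_best_node(); the point it illustrates is that a target claimed by one source during a pass is consumed for the remaining sources of that same pass, so only node 0 ends up with a demotion target.

/*
 * Hypothetical userspace sketch, not kernel code: a simplified model
 * of the existing pass-based selection for the 4-node topology above
 * (nodes 0-2 are CPU+DRAM, node 3 is slow memory).
 */
#include <stdio.h>

#define NR_NODES	4
#define CPU_NODES	0x7u	/* nodes 0, 1 and 2 have CPUs */

static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 20, 20, 40 },
	{ 20, 10, 20, 40 },
	{ 20, 20, 10, 40 },
	{ 40, 40, 40, 10 },
};

int main(void)
{
	unsigned int used = 0;			/* shared across all nodes */
	unsigned int this_pass = CPU_NODES;	/* first pass: CPU nodes */
	unsigned int next_pass;
	unsigned int demotion[NR_NODES] = { 0 };
	int n, t;

	while (this_pass) {
		next_pass = 0;
		used |= this_pass;	/* sources can't become targets */

		for (n = 0; n < NR_NODES; n++) {
			int best = -1;

			if (!(this_pass & (1u << n)))
				continue;
			for (t = 0; t < NR_NODES; t++)
				if (!(used & (1u << t)) &&
				    (best < 0 || dist[n][t] < dist[n][best]))
					best = t;
			for (t = 0; best >= 0 && t < NR_NODES; t++)
				if (!(used & (1u << t)) &&
				    dist[n][t] == dist[n][best]) {
					demotion[n] |= 1u << t;
					/*
					 * The target is consumed for the
					 * remaining sources of this pass too.
					 */
					used |= 1u << t;
					next_pass |= 1u << t;
				}
		}
		this_pass = next_pass;
	}

	for (n = 0; n < NR_NODES; n++)
		printf("node %d -> 0x%x\n", n, demotion[n]);
	return 0;
}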
On 3/29/22 09:46, Jagdish Gediya wrote:
>> I'd love to see some more information about *why* the current code
>> doesn't work. Is it purely a bug or was it mis-designed?
> I think the design was intentional, because as per the comment in the
> current code, it was known that there are limits to sharing a node as
> a demotion target.

Ahh, that makes sense. You've got some slow memory that's not really
associated closely enough with an initiator node to be used exclusively
for demotions from that node. The original code was really intended for
demotions when the system is made up of relatively identical building
blocks, like one CPU with two kinds of memory attached.

I think what you're doing here makes sense. The only thing I'll say is
that, at some point, we need to stop messing with the demotion order
that the kernel comes up with. It's not going to be able to handle
*everything* perfectly. We're probably not at the point that we should
stop enhancing the kernel-generated demotion order, but let's at least
keep in mind that we should stop eventually.
On 3/29/2022 10:04 PM, Jagdish Gediya wrote: > On Tue, Mar 29, 2022 at 08:26:05PM +0800, Baolin Wang wrote: > Hi Baolin, >> Hi Jagdish, >> >> On 3/29/2022 7:52 PM, Jagdish Gediya wrote: >>> The current implementation to identify the demotion >>> targets limits some of the opportunities to share >>> the demotion targets between multiple source nodes. >>> >>> Implement a logic to identify the loop in the demotion >>> targets such that all the possibilities of demotion can >>> be utilized. Don't share the used targets between all >>> the nodes, instead create the used targets from scratch >>> for each individual node based on for what all node this >>> node is a demotion target. This helps to share the demotion >>> targets without missing any possible way of demotion. >>> >>> e.g. with below NUMA topology, where node 0 & 1 are >>> cpu + dram nodes, node 2 & 3 are equally slower memory >>> only nodes, and node 4 is slowest memory only node, >>> >>> available: 5 nodes (0-4) >>> node 0 cpus: 0 1 >>> node 0 size: n MB >>> node 0 free: n MB >>> node 1 cpus: 2 3 >>> node 1 size: n MB >>> node 1 free: n MB >>> node 2 cpus: >>> node 2 size: n MB >>> node 2 free: n MB >>> node 3 cpus: >>> node 3 size: n MB >>> node 3 free: n MB >>> node 4 cpus: >>> node 4 size: n MB >>> node 4 free: n MB >>> node distances: >>> node 0 1 2 3 4 >>> 0: 10 20 40 40 80 >>> 1: 20 10 40 40 80 >>> 2: 40 40 10 40 80 >>> 3: 40 40 40 10 80 >>> 4: 80 80 80 80 10 >>> >>> The existing implementation gives below demotion targets, >>> >>> node demotion_target >>> 0 3, 2 >>> 1 4 >>> 2 X >>> 3 X >>> 4 X >>> >>> With this patch applied, below are the demotion targets, >>> >>> node demotion_target >>> 0 3, 2 >>> 1 3, 2 >>> 2 3 >>> 3 4 >>> 4 X >> >> Node 2 and node 3 both are slow memory and have same distance, why node 2 >> should demote cold memory to node 3? They should have the same target >> demotion node 4, which is the slowest memory node, right? >> > Current demotion target finding algorithm works based on best distance, as distance between node 2 & 3 is 40 and distance between node 2 & 4 is 80, node 2 demotes to node 3. If node 2 can demote to node 3, which means node 3's memory is colder than node 2, right? The accessing time of node 3 should be larger than node 2, then we can demote colder memory to node 3 from node 2. But node 2 and node 3 are same memory type and have same distance, the accessing time of node 2 and node 3 should be same too, so why add so many page migration between node 2 and node 3? I'm still not sure the benefits. Huang Ying and Dave, how do you think about this demotion targets? >>> >>> e.g. with below NUMA topology, where node 0, 1 & 2 are >>> cpu + dram nodes and node 3 is slow memory node, >>> >>> available: 4 nodes (0-3) >>> node 0 cpus: 0 1 >>> node 0 size: n MB >>> node 0 free: n MB >>> node 1 cpus: 2 3 >>> node 1 size: n MB >>> node 1 free: n MB >>> node 2 cpus: 4 5 >>> node 2 size: n MB >>> node 2 free: n MB >>> node 3 cpus: >>> node 3 size: n MB >>> node 3 free: n MB >>> node distances: >>> node 0 1 2 3 >>> 0: 10 20 20 40 >>> 1: 20 10 20 40 >>> 2: 20 20 10 40 >>> 3: 40 40 40 10 >>> >>> The existing implementation gives below demotion targets, >>> >>> node demotion_target >>> 0 3 >>> 1 X >>> 2 X >>> 3 X >>> >>> With this patch applied, below are the demotion targets, >>> >>> node demotion_target >>> 0 3 >>> 1 3 >>> 2 3 >>> 3 X >> >> Sounds reasonable. 
>> >>> >>> with below NUMA topology, where node 0 & 2 are cpu + dram >>> nodes and node 1 & 3 are slow memory nodes, >>> >>> available: 4 nodes (0-3) >>> node 0 cpus: 0 1 >>> node 0 size: n MB >>> node 0 free: n MB >>> node 1 cpus: >>> node 1 size: n MB >>> node 1 free: n MB >>> node 2 cpus: 2 3 >>> node 2 size: n MB >>> node 2 free: n MB >>> node 3 cpus: >>> node 3 size: n MB >>> node 3 free: n MB >>> node distances: >>> node 0 1 2 3 >>> 0: 10 40 20 80 >>> 1: 40 10 80 80 >>> 2: 20 80 10 40 >>> 3: 80 80 40 10 >>> >>> The existing implementation gives below demotion targets, >>> >>> node demotion_target >>> 0 3 >>> 1 X >>> 2 3 >>> 3 X >> >> If I understand correctly, this is not true. The demotion route should be as >> below with existing implementation: >> node 0 ---> node 1 >> node 1 ---> X >> node 2 ---> node 3 >> node 3 ---> X >> > Its typo, It should be 0 -> 1, Will correct it in v2. >>> >>> With this patch applied, below are the demotion targets, >>> >>> node demotion_target >>> 0 1 >>> 1 3 >>> 2 3 >>> 3 X >>> >>> As it can be seen above, node 3 can be demotion target for node >>> 1 but existing implementation doesn't configure it that way. It >>> is better to move pages from node 1 to node 3 instead of moving >>> it from node 1 to swap. >> >> Which means node 3 is the slowest memory node? >> > Node 1 and 3 are equally slower but 1 is near to 0 and 3 is near to 2. Basically you can think of it like node 1 is slow memory logical node near to node 0 and node 3 is slow memory logical node near to node 2. OK.
Hi, Jagdish, Jagdish Gediya <jvgediya@linux.ibm.com> writes: > The current implementation to identify the demotion > targets limits some of the opportunities to share > the demotion targets between multiple source nodes. Yes. It sounds reasonable to share demotion targets among multiple source nodes. One question, are example machines below are real hardware now or in near future? Or you just think they are possible? And, before going into the implementation details, I think that we can discuss the perfect demotion order firstly. > Implement a logic to identify the loop in the demotion > targets such that all the possibilities of demotion can > be utilized. Don't share the used targets between all > the nodes, instead create the used targets from scratch > for each individual node based on for what all node this > node is a demotion target. This helps to share the demotion > targets without missing any possible way of demotion. > > e.g. with below NUMA topology, where node 0 & 1 are > cpu + dram nodes, node 2 & 3 are equally slower memory > only nodes, and node 4 is slowest memory only node, > > available: 5 nodes (0-4) > node 0 cpus: 0 1 > node 0 size: n MB > node 0 free: n MB > node 1 cpus: 2 3 > node 1 size: n MB > node 1 free: n MB > node 2 cpus: > node 2 size: n MB > node 2 free: n MB > node 3 cpus: > node 3 size: n MB > node 3 free: n MB > node 4 cpus: > node 4 size: n MB > node 4 free: n MB > node distances: > node 0 1 2 3 4 > 0: 10 20 40 40 80 > 1: 20 10 40 40 80 > 2: 40 40 10 40 80 > 3: 40 40 40 10 80 > 4: 80 80 80 80 10 > > The existing implementation gives below demotion targets, > > node demotion_target > 0 3, 2 > 1 4 > 2 X > 3 X > 4 X > > With this patch applied, below are the demotion targets, > > node demotion_target > 0 3, 2 > 1 3, 2 > 2 3 > 3 4 > 4 X For such machine, I think the perfect demotion order is, node demotion_target 0 2, 3 1 2, 3 2 4 3 4 4 X > e.g. with below NUMA topology, where node 0, 1 & 2 are > cpu + dram nodes and node 3 is slow memory node, > > available: 4 nodes (0-3) > node 0 cpus: 0 1 > node 0 size: n MB > node 0 free: n MB > node 1 cpus: 2 3 > node 1 size: n MB > node 1 free: n MB > node 2 cpus: 4 5 > node 2 size: n MB > node 2 free: n MB > node 3 cpus: > node 3 size: n MB > node 3 free: n MB > node distances: > node 0 1 2 3 > 0: 10 20 20 40 > 1: 20 10 20 40 > 2: 20 20 10 40 > 3: 40 40 40 10 > > The existing implementation gives below demotion targets, > > node demotion_target > 0 3 > 1 X > 2 X > 3 X > > With this patch applied, below are the demotion targets, > > node demotion_target > 0 3 > 1 3 > 2 3 > 3 X I think this is perfect already. > with below NUMA topology, where node 0 & 2 are cpu + dram > nodes and node 1 & 3 are slow memory nodes, > > available: 4 nodes (0-3) > node 0 cpus: 0 1 > node 0 size: n MB > node 0 free: n MB > node 1 cpus: > node 1 size: n MB > node 1 free: n MB > node 2 cpus: 2 3 > node 2 size: n MB > node 2 free: n MB > node 3 cpus: > node 3 size: n MB > node 3 free: n MB > node distances: > node 0 1 2 3 > 0: 10 40 20 80 > 1: 40 10 80 80 > 2: 20 80 10 40 > 3: 80 80 40 10 > > The existing implementation gives below demotion targets, > > node demotion_target > 0 3 > 1 X > 2 3 > 3 X Should be as below as you said in another email of the thread. node demotion_target 0 1 1 X 2 3 3 X > With this patch applied, below are the demotion targets, > > node demotion_target > 0 1 > 1 3 > 2 3 > 3 X The original demotion order looks better for me. 1 and 3 are at the same level from the perspective of the whole system. 
Another example, node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near node 0, available: 3 nodes (0-2) node 0 cpus: 0 1 node 0 size: n MB node 0 free: n MB node 1 cpus: node 1 size: n MB node 1 free: n MB node 2 cpus: 2 3 node 2 size: n MB node 2 free: n MB node distances: node 0 1 2 0: 10 40 20 1: 40 10 80 2: 20 80 10 Demotion order 1: node demotion_target 0 1 1 X 2 X Demotion order 2: node demotion_target 0 1 1 X 2 1 Demotion order 2 looks better. But I think that demotion order 1 makes some sense too (like node reclaim mode). It seems that, If a demotion target has same distance to several current demotion sources, the demotion target should be shared among the demotion sources. And as Dave pointed out, we may eventually need a mechanism to override the default demotion order generated automatically. So we can just use some simple mechanism that makes sense in most cases in kernel automatically. And leave the best demotion for users to some customization mechanism. > As it can be seen above, node 3 can be demotion target for node > 1 but existing implementation doesn't configure it that way. It > is better to move pages from node 1 to node 3 instead of moving > it from node 1 to swap. > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Best Regards, Huang, Ying [snip]
Baolin Wang <baolin.wang@linux.alibaba.com> writes: > On 3/29/2022 10:04 PM, Jagdish Gediya wrote: >> On Tue, Mar 29, 2022 at 08:26:05PM +0800, Baolin Wang wrote: >> Hi Baolin, >>> Hi Jagdish, >>> >>> On 3/29/2022 7:52 PM, Jagdish Gediya wrote: >>>> The current implementation to identify the demotion >>>> targets limits some of the opportunities to share >>>> the demotion targets between multiple source nodes. >>>> >>>> Implement a logic to identify the loop in the demotion >>>> targets such that all the possibilities of demotion can >>>> be utilized. Don't share the used targets between all >>>> the nodes, instead create the used targets from scratch >>>> for each individual node based on for what all node this >>>> node is a demotion target. This helps to share the demotion >>>> targets without missing any possible way of demotion. >>>> >>>> e.g. with below NUMA topology, where node 0 & 1 are >>>> cpu + dram nodes, node 2 & 3 are equally slower memory >>>> only nodes, and node 4 is slowest memory only node, >>>> >>>> available: 5 nodes (0-4) >>>> node 0 cpus: 0 1 >>>> node 0 size: n MB >>>> node 0 free: n MB >>>> node 1 cpus: 2 3 >>>> node 1 size: n MB >>>> node 1 free: n MB >>>> node 2 cpus: >>>> node 2 size: n MB >>>> node 2 free: n MB >>>> node 3 cpus: >>>> node 3 size: n MB >>>> node 3 free: n MB >>>> node 4 cpus: >>>> node 4 size: n MB >>>> node 4 free: n MB >>>> node distances: >>>> node 0 1 2 3 4 >>>> 0: 10 20 40 40 80 >>>> 1: 20 10 40 40 80 >>>> 2: 40 40 10 40 80 >>>> 3: 40 40 40 10 80 >>>> 4: 80 80 80 80 10 >>>> >>>> The existing implementation gives below demotion targets, >>>> >>>> node demotion_target >>>> 0 3, 2 >>>> 1 4 >>>> 2 X >>>> 3 X >>>> 4 X >>>> >>>> With this patch applied, below are the demotion targets, >>>> >>>> node demotion_target >>>> 0 3, 2 >>>> 1 3, 2 >>>> 2 3 >>>> 3 4 >>>> 4 X >>> >>> Node 2 and node 3 both are slow memory and have same distance, why node 2 >>> should demote cold memory to node 3? They should have the same target >>> demotion node 4, which is the slowest memory node, right? >>> >> Current demotion target finding algorithm works based on best distance, as distance between node 2 & 3 is 40 and distance between node 2 & 4 is 80, node 2 demotes to node 3. > > If node 2 can demote to node 3, which means node 3's memory is colder > than node 2, right? The accessing time of node 3 should be larger than > node 2, then we can demote colder memory to node 3 from node 2. > > But node 2 and node 3 are same memory type and have same distance, the > accessing time of node 2 and node 3 should be same too, so why add so > many page migration between node 2 and node 3? I'm still not sure the > benefits. > > Huang Ying and Dave, how do you think about this demotion targets? Yes. I think the demotion target of 2 should be 4, as in my another email in this thread. Demoting from 2 to 3 makes no sense. Best Regards, Huang, Ying [snip]
Hi Huang, On Wed, Mar 30, 2022 at 02:46:51PM +0800, Huang, Ying wrote: > Hi, Jagdish, > > Jagdish Gediya <jvgediya@linux.ibm.com> writes: > > > The current implementation to identify the demotion > > targets limits some of the opportunities to share > > the demotion targets between multiple source nodes. > > Yes. It sounds reasonable to share demotion targets among multiple > source nodes. > > One question, are example machines below are real hardware now or in > near future? Or you just think they are possible? They are not real hardware right now, they are the future possibilities. > And, before going into the implementation details, I think that we can > discuss the perfect demotion order firstly. > > > Implement a logic to identify the loop in the demotion > > targets such that all the possibilities of demotion can > > be utilized. Don't share the used targets between all > > the nodes, instead create the used targets from scratch > > for each individual node based on for what all node this > > node is a demotion target. This helps to share the demotion > > targets without missing any possible way of demotion. > > > > e.g. with below NUMA topology, where node 0 & 1 are > > cpu + dram nodes, node 2 & 3 are equally slower memory > > only nodes, and node 4 is slowest memory only node, > > > > available: 5 nodes (0-4) > > node 0 cpus: 0 1 > > node 0 size: n MB > > node 0 free: n MB > > node 1 cpus: 2 3 > > node 1 size: n MB > > node 1 free: n MB > > node 2 cpus: > > node 2 size: n MB > > node 2 free: n MB > > node 3 cpus: > > node 3 size: n MB > > node 3 free: n MB > > node 4 cpus: > > node 4 size: n MB > > node 4 free: n MB > > node distances: > > node 0 1 2 3 4 > > 0: 10 20 40 40 80 > > 1: 20 10 40 40 80 > > 2: 40 40 10 40 80 > > 3: 40 40 40 10 80 > > 4: 80 80 80 80 10 > > > > The existing implementation gives below demotion targets, > > > > node demotion_target > > 0 3, 2 > > 1 4 > > 2 X > > 3 X > > 4 X > > > > With this patch applied, below are the demotion targets, > > > > node demotion_target > > 0 3, 2 > > 1 3, 2 > > 2 3 > > 3 4 > > 4 X > > For such machine, I think the perfect demotion order is, > > node demotion_target > 0 2, 3 > 1 2, 3 > 2 4 > 3 4 > 4 X Current implementation works based on the best distance algorithm, this patch doesn't change it either, so based on the distance, the demotion list is what I have mentioned. I understand 4 is better target for 2 but as per the mentioned numa distances and current algorithm, it doesn't get configured like that in the kernel. > > e.g. with below NUMA topology, where node 0, 1 & 2 are > > cpu + dram nodes and node 3 is slow memory node, > > > > available: 4 nodes (0-3) > > node 0 cpus: 0 1 > > node 0 size: n MB > > node 0 free: n MB > > node 1 cpus: 2 3 > > node 1 size: n MB > > node 1 free: n MB > > node 2 cpus: 4 5 > > node 2 size: n MB > > node 2 free: n MB > > node 3 cpus: > > node 3 size: n MB > > node 3 free: n MB > > node distances: > > node 0 1 2 3 > > 0: 10 20 20 40 > > 1: 20 10 20 40 > > 2: 20 20 10 40 > > 3: 40 40 40 10 > > > > The existing implementation gives below demotion targets, > > > > node demotion_target > > 0 3 > > 1 X > > 2 X > > 3 X > > > > With this patch applied, below are the demotion targets, > > > > node demotion_target > > 0 3 > > 1 3 > > 2 3 > > 3 X > > I think this is perfect already. 
> > > with below NUMA topology, where node 0 & 2 are cpu + dram > > nodes and node 1 & 3 are slow memory nodes, > > > > available: 4 nodes (0-3) > > node 0 cpus: 0 1 > > node 0 size: n MB > > node 0 free: n MB > > node 1 cpus: > > node 1 size: n MB > > node 1 free: n MB > > node 2 cpus: 2 3 > > node 2 size: n MB > > node 2 free: n MB > > node 3 cpus: > > node 3 size: n MB > > node 3 free: n MB > > node distances: > > node 0 1 2 3 > > 0: 10 40 20 80 > > 1: 40 10 80 80 > > 2: 20 80 10 40 > > 3: 80 80 40 10 > > > > The existing implementation gives below demotion targets, > > > > node demotion_target > > 0 3 > > 1 X > > 2 3 > > 3 X > > Should be as below as you said in another email of the thread. > > node demotion_target > 0 1 > 1 X > 2 3 > 3 X > > > With this patch applied, below are the demotion targets, > > > > node demotion_target > > 0 1 > > 1 3 > > 2 3 > > 3 X > > The original demotion order looks better for me. 1 and 3 are at the > same level from the perspective of the whole system. > > Another example, node 0 & 2 are cpu + dram nodes and node 1 are slow > memory node near node 0, > > available: 3 nodes (0-2) > node 0 cpus: 0 1 > node 0 size: n MB > node 0 free: n MB > node 1 cpus: > node 1 size: n MB > node 1 free: n MB > node 2 cpus: 2 3 > node 2 size: n MB > node 2 free: n MB > node distances: > node 0 1 2 > 0: 10 40 20 > 1: 40 10 80 > 2: 20 80 10 > > > Demotion order 1: > > node demotion_target > 0 1 > 1 X > 2 X > > Demotion order 2: > > node demotion_target > 0 1 > 1 X > 2 1 > > Demotion order 2 looks better. But I think that demotion order 1 makes > some sense too (like node reclaim mode). > > It seems that, > > If a demotion target has same distance to several current demotion > sources, the demotion target should be shared among the demotion > sources. Yes, and that is where this patch is useful. > And as Dave pointed out, we may eventually need a mechanism to override > the default demotion order generated automatically. So we can just use > some simple mechanism that makes sense in most cases in kernel > automatically. And leave the best demotion for users to some > customization mechanism. Yes, We need a mechanism to override the default demotion list prepared by the current implementation. PowerVM can have a cpu less dram node as well, which infact are not the right target for demotion because it is the fast memory. We need to distinguish between memory tiers so that slow memory can be utilized for the demotion even when there are fast memory only numa nodes. I think we may see implementations in future to override the default behavior e.g. when systems have both fast only and slow only memory nodes and in that case it will make sense to demote to slow memory only node even if it is far, but this patch is to put the current implementation to its intended design 'best distance based demotion targets'. > > As it can be seen above, node 3 can be demotion target for node > > 1 but existing implementation doesn't configure it that way. It > > is better to move pages from node 1 to node 3 instead of moving > > it from node 1 to swap. > > > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com> > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> > > Best Regards, > Huang, Ying > > [snip] > Best Regards, Jagdish
"Huang, Ying" <ying.huang@intel.com> writes: > Hi, Jagdish, > > Jagdish Gediya <jvgediya@linux.ibm.com> writes: > ... >> e.g. with below NUMA topology, where node 0 & 1 are >> cpu + dram nodes, node 2 & 3 are equally slower memory >> only nodes, and node 4 is slowest memory only node, >> >> available: 5 nodes (0-4) >> node 0 cpus: 0 1 >> node 0 size: n MB >> node 0 free: n MB >> node 1 cpus: 2 3 >> node 1 size: n MB >> node 1 free: n MB >> node 2 cpus: >> node 2 size: n MB >> node 2 free: n MB >> node 3 cpus: >> node 3 size: n MB >> node 3 free: n MB >> node 4 cpus: >> node 4 size: n MB >> node 4 free: n MB >> node distances: >> node 0 1 2 3 4 >> 0: 10 20 40 40 80 >> 1: 20 10 40 40 80 >> 2: 40 40 10 40 80 >> 3: 40 40 40 10 80 >> 4: 80 80 80 80 10 >> >> The existing implementation gives below demotion targets, >> >> node demotion_target >> 0 3, 2 >> 1 4 >> 2 X >> 3 X >> 4 X >> >> With this patch applied, below are the demotion targets, >> >> node demotion_target >> 0 3, 2 >> 1 3, 2 >> 2 3 >> 3 4 >> 4 X > > For such machine, I think the perfect demotion order is, > > node demotion_target > 0 2, 3 > 1 2, 3 > 2 4 > 3 4 > 4 X I guess the "equally slow nodes" is a confusing definition here. Now if the system consists of 2 1GB equally slow memory and the firmware doesn't want to differentiate between them, firmware can present a single NUMA node with 2GB capacity? The fact that we are finding two NUMA nodes is a hint that there is some difference between these two memory devices. This is also captured by the fact that the distance between 2 and 3 is 40 and not 10. For that specific topology where the distance between 2 and 3 is 40 and 2 and 4 is 80, the demotion target derived by the new code is better right? ... -aneesh
Jagdish Gediya <jvgediya@linux.ibm.com> writes: > Hi Huang, > > On Wed, Mar 30, 2022 at 02:46:51PM +0800, Huang, Ying wrote: >> Hi, Jagdish, >> >> Jagdish Gediya <jvgediya@linux.ibm.com> writes: >> >> > The current implementation to identify the demotion >> > targets limits some of the opportunities to share >> > the demotion targets between multiple source nodes. >> >> Yes. It sounds reasonable to share demotion targets among multiple >> source nodes. >> >> One question, are example machines below are real hardware now or in >> near future? Or you just think they are possible? > > They are not real hardware right now, they are the future possibilities. > >> And, before going into the implementation details, I think that we can >> discuss the perfect demotion order firstly. >> >> > Implement a logic to identify the loop in the demotion >> > targets such that all the possibilities of demotion can >> > be utilized. Don't share the used targets between all >> > the nodes, instead create the used targets from scratch >> > for each individual node based on for what all node this >> > node is a demotion target. This helps to share the demotion >> > targets without missing any possible way of demotion. >> > >> > e.g. with below NUMA topology, where node 0 & 1 are >> > cpu + dram nodes, node 2 & 3 are equally slower memory >> > only nodes, and node 4 is slowest memory only node, >> > >> > available: 5 nodes (0-4) >> > node 0 cpus: 0 1 >> > node 0 size: n MB >> > node 0 free: n MB >> > node 1 cpus: 2 3 >> > node 1 size: n MB >> > node 1 free: n MB >> > node 2 cpus: >> > node 2 size: n MB >> > node 2 free: n MB >> > node 3 cpus: >> > node 3 size: n MB >> > node 3 free: n MB >> > node 4 cpus: >> > node 4 size: n MB >> > node 4 free: n MB >> > node distances: >> > node 0 1 2 3 4 >> > 0: 10 20 40 40 80 >> > 1: 20 10 40 40 80 >> > 2: 40 40 10 40 80 >> > 3: 40 40 40 10 80 >> > 4: 80 80 80 80 10 >> > >> > The existing implementation gives below demotion targets, >> > >> > node demotion_target >> > 0 3, 2 >> > 1 4 >> > 2 X >> > 3 X >> > 4 X >> > >> > With this patch applied, below are the demotion targets, >> > >> > node demotion_target >> > 0 3, 2 >> > 1 3, 2 >> > 2 3 >> > 3 4 >> > 4 X >> >> For such machine, I think the perfect demotion order is, >> >> node demotion_target >> 0 2, 3 >> 1 2, 3 >> 2 4 >> 3 4 >> 4 X > > Current implementation works based on the best distance algorithm, > this patch doesn't change it either, so based on the distance, the > demotion list is what I have mentioned. I understand 4 is better > target for 2 but as per the mentioned numa distances and current > algorithm, it doesn't get configured like that in the kernel. If we agree on the perfect demotion order, can we revise the current algorithm a little to make it happen? Is the following solution possible? 1st round: sources: 0, 1 exclude targets: 0, 1 result: 0 - > 2,3; 1 -> 2,3 2nd round: sources: 2, 3 exclude targets: 0, 1, 2, 3 result: 2 -> 4; 3 -> 4 3rd round: sources: 4 exclude targets: 0, 1, 2, 3, 4 result: 4 -> X That is, in each round, we exclude the sources and targets in the previous rounds as demotion targets. Compared with the original algorithm, we don't exclude targets used in the current round, but we will exclude them in the next round. If this doesn't work, can you think about whether it is possible to find some way? >> > e.g. 
with below NUMA topology, where node 0, 1 & 2 are >> > cpu + dram nodes and node 3 is slow memory node, >> > >> > available: 4 nodes (0-3) >> > node 0 cpus: 0 1 >> > node 0 size: n MB >> > node 0 free: n MB >> > node 1 cpus: 2 3 >> > node 1 size: n MB >> > node 1 free: n MB >> > node 2 cpus: 4 5 >> > node 2 size: n MB >> > node 2 free: n MB >> > node 3 cpus: >> > node 3 size: n MB >> > node 3 free: n MB >> > node distances: >> > node 0 1 2 3 >> > 0: 10 20 20 40 >> > 1: 20 10 20 40 >> > 2: 20 20 10 40 >> > 3: 40 40 40 10 >> > >> > The existing implementation gives below demotion targets, >> > >> > node demotion_target >> > 0 3 >> > 1 X >> > 2 X >> > 3 X >> > >> > With this patch applied, below are the demotion targets, >> > >> > node demotion_target >> > 0 3 >> > 1 3 >> > 2 3 >> > 3 X >> >> I think this is perfect already. >> >> > with below NUMA topology, where node 0 & 2 are cpu + dram >> > nodes and node 1 & 3 are slow memory nodes, >> > >> > available: 4 nodes (0-3) >> > node 0 cpus: 0 1 >> > node 0 size: n MB >> > node 0 free: n MB >> > node 1 cpus: >> > node 1 size: n MB >> > node 1 free: n MB >> > node 2 cpus: 2 3 >> > node 2 size: n MB >> > node 2 free: n MB >> > node 3 cpus: >> > node 3 size: n MB >> > node 3 free: n MB >> > node distances: >> > node 0 1 2 3 >> > 0: 10 40 20 80 >> > 1: 40 10 80 80 >> > 2: 20 80 10 40 >> > 3: 80 80 40 10 >> > >> > The existing implementation gives below demotion targets, >> > >> > node demotion_target >> > 0 3 >> > 1 X >> > 2 3 >> > 3 X >> >> Should be as below as you said in another email of the thread. >> >> node demotion_target >> 0 1 >> 1 X >> 2 3 >> 3 X >> >> > With this patch applied, below are the demotion targets, >> > >> > node demotion_target >> > 0 1 >> > 1 3 >> > 2 3 >> > 3 X >> >> The original demotion order looks better for me. 1 and 3 are at the >> same level from the perspective of the whole system. >> >> Another example, node 0 & 2 are cpu + dram nodes and node 1 are slow >> memory node near node 0, >> >> available: 3 nodes (0-2) >> node 0 cpus: 0 1 >> node 0 size: n MB >> node 0 free: n MB >> node 1 cpus: >> node 1 size: n MB >> node 1 free: n MB >> node 2 cpus: 2 3 >> node 2 size: n MB >> node 2 free: n MB >> node distances: >> node 0 1 2 >> 0: 10 40 20 >> 1: 40 10 80 >> 2: 20 80 10 >> >> >> Demotion order 1: >> >> node demotion_target >> 0 1 >> 1 X >> 2 X >> >> Demotion order 2: >> >> node demotion_target >> 0 1 >> 1 X >> 2 1 >> >> Demotion order 2 looks better. But I think that demotion order 1 makes >> some sense too (like node reclaim mode). >> >> It seems that, >> >> If a demotion target has same distance to several current demotion >> sources, the demotion target should be shared among the demotion >> sources. > > Yes, and that is where this patch is useful. > >> And as Dave pointed out, we may eventually need a mechanism to override >> the default demotion order generated automatically. So we can just use >> some simple mechanism that makes sense in most cases in kernel >> automatically. And leave the best demotion for users to some >> customization mechanism. > > Yes, We need a mechanism to override the default demotion list prepared > by the current implementation. PowerVM can have a cpu less dram node > as well, which infact are not the right target for demotion because > it is the fast memory. We need to distinguish between memory tiers > so that slow memory can be utilized for the demotion even when there > are fast memory only numa nodes. 
> > I think we may see implementations in future to override the default > behavior e.g. when systems have both fast only and slow only memory > nodes and in that case it will make sense to demote to slow memory > only node even if it is far, but this patch is to put the current > implementation to its intended design 'best distance based demotion > targets'. If this is a real problem for you already. How about implement it after this work has been done? Best Regards, Huang, Ying >> > As it can be seen above, node 3 can be demotion target for node >> > 1 but existing implementation doesn't configure it that way. It >> > is better to move pages from node 1 to node 3 instead of moving >> > it from node 1 to swap. >> > >> > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com> >> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> >> >> Best Regards, >> Huang, Ying >> >> [snip] >> > Best Regards, > Jagdish
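Below is a minimal userspace sketch of the round-based idea proposed earlier in this message: each round excludes the sources and targets of all previous rounds, while targets found in the current round remain shareable among that round's sources. This is only an illustration of the proposal (plain bitmasks, no find_next_best_node() tie-breaking), applied to the 5-node example topology; it should yield 0 -> {2,3}, 1 -> {2,3}, 2 -> {4}, 3 -> {4}, 4 -> X.

/*
 * Hypothetical userspace sketch of the round-based proposal above,
 * not kernel code. Node masks are plain bitmasks; the distance matrix
 * is the 5-node example from the changelog.
 */
#include <stdio.h>

#define NR_NODES	5

static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 20, 40, 40, 80 },
	{ 20, 10, 40, 40, 80 },
	{ 40, 40, 10, 40, 80 },
	{ 40, 40, 40, 10, 80 },
	{ 80, 80, 80, 80, 10 },
};

int main(void)
{
	unsigned int sources = 0x3;		/* nodes 0 and 1 have CPUs */
	unsigned int excluded = 0x3;		/* sources of round 1 */
	unsigned int targets[NR_NODES] = { 0 };
	int n, t;

	while (sources) {
		unsigned int round_targets = 0;

		for (n = 0; n < NR_NODES; n++) {
			int best = -1;

			if (!(sources & (1u << n)))
				continue;
			/* nearest non-excluded node */
			for (t = 0; t < NR_NODES; t++)
				if (!(excluded & (1u << t)) &&
				    (best < 0 || dist[n][t] < dist[n][best]))
					best = t;
			/* take every node at that same best distance */
			for (t = 0; best >= 0 && t < NR_NODES; t++)
				if (!(excluded & (1u << t)) &&
				    dist[n][t] == dist[n][best])
					targets[n] |= 1u << t;
			round_targets |= targets[n];
		}
		/* only now do this round's targets become unavailable */
		excluded |= round_targets;
		sources = round_targets;
	}

	for (n = 0; n < NR_NODES; n++)
		printf("node %d -> 0x%x\n", n, targets[n]);
	return 0;
}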
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: > "Huang, Ying" <ying.huang@intel.com> writes: > >> Hi, Jagdish, >> >> Jagdish Gediya <jvgediya@linux.ibm.com> writes: >> > > ... > >>> e.g. with below NUMA topology, where node 0 & 1 are >>> cpu + dram nodes, node 2 & 3 are equally slower memory >>> only nodes, and node 4 is slowest memory only node, >>> >>> available: 5 nodes (0-4) >>> node 0 cpus: 0 1 >>> node 0 size: n MB >>> node 0 free: n MB >>> node 1 cpus: 2 3 >>> node 1 size: n MB >>> node 1 free: n MB >>> node 2 cpus: >>> node 2 size: n MB >>> node 2 free: n MB >>> node 3 cpus: >>> node 3 size: n MB >>> node 3 free: n MB >>> node 4 cpus: >>> node 4 size: n MB >>> node 4 free: n MB >>> node distances: >>> node 0 1 2 3 4 >>> 0: 10 20 40 40 80 >>> 1: 20 10 40 40 80 >>> 2: 40 40 10 40 80 >>> 3: 40 40 40 10 80 >>> 4: 80 80 80 80 10 >>> >>> The existing implementation gives below demotion targets, >>> >>> node demotion_target >>> 0 3, 2 >>> 1 4 >>> 2 X >>> 3 X >>> 4 X >>> >>> With this patch applied, below are the demotion targets, >>> >>> node demotion_target >>> 0 3, 2 >>> 1 3, 2 >>> 2 3 >>> 3 4 >>> 4 X >> >> For such machine, I think the perfect demotion order is, >> >> node demotion_target >> 0 2, 3 >> 1 2, 3 >> 2 4 >> 3 4 >> 4 X > > I guess the "equally slow nodes" is a confusing definition here. Now if the > system consists of 2 1GB equally slow memory and the firmware doesn't want to > differentiate between them, firmware can present a single NUMA node > with 2GB capacity? The fact that we are finding two NUMA nodes is a hint > that there is some difference between these two memory devices. This is > also captured by the fact that the distance between 2 and 3 is 40 and not 10. Do you have more information about this? Best Regards, Huang, Ying > For that specific topology where the distance between 2 and 3 is 40 and 2 > and 4 is 80, the demotion target derived by the new code is better > right? > > ... > > > -aneesh
"Huang, Ying" <ying.huang@intel.com> writes: > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: > >> "Huang, Ying" <ying.huang@intel.com> writes: >> >>> Hi, Jagdish, >>> >>> Jagdish Gediya <jvgediya@linux.ibm.com> writes: >>> >> >> ... >> >>>> e.g. with below NUMA topology, where node 0 & 1 are >>>> cpu + dram nodes, node 2 & 3 are equally slower memory >>>> only nodes, and node 4 is slowest memory only node, >>>> >>>> available: 5 nodes (0-4) >>>> node 0 cpus: 0 1 >>>> node 0 size: n MB >>>> node 0 free: n MB >>>> node 1 cpus: 2 3 >>>> node 1 size: n MB >>>> node 1 free: n MB >>>> node 2 cpus: >>>> node 2 size: n MB >>>> node 2 free: n MB >>>> node 3 cpus: >>>> node 3 size: n MB >>>> node 3 free: n MB >>>> node 4 cpus: >>>> node 4 size: n MB >>>> node 4 free: n MB >>>> node distances: >>>> node 0 1 2 3 4 >>>> 0: 10 20 40 40 80 >>>> 1: 20 10 40 40 80 >>>> 2: 40 40 10 40 80 >>>> 3: 40 40 40 10 80 >>>> 4: 80 80 80 80 10 >>>> >>>> The existing implementation gives below demotion targets, >>>> >>>> node demotion_target >>>> 0 3, 2 >>>> 1 4 >>>> 2 X >>>> 3 X >>>> 4 X >>>> >>>> With this patch applied, below are the demotion targets, >>>> >>>> node demotion_target >>>> 0 3, 2 >>>> 1 3, 2 >>>> 2 3 >>>> 3 4 >>>> 4 X >>> >>> For such machine, I think the perfect demotion order is, >>> >>> node demotion_target >>> 0 2, 3 >>> 1 2, 3 >>> 2 4 >>> 3 4 >>> 4 X >> >> I guess the "equally slow nodes" is a confusing definition here. Now if the >> system consists of 2 1GB equally slow memory and the firmware doesn't want to >> differentiate between them, firmware can present a single NUMA node >> with 2GB capacity? The fact that we are finding two NUMA nodes is a hint >> that there is some difference between these two memory devices. This is >> also captured by the fact that the distance between 2 and 3 is 40 and not 10. > > Do you have more information about this? Not sure I follow the question there. I was checking shouldn't firmware do a single NUMA node if two memory devices are of the same type? How will optane present such a config? Both the DIMMs will have the same proximity domain value and hence dax kmem will add them to the same NUMA node? If you are suggesting that firmware doesn't do that, then I agree with you that a demotion target like the below is good. node demotion_target 0 2, 3 1 2, 3 2 4 3 4 4 X We can also achieve that with a smiple change as below. @@ -3120,7 +3120,7 @@ static void __set_migration_target_nodes(void) { nodemask_t next_pass = NODE_MASK_NONE; nodemask_t this_pass = NODE_MASK_NONE; - nodemask_t used_targets = NODE_MASK_NONE; + nodemask_t this_pass_used_targets = NODE_MASK_NONE; int node, best_distance; /* @@ -3141,17 +3141,20 @@ static void __set_migration_target_nodes(void) /* * To avoid cycles in the migration "graph", ensure * that migration sources are not future targets by - * setting them in 'used_targets'. Do this only + * setting them in 'this_pass_used_targets'. Do this only * once per pass so that multiple source nodes can * share a target node. * - * 'used_targets' will become unavailable in future + * 'this_pass_used_targets' will become unavailable in future * passes. This limits some opportunities for * multiple source nodes to share a destination. */ - nodes_or(used_targets, used_targets, this_pass); + nodes_or(this_pass_used_targets, this_pass_used_targets, this_pass); for_each_node_mask(node, this_pass) { + + nodemask_t used_targets = this_pass_used_targets; + best_distance = -1; /*
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: > "Huang, Ying" <ying.huang@intel.com> writes: > >> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: >> >>> "Huang, Ying" <ying.huang@intel.com> writes: >>> >>>> Hi, Jagdish, >>>> >>>> Jagdish Gediya <jvgediya@linux.ibm.com> writes: >>>> >>> >>> ... >>> >>>>> e.g. with below NUMA topology, where node 0 & 1 are >>>>> cpu + dram nodes, node 2 & 3 are equally slower memory >>>>> only nodes, and node 4 is slowest memory only node, >>>>> >>>>> available: 5 nodes (0-4) >>>>> node 0 cpus: 0 1 >>>>> node 0 size: n MB >>>>> node 0 free: n MB >>>>> node 1 cpus: 2 3 >>>>> node 1 size: n MB >>>>> node 1 free: n MB >>>>> node 2 cpus: >>>>> node 2 size: n MB >>>>> node 2 free: n MB >>>>> node 3 cpus: >>>>> node 3 size: n MB >>>>> node 3 free: n MB >>>>> node 4 cpus: >>>>> node 4 size: n MB >>>>> node 4 free: n MB >>>>> node distances: >>>>> node 0 1 2 3 4 >>>>> 0: 10 20 40 40 80 >>>>> 1: 20 10 40 40 80 >>>>> 2: 40 40 10 40 80 >>>>> 3: 40 40 40 10 80 >>>>> 4: 80 80 80 80 10 >>>>> >>>>> The existing implementation gives below demotion targets, >>>>> >>>>> node demotion_target >>>>> 0 3, 2 >>>>> 1 4 >>>>> 2 X >>>>> 3 X >>>>> 4 X >>>>> >>>>> With this patch applied, below are the demotion targets, >>>>> >>>>> node demotion_target >>>>> 0 3, 2 >>>>> 1 3, 2 >>>>> 2 3 >>>>> 3 4 >>>>> 4 X >>>> >>>> For such machine, I think the perfect demotion order is, >>>> >>>> node demotion_target >>>> 0 2, 3 >>>> 1 2, 3 >>>> 2 4 >>>> 3 4 >>>> 4 X >>> >>> I guess the "equally slow nodes" is a confusing definition here. Now if the >>> system consists of 2 1GB equally slow memory and the firmware doesn't want to >>> differentiate between them, firmware can present a single NUMA node >>> with 2GB capacity? The fact that we are finding two NUMA nodes is a hint >>> that there is some difference between these two memory devices. This is >>> also captured by the fact that the distance between 2 and 3 is 40 and not 10. >> >> Do you have more information about this? > > Not sure I follow the question there. I was checking shouldn't firmware > do a single NUMA node if two memory devices are of the same type? How will > optane present such a config? Both the DIMMs will have the same > proximity domain value and hence dax kmem will add them to the same NUMA > node? Sorry for confusing. I just wanted to check whether you have more information about the machine configuration above. The machines in my hand have no complex NUMA topology as in the patch description. > If you are suggesting that firmware doesn't do that, then I agree with you > that a demotion target like the below is good. > > node demotion_target > 0 2, 3 > 1 2, 3 > 2 4 > 3 4 > 4 X > > We can also achieve that with a smiple change as below. Glad to see the demotion order can be implemented in a simple way. My concern is that is it necessary to do this? If there are real machines with the NUMA topology, then I think it's good to add the support. But if not, why do we make the code complex unnecessarily? I don't have these kind of machines, do you have and will have? 
> @@ -3120,7 +3120,7 @@ static void __set_migration_target_nodes(void) > { > nodemask_t next_pass = NODE_MASK_NONE; > nodemask_t this_pass = NODE_MASK_NONE; > - nodemask_t used_targets = NODE_MASK_NONE; > + nodemask_t this_pass_used_targets = NODE_MASK_NONE; > int node, best_distance; > > /* > @@ -3141,17 +3141,20 @@ static void __set_migration_target_nodes(void) > /* > * To avoid cycles in the migration "graph", ensure > * that migration sources are not future targets by > - * setting them in 'used_targets'. Do this only > + * setting them in 'this_pass_used_targets'. Do this only > * once per pass so that multiple source nodes can > * share a target node. > * > - * 'used_targets' will become unavailable in future > + * 'this_pass_used_targets' will become unavailable in future > * passes. This limits some opportunities for > * multiple source nodes to share a destination. > */ > - nodes_or(used_targets, used_targets, this_pass); > + nodes_or(this_pass_used_targets, this_pass_used_targets, this_pass); > > for_each_node_mask(node, this_pass) { > + > + nodemask_t used_targets = this_pass_used_targets; > + > best_distance = -1; > > /* Best Regards, Huang, Ying
"Huang, Ying" <ying.huang@intel.com> writes: > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: > >> "Huang, Ying" <ying.huang@intel.com> writes: >> >>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: >>> >>>> "Huang, Ying" <ying.huang@intel.com> writes: >>>> >>>>> Hi, Jagdish, >>>>> >>>>> Jagdish Gediya <jvgediya@linux.ibm.com> writes: >>>>> >>>> >>>> ... >>>> >>>>>> e.g. with below NUMA topology, where node 0 & 1 are >>>>>> cpu + dram nodes, node 2 & 3 are equally slower memory >>>>>> only nodes, and node 4 is slowest memory only node, >>>>>> >>>>>> available: 5 nodes (0-4) >>>>>> node 0 cpus: 0 1 >>>>>> node 0 size: n MB >>>>>> node 0 free: n MB >>>>>> node 1 cpus: 2 3 >>>>>> node 1 size: n MB >>>>>> node 1 free: n MB >>>>>> node 2 cpus: >>>>>> node 2 size: n MB >>>>>> node 2 free: n MB >>>>>> node 3 cpus: >>>>>> node 3 size: n MB >>>>>> node 3 free: n MB >>>>>> node 4 cpus: >>>>>> node 4 size: n MB >>>>>> node 4 free: n MB >>>>>> node distances: >>>>>> node 0 1 2 3 4 >>>>>> 0: 10 20 40 40 80 >>>>>> 1: 20 10 40 40 80 >>>>>> 2: 40 40 10 40 80 >>>>>> 3: 40 40 40 10 80 >>>>>> 4: 80 80 80 80 10 >>>>>> >>>>>> The existing implementation gives below demotion targets, >>>>>> >>>>>> node demotion_target >>>>>> 0 3, 2 >>>>>> 1 4 >>>>>> 2 X >>>>>> 3 X >>>>>> 4 X >>>>>> >>>>>> With this patch applied, below are the demotion targets, >>>>>> >>>>>> node demotion_target >>>>>> 0 3, 2 >>>>>> 1 3, 2 >>>>>> 2 3 >>>>>> 3 4 >>>>>> 4 X >>>>> >>>>> For such machine, I think the perfect demotion order is, >>>>> >>>>> node demotion_target >>>>> 0 2, 3 >>>>> 1 2, 3 >>>>> 2 4 >>>>> 3 4 >>>>> 4 X >>>> >>>> I guess the "equally slow nodes" is a confusing definition here. Now if the >>>> system consists of 2 1GB equally slow memory and the firmware doesn't want to >>>> differentiate between them, firmware can present a single NUMA node >>>> with 2GB capacity? The fact that we are finding two NUMA nodes is a hint >>>> that there is some difference between these two memory devices. This is >>>> also captured by the fact that the distance between 2 and 3 is 40 and not 10. >>> >>> Do you have more information about this? >> >> Not sure I follow the question there. I was checking shouldn't firmware >> do a single NUMA node if two memory devices are of the same type? How will >> optane present such a config? Both the DIMMs will have the same >> proximity domain value and hence dax kmem will add them to the same NUMA >> node? > > Sorry for confusing. I just wanted to check whether you have more > information about the machine configuration above. The machines in my > hand have no complex NUMA topology as in the patch description. Even with simple topologies like below available: 3 nodes (0-2) node 0 cpus: 0 1 node 0 size: 4046 MB node 0 free: 3478 MB node 1 cpus: 2 3 node 1 size: 4090 MB node 1 free: 3430 MB node 2 cpus: node 2 size: 4074 MB node 2 free: 4037 MB node distances: node 0 1 2 0: 10 20 40 1: 20 10 40 2: 40 40 10 With current code we get demotion targets assigned as below [ 0.337307] Demotion nodes for Node 0: 2 [ 0.337351] Demotion nodes for Node 1: [ 0.337380] Demotion nodes for Node 2: I guess we should fix that to be below? [ 0.344554] Demotion nodes for Node 0: 2 [ 0.344605] Demotion nodes for Node 1: 2 [ 0.344638] Demotion nodes for Node 2: Most of the tests we are doing are using Qemu to simulate this. We started looking at this to avoid using demotion completely when slow memory is not present. 
ie, we should have a different way to identify demotion targets other than node_states[N_MEMORY]. Virtualized platforms can have configs with memory only NUMA nodes with DRAM and we don't want to consider those as demotion targets. While we are at it can you let us know how topology will look on a system with two optane DIMMs? Do both appear with the same target_node? > >> If you are suggesting that firmware doesn't do that, then I agree with you >> that a demotion target like the below is good. >> >> node demotion_target >> 0 2, 3 >> 1 2, 3 >> 2 4 >> 3 4 >> 4 X >> >> We can also achieve that with a smiple change as below. > > Glad to see the demotion order can be implemented in a simple way. > > My concern is that is it necessary to do this? If there are real > machines with the NUMA topology, then I think it's good to add the > support. But if not, why do we make the code complex unnecessarily? > > I don't have these kind of machines, do you have and will have? > Based on the above, we still need to get the simpler fix merged right? >> @@ -3120,7 +3120,7 @@ static void __set_migration_target_nodes(void) >> { >> nodemask_t next_pass = NODE_MASK_NONE; >> nodemask_t this_pass = NODE_MASK_NONE; >> - nodemask_t used_targets = NODE_MASK_NONE; >> + nodemask_t this_pass_used_targets = NODE_MASK_NONE; >> int node, best_distance; >> >> /* >> @@ -3141,17 +3141,20 @@ static void __set_migration_target_nodes(void) >> /* >> * To avoid cycles in the migration "graph", ensure >> * that migration sources are not future targets by >> - * setting them in 'used_targets'. Do this only >> + * setting them in 'this_pass_used_targets'. Do this only >> * once per pass so that multiple source nodes can >> * share a target node. >> * >> - * 'used_targets' will become unavailable in future >> + * 'this_pass_used_targets' will become unavailable in future >> * passes. This limits some opportunities for >> * multiple source nodes to share a destination. >> */ >> - nodes_or(used_targets, used_targets, this_pass); >> + nodes_or(this_pass_used_targets, this_pass_used_targets, this_pass); >> >> for_each_node_mask(node, this_pass) { >> + >> + nodemask_t used_targets = this_pass_used_targets; >> + >> best_distance = -1; >> >> /* > > Best Regards, > Huang, Ying
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: > "Huang, Ying" <ying.huang@intel.com> writes: > >> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: >> >>> "Huang, Ying" <ying.huang@intel.com> writes: >>> >>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes: >>>> >>>>> "Huang, Ying" <ying.huang@intel.com> writes: >>>>> >>>>>> Hi, Jagdish, >>>>>> >>>>>> Jagdish Gediya <jvgediya@linux.ibm.com> writes: >>>>>> >>>>> >>>>> ... >>>>> >>>>>>> e.g. with below NUMA topology, where node 0 & 1 are >>>>>>> cpu + dram nodes, node 2 & 3 are equally slower memory >>>>>>> only nodes, and node 4 is slowest memory only node, >>>>>>> >>>>>>> available: 5 nodes (0-4) >>>>>>> node 0 cpus: 0 1 >>>>>>> node 0 size: n MB >>>>>>> node 0 free: n MB >>>>>>> node 1 cpus: 2 3 >>>>>>> node 1 size: n MB >>>>>>> node 1 free: n MB >>>>>>> node 2 cpus: >>>>>>> node 2 size: n MB >>>>>>> node 2 free: n MB >>>>>>> node 3 cpus: >>>>>>> node 3 size: n MB >>>>>>> node 3 free: n MB >>>>>>> node 4 cpus: >>>>>>> node 4 size: n MB >>>>>>> node 4 free: n MB >>>>>>> node distances: >>>>>>> node 0 1 2 3 4 >>>>>>> 0: 10 20 40 40 80 >>>>>>> 1: 20 10 40 40 80 >>>>>>> 2: 40 40 10 40 80 >>>>>>> 3: 40 40 40 10 80 >>>>>>> 4: 80 80 80 80 10 >>>>>>> >>>>>>> The existing implementation gives below demotion targets, >>>>>>> >>>>>>> node demotion_target >>>>>>> 0 3, 2 >>>>>>> 1 4 >>>>>>> 2 X >>>>>>> 3 X >>>>>>> 4 X >>>>>>> >>>>>>> With this patch applied, below are the demotion targets, >>>>>>> >>>>>>> node demotion_target >>>>>>> 0 3, 2 >>>>>>> 1 3, 2 >>>>>>> 2 3 >>>>>>> 3 4 >>>>>>> 4 X >>>>>> >>>>>> For such machine, I think the perfect demotion order is, >>>>>> >>>>>> node demotion_target >>>>>> 0 2, 3 >>>>>> 1 2, 3 >>>>>> 2 4 >>>>>> 3 4 >>>>>> 4 X >>>>> >>>>> I guess the "equally slow nodes" is a confusing definition here. Now if the >>>>> system consists of 2 1GB equally slow memory and the firmware doesn't want to >>>>> differentiate between them, firmware can present a single NUMA node >>>>> with 2GB capacity? The fact that we are finding two NUMA nodes is a hint >>>>> that there is some difference between these two memory devices. This is >>>>> also captured by the fact that the distance between 2 and 3 is 40 and not 10. >>>> >>>> Do you have more information about this? >>> >>> Not sure I follow the question there. I was checking shouldn't firmware >>> do a single NUMA node if two memory devices are of the same type? How will >>> optane present such a config? Both the DIMMs will have the same >>> proximity domain value and hence dax kmem will add them to the same NUMA >>> node? >> >> Sorry for confusing. I just wanted to check whether you have more >> information about the machine configuration above. The machines in my >> hand have no complex NUMA topology as in the patch description. > > > Even with simple topologies like below > > available: 3 nodes (0-2) > node 0 cpus: 0 1 > node 0 size: 4046 MB > node 0 free: 3478 MB > node 1 cpus: 2 3 > node 1 size: 4090 MB > node 1 free: 3430 MB > node 2 cpus: > node 2 size: 4074 MB > node 2 free: 4037 MB > node distances: > node 0 1 2 > 0: 10 20 40 > 1: 20 10 40 > 2: 40 40 10 > > With current code we get demotion targets assigned as below > > [ 0.337307] Demotion nodes for Node 0: 2 > [ 0.337351] Demotion nodes for Node 1: > [ 0.337380] Demotion nodes for Node 2: > > I guess we should fix that to be below? 
> > [ 0.344554] Demotion nodes for Node 0: 2 > [ 0.344605] Demotion nodes for Node 1: 2 > [ 0.344638] Demotion nodes for Node 2: If the cross-socket link has enough bandwidth to accommodate the PMEM throughput, the new one is better. If it hasn't, the old one may be better. So, I think we need some kind of user space overridden support here. Right? > Most of the tests we are doing are using Qemu to simulate this. We > started looking at this to avoid using demotion completely when slow > memory is not present. ie, we should have a different way to identify > demotion targets other than node_states[N_MEMORY]. Virtualized platforms > can have configs with memory only NUMA nodes with DRAM and we don't > want to consider those as demotion targets. Even if the demotion targets are set for some node, the demotion will not work before enabling demotion via sysfs (/sys/kernel/mm/numa/demotion_enabled). So for system without slow memory, just don't enable demotion. > While we are at it can you let us know how topology will look on a > system with two optane DIMMs? Do both appear with the same > target_node? In my test system, multiple optane DIMMs in one socket will be represented as one NUMA node. I remember Baolin has different configuration. Hi, Baolin, Can you provide some information about this? >> >>> If you are suggesting that firmware doesn't do that, then I agree with you >>> that a demotion target like the below is good. >>> >>> node demotion_target >>> 0 2, 3 >>> 1 2, 3 >>> 2 4 >>> 3 4 >>> 4 X >>> >>> We can also achieve that with a smiple change as below. >> >> Glad to see the demotion order can be implemented in a simple way. >> >> My concern is that is it necessary to do this? If there are real >> machines with the NUMA topology, then I think it's good to add the >> support. But if not, why do we make the code complex unnecessarily? >> >> I don't have these kind of machines, do you have and will have? >> > > > Based on the above, we still need to get the simpler fix merged right? Or user overridden support? Best Regards, Huang, Ying [snip]
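[Editor's note: for readers unfamiliar with the knob referenced above, demotion stays off until /sys/kernel/mm/numa/demotion_enabled is set, typically with "echo 1 > /sys/kernel/mm/numa/demotion_enabled" as root. The snippet below is only an illustrative C equivalent of that shell command, written for this write-up; it is not part of the patch or the kernel tree.]

```c
#include <stdio.h>

int main(int argc, char **argv)
{
	/* "1"/"true" enables demotion, "0"/"false" disables it. */
	const char *val = (argc > 1) ? argv[1] : "1";
	FILE *f = fopen("/sys/kernel/mm/numa/demotion_enabled", "w");

	if (!f) {
		perror("demotion_enabled");
		return 1;
	}
	fprintf(f, "%s\n", val);
	fclose(f);
	return 0;
}
```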
On 3/31/2022 4:58 PM, Huang, Ying wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>
>> "Huang, Ying" <ying.huang@intel.com> writes:
>>
>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>
>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>
>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>>
>>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>>
>>>>>>> Hi, Jagdish,
>>>>>>>
>>>>>>> Jagdish Gediya <jvgediya@linux.ibm.com> writes:
>>>>>>>
>>>>>>
>>>>>> ...
>>>>>>
>>>>>>>> e.g. with below NUMA topology, where node 0 & 1 are
>>>>>>>> cpu + dram nodes, node 2 & 3 are equally slower memory
>>>>>>>> only nodes, and node 4 is slowest memory only node,
>>>>>>>>
>>>>>>>> available: 5 nodes (0-4)
>>>>>>>> node 0 cpus: 0 1
>>>>>>>> node 0 size: n MB
>>>>>>>> node 0 free: n MB
>>>>>>>> node 1 cpus: 2 3
>>>>>>>> node 1 size: n MB
>>>>>>>> node 1 free: n MB
>>>>>>>> node 2 cpus:
>>>>>>>> node 2 size: n MB
>>>>>>>> node 2 free: n MB
>>>>>>>> node 3 cpus:
>>>>>>>> node 3 size: n MB
>>>>>>>> node 3 free: n MB
>>>>>>>> node 4 cpus:
>>>>>>>> node 4 size: n MB
>>>>>>>> node 4 free: n MB
>>>>>>>> node distances:
>>>>>>>> node   0   1   2   3   4
>>>>>>>>    0:  10  20  40  40  80
>>>>>>>>    1:  20  10  40  40  80
>>>>>>>>    2:  40  40  10  40  80
>>>>>>>>    3:  40  40  40  10  80
>>>>>>>>    4:  80  80  80  80  10
>>>>>>>>
>>>>>>>> The existing implementation gives below demotion targets,
>>>>>>>>
>>>>>>>> node  demotion_target
>>>>>>>>  0              3, 2
>>>>>>>>  1              4
>>>>>>>>  2              X
>>>>>>>>  3              X
>>>>>>>>  4              X
>>>>>>>>
>>>>>>>> With this patch applied, below are the demotion targets,
>>>>>>>>
>>>>>>>> node  demotion_target
>>>>>>>>  0              3, 2
>>>>>>>>  1              3, 2
>>>>>>>>  2              3
>>>>>>>>  3              4
>>>>>>>>  4              X
>>>>>>>
>>>>>>> For such a machine, I think the perfect demotion order is,
>>>>>>>
>>>>>>> node  demotion_target
>>>>>>>  0              2, 3
>>>>>>>  1              2, 3
>>>>>>>  2              4
>>>>>>>  3              4
>>>>>>>  4              X
>>>>>>
>>>>>> I guess the "equally slow nodes" is a confusing definition here. Now if the
>>>>>> system consists of two 1GB equally slow memory devices and the firmware
>>>>>> doesn't want to differentiate between them, firmware can present a single
>>>>>> NUMA node with 2GB capacity? The fact that we are finding two NUMA nodes is
>>>>>> a hint that there is some difference between these two memory devices. This
>>>>>> is also captured by the fact that the distance between 2 and 3 is 40 and
>>>>>> not 10.
>>>>>
>>>>> Do you have more information about this?
>>>>
>>>> Not sure I follow the question there. I was checking: shouldn't firmware
>>>> present a single NUMA node if two memory devices are of the same type? How
>>>> will Optane present such a config? Both the DIMMs will have the same
>>>> proximity domain value, and hence dax kmem will add them to the same NUMA
>>>> node?
>>>
>>> Sorry for the confusion.  I just wanted to check whether you have more
>>> information about the machine configuration above.  The machines in my
>>> hands have no complex NUMA topology as in the patch description.
>>
>>
>> Even with a simple topology like the one below,
>>
>> available: 3 nodes (0-2)
>> node 0 cpus: 0 1
>> node 0 size: 4046 MB
>> node 0 free: 3478 MB
>> node 1 cpus: 2 3
>> node 1 size: 4090 MB
>> node 1 free: 3430 MB
>> node 2 cpus:
>> node 2 size: 4074 MB
>> node 2 free: 4037 MB
>> node distances:
>> node   0   1   2
>>   0:  10  20  40
>>   1:  20  10  40
>>   2:  40  40  10
>>
>> the current code assigns demotion targets as below:
>>
>> [    0.337307] Demotion nodes for Node 0: 2
>> [    0.337351] Demotion nodes for Node 1:
>> [    0.337380] Demotion nodes for Node 2:
>>
>> I guess we should fix that to be as below?
>>
>> [    0.344554] Demotion nodes for Node 0: 2
>> [    0.344605] Demotion nodes for Node 1: 2
>> [    0.344638] Demotion nodes for Node 2:
>
> If the cross-socket link has enough bandwidth to accommodate the PMEM
> throughput, the new one is better.  If it doesn't, the old one may be
> better.  So, I think we need some kind of user-space override support
> here, right?
>
>> Most of the tests we are doing use QEMU to simulate this. We started
>> looking at this to avoid using demotion completely when slow memory is
>> not present, i.e. we should have a different way to identify demotion
>> targets other than node_states[N_MEMORY]. Virtualized platforms can
>> have configs with memory-only NUMA nodes backed by DRAM, and we don't
>> want to consider those as demotion targets.
>
> Even if demotion targets are set for some node, demotion will not work
> before it is enabled via sysfs (/sys/kernel/mm/numa/demotion_enabled).
> So, for a system without slow memory, just don't enable demotion.
>
>> While we are at it, can you let us know how the topology will look on a
>> system with two Optane DIMMs? Do both appear with the same
>> target_node?
>
> In my test system, multiple Optane DIMMs in one socket are represented
> as one NUMA node.
>
> I remember Baolin has a different configuration.
>
> Hi, Baolin, can you provide some information about this?

Sure. We have real machines with 2 Optane DIMMs, and they are
represented as 2 NUMA nodes. So we want demotion to support multiple
target nodes.

available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 62153 MB
node 0 free: 447 MB
node 1 cpus:
node 1 size: 126969 MB
node 1 free: 84099 MB
node 2 cpus:
node 2 size: 127006 MB
node 2 free: 126925 MB
node distances:
node   0   1   2
  0:  10  20  20
  1:  20  10  20
  2:  20  20  10
On Wed, 30 Mar 2022 22:06:52 +0530
Jagdish Gediya <jvgediya@linux.ibm.com> wrote:

> Hi Huang,
>
> On Wed, Mar 30, 2022 at 02:46:51PM +0800, Huang, Ying wrote:
> > Hi, Jagdish,
> >
> > Jagdish Gediya <jvgediya@linux.ibm.com> writes:
> >
> > > The current implementation to identify the demotion
> > > targets limits some of the opportunities to share
> > > the demotion targets between multiple source nodes.
> >
> > Yes. It sounds reasonable to share demotion targets among multiple
> > source nodes.
> >
> > One question: are the example machines below real hardware now or in
> > the near future? Or do you just think they are possible?
>
> They are not real hardware right now; they are future possibilities.

I'll strengthen that a bit to say they are highly likely to turn up
fairly soon. Often they will be the result of specific interleaving
decisions and might not come from SRAT (e.g. they might be added later
as a result of CXL discovery), but the principle will be the same.
Also, in some cases the firmware will have done the CXL setup, so it
will be via SRAT.

Example 1:

> > e.g. with below NUMA topology, where node 0 & 1 are
> > cpu + dram nodes, node 2 & 3 are equally slower memory
> > only nodes, and node 4 is slowest memory only node,
> >

A couple of near-term examples of systems that will look like this:

A 2-socket system, where each socket has DRAM (0, 1) + NVDIMM (2, 3)
being used as DRAM. Also CXL-attached memory: to get maximum bandwidth
to a large remote pool, interleave across host bridges in the two
sockets (node 4).

An alternative for node 4 is that the system is using an IO expander
bridge on the CPU interconnect (effectively a CPU-less CPU :)

Example 2:

> > e.g. with below NUMA topology, where node 0, 1 & 2 are
> > cpu + dram nodes and node 3 is slow memory node,
> >

3-CPU-socket machines are still unusual, but the 2- or 4-socket
equivalents are simplifications of example 1.

Example 3:

> > with below NUMA topology, where node 0 & 2 are cpu + dram
> > nodes and node 1 & 3 are slow memory nodes,

I believe this is what today's simple DRAM + NVDIMM (as cheap DRAM)
2-socket servers look like.

Example 4:

> Another example, node 0 & 2 are cpu + dram nodes and node 1 is a slow
> memory node near node 0,
>

A normal system with CXL-attached DRAM below an RP in node 0.

...

Jonathan
diff --git a/mm/migrate.c b/mm/migrate.c
index 3d60823afd2d..7ec8d934e706 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2381,10 +2381,13 @@ static int establish_migrate_target(int node, nodemask_t *used,
  */
 static void __set_migration_target_nodes(void)
 {
-	nodemask_t next_pass	= NODE_MASK_NONE;
-	nodemask_t this_pass	= NODE_MASK_NONE;
 	nodemask_t used_targets = NODE_MASK_NONE;
 	int node, best_distance;
+	nodemask_t *src_nodes;
+
+	src_nodes = kcalloc(nr_node_ids, sizeof(nodemask_t), GFP_KERNEL);
+	if (!src_nodes)
+		return;
 
 	/*
 	 * Avoid any oddities like cycles that could occur
@@ -2393,29 +2396,39 @@ static void __set_migration_target_nodes(void)
 	 */
 	disable_all_migrate_targets();
 
-	/*
-	 * Allocations go close to CPUs, first.  Assume that
-	 * the migration path starts at the nodes with CPUs.
-	 */
-	next_pass = node_states[N_CPU];
-again:
-	this_pass = next_pass;
-	next_pass = NODE_MASK_NONE;
-	/*
-	 * To avoid cycles in the migration "graph", ensure
-	 * that migration sources are not future targets by
-	 * setting them in 'used_targets'.  Do this only
-	 * once per pass so that multiple source nodes can
-	 * share a target node.
-	 *
-	 * 'used_targets' will become unavailable in future
-	 * passes.  This limits some opportunities for
-	 * multiple source nodes to share a destination.
-	 */
-	nodes_or(used_targets, used_targets, this_pass);
+	for_each_online_node(node) {
+		int tmp_node;
 
-	for_each_node_mask(node, this_pass) {
 		best_distance = -1;
+		used_targets = NODE_MASK_NONE;
+
+		/*
+		 * Avoid adding same node as the demotion target.
+		 */
+		node_set(node, used_targets);
+
+		/*
+		 * Add CPU NUMA nodes to the used target list so that it
+		 * won't be considered a demotion target.
+		 */
+		nodes_or(used_targets, used_targets, node_states[N_CPU]);
+
+		/*
+		 * Add all nodes that has appeared as source node of demotion
+		 * for this target node.
+		 *
+		 * To avoid cycles in the migration "graph", ensure
+		 * that migration sources are not future targets by
+		 * setting them in 'used_targets'.
+		 */
+		for_each_node_mask(tmp_node, src_nodes[node])
+			nodes_or(used_targets, used_targets, src_nodes[tmp_node]);
+
+		/*
+		 * Now update the demotion src nodes with other nodes in graph
+		 * which got computed above.
+		 */
+		nodes_or(src_nodes[node], src_nodes[node], used_targets);
 
 		/*
 		 * Try to set up the migration path for the node, and the target
@@ -2434,20 +2447,14 @@ static void __set_migration_target_nodes(void)
 			best_distance = node_distance(node, target_node);
 
 			/*
-			 * Visit targets from this pass in the next pass.
-			 * Eventually, every node will have been part of
-			 * a pass, and will become set in 'used_targets'.
+			 * Add this node in the src_nodes list so that we can
+			 * detect the looping.
 			 */
-			node_set(target_node, next_pass);
+			node_set(node, src_nodes[target_node]);
 		} while (1);
 	}
 
-	/*
-	 * 'next_pass' contains nodes which became migration
-	 * targets in this pass.  Make additional passes until
-	 * no more migrations targets are available.
-	 */
-	if (!nodes_empty(next_pass))
-		goto again;
+
+	kfree(src_nodes);
 }
 
 /*