Message ID | 20210126003421.45897BF4@viggo.jf.intel.com (mailing list archive)
---|---
State | New, archived
Series | Migrate Pages in lieu of discard
On Mon, Jan 25, 2021 at 4:41 PM Dave Hansen <dave.hansen@linux.intel.com> wrote: > > > From: Dave Hansen <dave.hansen@linux.intel.com> > > When memory fills up on a node, memory contents can be > automatically migrated to another node. The biggest problems are > knowing when to migrate and to where the migration should be > targeted. > > The most straightforward way to generate the "to where" list > would be to follow the page allocator fallback lists. Those > lists already tell us, if memory is full, where to look next. It > would also be logical to move memory in that order. > > But, the allocator fallback lists have a fatal flaw: most nodes > appear in all the lists. This would potentially lead to > migration cycles (A->B, B->A, A->B, ...). > > Instead of using the allocator fallback lists directly, keep a > separate node migration ordering. But, reuse the same data used > to generate page allocator fallback in the first place: > find_next_best_node(). > > This means that the firmware data used to populate node distances > essentially dictates the ordering for now. It should also be > architecture-neutral since all NUMA architectures have a working > find_next_best_node(). > > The protocol for node_demotion[] access and writing is not > standard. It has no specific locking and is intended to be read > locklessly. Readers must take care to avoid observing changes > that appear incoherent. This was done so that node_demotion[] > locking has no chance of becoming a bottleneck on large systems > with lots of CPUs in direct reclaim. > > This code is unused for now. It will be called later in the > series. 
> > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> > Cc: Yang Shi <yang.shi@linux.alibaba.com> > Cc: David Rientjes <rientjes@google.com> > Cc: Huang Ying <ying.huang@intel.com> > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: David Hildenbrand <david@redhat.com> > Cc: osalvador <osalvador@suse.de> > --- > > b/mm/internal.h | 5 + > b/mm/migrate.c | 137 +++++++++++++++++++++++++++++++++++++++++++++++++++++- > b/mm/page_alloc.c | 2 > 3 files changed, 142 insertions(+), 2 deletions(-) > > diff -puN mm/internal.h~auto-setup-default-migration-path-from-firmware mm/internal.h > --- a/mm/internal.h~auto-setup-default-migration-path-from-firmware 2021-01-25 16:23:10.607866706 -0800 > +++ b/mm/internal.h 2021-01-25 16:23:10.616866706 -0800 > @@ -515,12 +515,17 @@ static inline void mminit_validate_memmo > > #ifdef CONFIG_NUMA > extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int); > +extern int find_next_best_node(int node, nodemask_t *used_node_mask); > #else > static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask, > unsigned int order) > { > return NODE_RECLAIM_NOSCAN; > } > +static inline int find_next_best_node(int node, nodemask_t *used_node_mask) > +{ > + return NUMA_NO_NODE; > +} > #endif > > extern int hwpoison_filter(struct page *p); > diff -puN mm/migrate.c~auto-setup-default-migration-path-from-firmware mm/migrate.c > --- a/mm/migrate.c~auto-setup-default-migration-path-from-firmware 2021-01-25 16:23:10.609866706 -0800 > +++ b/mm/migrate.c 2021-01-25 16:23:10.617866706 -0800 > @@ -1161,6 +1161,10 @@ out: > return rc; > } > > +/* > + * Writes to this array occur without locking. READ_ONCE() > + * is recommended for readers to ensure consistent reads. > + */ > static int node_demotion[MAX_NUMNODES] = {[0 ... 
MAX_NUMNODES - 1] = NUMA_NO_NODE}; > > /** > @@ -1174,7 +1178,13 @@ static int node_demotion[MAX_NUMNODES] = > */ > int next_demotion_node(int node) > { > - return node_demotion[node]; > + /* > + * node_demotion[] is updated without excluding > + * this function from running. READ_ONCE() avoids > + * reading multiple, inconsistent 'node' values > + * during an update. > + */ Don't we need a smp_rmb() here? The single write barrier might be not enough in migration target set. Typically a write barrier should be used in pairs with a read barrier. > + return READ_ONCE(node_demotion[node]); Why not consolidate the patch #4 in this patch? The patch #4 just add the definition of node_demotion and the function, then the function is changed in this patch, and the function is not used by anyone between the adding and changing. > } > > /* > @@ -3124,3 +3134,128 @@ void migrate_vma_finalize(struct migrate > } > EXPORT_SYMBOL(migrate_vma_finalize); > #endif /* CONFIG_DEVICE_PRIVATE */ > + > +/* Disable reclaim-based migration. */ > +static void disable_all_migrate_targets(void) > +{ > + int node; > + > + for_each_online_node(node) > + node_demotion[node] = NUMA_NO_NODE; > +} > + > +/* > + * Find an automatic demotion target for 'node'. > + * Failing here is OK. It might just indicate > + * being at the end of a chain. > + */ > +static int establish_migrate_target(int node, nodemask_t *used) > +{ > + int migration_target; > + > + /* > + * Can not set a migration target on a > + * node with it already set. > + * > + * No need for READ_ONCE() here since this > + * in the write path for node_demotion[]. > + * This should be the only thread writing. 
> + */ > + if (node_demotion[node] != NUMA_NO_NODE) > + return NUMA_NO_NODE; > + > + migration_target = find_next_best_node(node, used); > + if (migration_target == NUMA_NO_NODE) > + return NUMA_NO_NODE; > + > + node_demotion[node] = migration_target; > + > + return migration_target; > +} > + > +/* > + * When memory fills up on a node, memory contents can be > + * automatically migrated to another node instead of > + * discarded at reclaim. > + * > + * Establish a "migration path" which will start at nodes > + * with CPUs and will follow the priorities used to build the > + * page allocator zonelists. > + * > + * The difference here is that cycles must be avoided. If > + * node0 migrates to node1, then neither node1, nor anything > + * node1 migrates to can migrate to node0. > + * > + * This function can run simultaneously with readers of > + * node_demotion[]. However, it can not run simultaneously > + * with itself. Exclusion is provided by memory hotplug events > + * being single-threaded. Maybe an example diagram for the physical topology and how the migration target is generated in the comment seems helpful to understand the code. > + */ > +void __set_migration_target_nodes(void) > +{ > + nodemask_t next_pass = NODE_MASK_NONE; > + nodemask_t this_pass = NODE_MASK_NONE; > + nodemask_t used_targets = NODE_MASK_NONE; > + int node; > + > + /* > + * Avoid any oddities like cycles that could occur > + * from changes in the topology. This will leave > + * a momentary gap when migration is disabled. > + */ > + disable_all_migrate_targets(); > + > + /* > + * Ensure that the "disable" is visible across the system. > + * Readers will see either a combination of before+disable > + * state or disable+after. They will never see before and > + * after state together. > + * > + * The before+after state together might have cycles and > + * could cause readers to do things like loop until this > + * function finishes. 
This ensures they can only see a > + * single "bad" read and would, for instance, only loop > + * once. > + */ > + smp_wmb(); > + > + /* > + * Allocations go close to CPUs, first. Assume that > + * the migration path starts at the nodes with CPUs. > + */ > + next_pass = node_states[N_CPU]; > +again: > + this_pass = next_pass; > + next_pass = NODE_MASK_NONE; > + /* > + * To avoid cycles in the migration "graph", ensure > + * that migration sources are not future targets by > + * setting them in 'used_targets'. Do this only > + * once per pass so that multiple source nodes can > + * share a target node. > + * > + * 'used_targets' will become unavailable in future > + * passes. This limits some opportunities for > + * multiple source nodes to share a desintation. s/desination/destination > + */ > + nodes_or(used_targets, used_targets, this_pass); > + for_each_node_mask(node, this_pass) { > + int target_node = establish_migrate_target(node, &used_targets); > + > + if (target_node == NUMA_NO_NODE) > + continue; > + > + /* Visit targets from this pass in the next pass: */ > + node_set(target_node, next_pass); > + } > + /* Is another pass necessary? */ > + if (!nodes_empty(next_pass)) > + goto again; > +} > + > +void set_migration_target_nodes(void) It seems this function is not called outside migrate.c, so it should be static. > +{ > + get_online_mems(); > + __set_migration_target_nodes(); > + put_online_mems(); > +} > diff -puN mm/page_alloc.c~auto-setup-default-migration-path-from-firmware mm/page_alloc.c > --- a/mm/page_alloc.c~auto-setup-default-migration-path-from-firmware 2021-01-25 16:23:10.612866706 -0800 > +++ b/mm/page_alloc.c 2021-01-25 16:23:10.619866706 -0800 > @@ -5704,7 +5704,7 @@ static int node_load[MAX_NUMNODES]; > * > * Return: node id of the found node or %NUMA_NO_NODE if no node is found. 
> */ > -static int find_next_best_node(int node, nodemask_t *used_node_mask) > +int find_next_best_node(int node, nodemask_t *used_node_mask) > { > int n, val; > int min_val = INT_MAX; > _ >
On 1/29/21 12:46 PM, Yang Shi wrote: ... >> int next_demotion_node(int node) >> { >> - return node_demotion[node]; >> + /* >> + * node_demotion[] is updated without excluding >> + * this function from running. READ_ONCE() avoids >> + * reading multiple, inconsistent 'node' values >> + * during an update. >> + */ > > Don't we need a smp_rmb() here? The single write barrier might be not > enough in migration target set. Typically a write barrier should be > used in pairs with a read barrier. I don't think we need one, practically. Since there is no locking against node_demotion[] updates, although a smp_rmb() would ensure that this read is up-to-date, it could change freely after the smp_rmb(). In other words, smp_rmb() would shrink the window where a "stale" read could occur but would not eliminate it. >> + return READ_ONCE(node_demotion[node]); > > Why not consolidate the patch #4 in this patch? The patch #4 just add > the definition of node_demotion and the function, then the function is > changed in this patch, and the function is not used by anyone between > the adding and changing. I really wanted to highlight that the locking scheme and the READ_ONCE() (or lack thereof) was specifically due to how node_demotion[] was being updated. The READ_ONCE() is not, for instance, inherent to the data structure. ... >> +/* >> + * When memory fills up on a node, memory contents can be >> + * automatically migrated to another node instead of >> + * discarded at reclaim. >> + * >> + * Establish a "migration path" which will start at nodes >> + * with CPUs and will follow the priorities used to build the >> + * page allocator zonelists. >> + * >> + * The difference here is that cycles must be avoided. If >> + * node0 migrates to node1, then neither node1, nor anything >> + * node1 migrates to can migrate to node0. >> + * >> + * This function can run simultaneously with readers of >> + * node_demotion[]. However, it can not run simultaneously >> + * with itself. 
Exclusion is provided by memory hotplug events >> + * being single-threaded. > > Maybe an example diagram for the physical topology and how the > migration target is generated in the comment seems helpful to > understand the code. Sure. Were you thinking of a code comment, or enhanced changelog? Let's say there's a system with two sockets each with the same three classes of memory: fast, medium and slow. Each memory class is placed in its own NUMA node and the CPUs are attached to the fast memory. That leaves 6 NUMA nodes (0-5): Socket A: 0, 1, 2 Socket B: 3, 4, 5 The migration path for this configuration would start on the nodes with the processors and fast memory, progress through medium and end with the slow memory: 0 -> 1 -> 2 -> stop 3 -> 4 -> 5 -> stop This is represented in the node_demotion[] like this: { 1, // Node 0 migrates to 1 2, // Node 1 migrates to 2 -1, // Node 2 does not migrate 4, // Node 3 migrates to 4 5, // Node 4 migrates to 5 -1} // Node 5 does not migrate Is that what you were thinking of? ... >> +again: >> + this_pass = next_pass; >> + next_pass = NODE_MASK_NONE; >> + /* >> + * To avoid cycles in the migration "graph", ensure >> + * that migration sources are not future targets by >> + * setting them in 'used_targets'. Do this only >> + * once per pass so that multiple source nodes can >> + * share a target node. >> + * >> + * 'used_targets' will become unavailable in future >> + * passes. This limits some opportunities for >> + * multiple source nodes to share a desintation. > > s/desination/destination Fixed, thanks. >> + */ >> + nodes_or(used_targets, used_targets, this_pass); >> + for_each_node_mask(node, this_pass) { >> + int target_node = establish_migrate_target(node, &used_targets); >> + >> + if (target_node == NUMA_NO_NODE) >> + continue; >> + >> + /* Visit targets from this pass in the next pass: */ >> + node_set(target_node, next_pass); >> + } >> + /* Is another pass necessary? 
*/ >> + if (!nodes_empty(next_pass)) >> + goto again; >> +} >> + >> +void set_migration_target_nodes(void) > > It seems this function is not called outside migrate.c, so it should be static. Fixed, thanks.
On Mon, Feb 01, 2021 at 11:13:14AM -0800, Dave Hansen wrote: > Sure. Were you thinking of a code comment, or enhanced changelog? > > Let's say there's a system with two sockets each with the same three > classes of memory: fast, medium and slow. Each memory class is placed > in its own NUMA node and the CPUs are attached to the fast memory. That > leaves 6 NUMA nodes (0-5): > > Socket A: 0, 1, 2 > Socket B: 3, 4, 5 > > The migration path for this configuration would start on the nodes > with the processors and fast memory, progress through medium and end > with the slow memory: > > 0 -> 1 -> 2 -> stop > 3 -> 4 -> 5 -> stop > > This is represented in the node_demotion[] like this: > > { 1, // Node 0 migrates to 1 > 2, // Node 1 migrates to 2 > -1, // Node 2 does not migrate > 4, // Node 3 migrates to 4 > 5, // Node 4 migrates to 5 > -1} // Node 5 does not migrate > > Is that what you were thinking of? I would not mind having the above explanation in a comment somewhere in the code.
On Mon, Feb 1, 2021 at 11:13 AM Dave Hansen <dave.hansen@intel.com> wrote: > > On 1/29/21 12:46 PM, Yang Shi wrote: > ... > >> int next_demotion_node(int node) > >> { > >> - return node_demotion[node]; > >> + /* > >> + * node_demotion[] is updated without excluding > >> + * this function from running. READ_ONCE() avoids > >> + * reading multiple, inconsistent 'node' values > >> + * during an update. > >> + */ > > > > Don't we need a smp_rmb() here? The single write barrier might be not > > enough in migration target set. Typically a write barrier should be > > used in pairs with a read barrier. > > I don't think we need one, practically. > > Since there is no locking against node_demotion[] updates, although a > smp_rmb() would ensure that this read is up-to-date, it could change > freely after the smp_rmb(). Yes, but this should be able to guarantee we see "disable + after" state. Isn't it more preferred? > > In other words, smp_rmb() would shrink the window where a "stale" read > could occur but would not eliminate it. > > >> + return READ_ONCE(node_demotion[node]); > > > > Why not consolidate the patch #4 in this patch? The patch #4 just add > > the definition of node_demotion and the function, then the function is > > changed in this patch, and the function is not used by anyone between > > the adding and changing. > > I really wanted to highlight that the locking scheme and the READ_ONCE() > (or lack thereof) was specifically due to how node_demotion[] was being > updated. > > The READ_ONCE() is not, for instance, inherent to the data structure. > > ... > >> +/* > >> + * When memory fills up on a node, memory contents can be > >> + * automatically migrated to another node instead of > >> + * discarded at reclaim. > >> + * > >> + * Establish a "migration path" which will start at nodes > >> + * with CPUs and will follow the priorities used to build the > >> + * page allocator zonelists. > >> + * > >> + * The difference here is that cycles must be avoided. 
If > >> + * node0 migrates to node1, then neither node1, nor anything > >> + * node1 migrates to can migrate to node0. > >> + * > >> + * This function can run simultaneously with readers of > >> + * node_demotion[]. However, it can not run simultaneously > >> + * with itself. Exclusion is provided by memory hotplug events > >> + * being single-threaded. > > > > Maybe an example diagram for the physical topology and how the > > migration target is generated in the comment seems helpful to > > understand the code. > > Sure. Were you thinking of a code comment, or enhanced changelog? I'd prefer a code comment. > > Let's say there's a system with two sockets each with the same three > classes of memory: fast, medium and slow. Each memory class is placed > in its own NUMA node and the CPUs are attached to the fast memory. That > leaves 6 NUMA nodes (0-5): > > Socket A: 0, 1, 2 > Socket B: 3, 4, 5 > > The migration path for this configuration path would start on the nodes > with the processors and fast memory, progress through medium and end > with the slow memory: > > 0 -> 1 -> 2 -> stop > 3 -> 4 -> 5 -> stop > > This is represented in the node_demotion[] like this: > > { 1, // Node 0 migrates to 1 > 2, // Node 1 migrates to 2 > -1, // Node 2 does not migrate > 4, // Node 3 migrates to 1 > 5, // Node 4 migrates to 2 > -1} // Node 5 does not migrate > > Is that what you were thinking of? Perfect. > > ... > >> +again: > >> + this_pass = next_pass; > >> + next_pass = NODE_MASK_NONE; > >> + /* > >> + * To avoid cycles in the migration "graph", ensure > >> + * that migration sources are not future targets by > >> + * setting them in 'used_targets'. Do this only > >> + * once per pass so that multiple source nodes can > >> + * share a target node. > >> + * > >> + * 'used_targets' will become unavailable in future > >> + * passes. This limits some opportunities for > >> + * multiple source nodes to share a desintation. > > > > s/desination/destination > > Fixed, thanks. 
> > >> + */ > >> + nodes_or(used_targets, used_targets, this_pass); > >> + for_each_node_mask(node, this_pass) { > >> + int target_node = establish_migrate_target(node, &used_targets); > >> + > >> + if (target_node == NUMA_NO_NODE) > >> + continue; > >> + > >> + /* Visit targets from this pass in the next pass: */ > >> + node_set(target_node, next_pass); > >> + } > >> + /* Is another pass necessary? */ > >> + if (!nodes_empty(next_pass)) > >> + goto again; > >> +} > >> + > >> +void set_migration_target_nodes(void) > > > > It seems this function is not called outside migrate.c, so it should be static. > > Fixed, thanks.
On 2/2/21 9:46 AM, Yang Shi wrote: > On Mon, Feb 1, 2021 at 11:13 AM Dave Hansen <dave.hansen@intel.com> wrote: >> On 1/29/21 12:46 PM, Yang Shi wrote: >> ... >>>> int next_demotion_node(int node) >>>> { >>>> - return node_demotion[node]; >>>> + /* >>>> + * node_demotion[] is updated without excluding >>>> + * this function from running. READ_ONCE() avoids >>>> + * reading multiple, inconsistent 'node' values >>>> + * during an update. >>>> + */ >>> Don't we need a smp_rmb() here? The single write barrier might be not >>> enough in migration target set. Typically a write barrier should be >>> used in pairs with a read barrier. >> I don't think we need one, practically. >> >> Since there is no locking against node_demotion[] updates, although a >> smp_rmb() would ensure that this read is up-to-date, it could change >> freely after the smp_rmb(). > Yes, but this should be able to guarantee we see "disable + after" > state. Isn't it more preferred? I'm debating how much of this is theoretical versus actually applicable to what we have in the kernel. But, I'm generally worried about code like this that *looks* innocuous: int terminal_node = start_node; int next_node = next_demotion_node(start_node); while (next_node != NUMA_NO_NODE) { terminal_node = next_node; next_node = next_demotion_node(terminal_node); } That could loop forever if it doesn't go out to memory during each loop. However, if node_demotion[] *is* read on every trip through the loop, it will eventually terminate. READ_ONCE() can guarantee that, as could compiler barriers like smp_rmb(). But, after staring at it for a while, I think RCU may be the most clearly correct way to solve the problem. Or, maybe just throw in the towel and do a spinlock like a normal human being. :) Anyway, here's what I was thinking I'd do with RCU: 1. node_demotion[] starts off in a "before" state 2. 
Writers to node_demotion[] first set the whole array such that it will not induce cycles, like setting every member to NUMA_NO_NODE. (the "disable" state) 3. Writer calls synchronize_rcu(). After it returns, no readers can observe the "before" values. 4. Writer sets the actual values it wants. (the "after" state) 5. Readers use rcu_read_lock() over any critical section where they read the array. They are guaranteed to only see one of the two adjacent states (before+disabled, or disabled+after), but never before+after within one RCU read-side critical section. 6. Readers use READ_ONCE() or some other compiler directive to ensure the compiler does not reorder or combine reads from multiple, adjacent RCU read-side critical sections. Although, after writing this, plain old locks are sounding awfully tempting.
On Tue, Feb 2, 2021 at 4:43 PM Dave Hansen <dave.hansen@intel.com> wrote: > > On 2/2/21 9:46 AM, Yang Shi wrote: > > On Mon, Feb 1, 2021 at 11:13 AM Dave Hansen <dave.hansen@intel.com> wrote: > >> On 1/29/21 12:46 PM, Yang Shi wrote: > >> ... > >>>> int next_demotion_node(int node) > >>>> { > >>>> - return node_demotion[node]; > >>>> + /* > >>>> + * node_demotion[] is updated without excluding > >>>> + * this function from running. READ_ONCE() avoids > >>>> + * reading multiple, inconsistent 'node' values > >>>> + * during an update. > >>>> + */ > >>> Don't we need a smp_rmb() here? The single write barrier might be not > >>> enough in migration target set. Typically a write barrier should be > >>> used in pairs with a read barrier. > >> I don't think we need one, practically. > >> > >> Since there is no locking against node_demotion[] updates, although a > >> smp_rmb() would ensure that this read is up-to-date, it could change > >> freely after the smp_rmb(). > > Yes, but this should be able to guarantee we see "disable + after" > > state. Isn't it more preferred? > > I'm debating how much of this is theoretical versus actually applicable > to what we have in the kernel. But, I'm generally worried about code > like this that *looks* innocuous: > > int terminal_node = start_node; > int next_node = next_demotion_node(start_node); > while (next_node != NUMA_NO_NODE) { > terminal_node = next_node; > next_node = next_demotion_node(terminal_node); > } > > That could loop forever if it doesn't go out to memory during each loop. > > However, if node_demotion[] *is* read on every trip through the loop, it > will eventually terminate. READ_ONCE() can guarantee that, as could > compiler barriers like smp_rmb(). > > But, after staring at it for a while, I think RCU may be the most > clearly correct way to solve the problem. Or, maybe just throw in the > towel and do a spinlock like a normal human being. :) > > Anyway, here's what I was thinking I'd do with RCU: > > 1. 
node_demotion[] starts off in a "before" state > 2. Writers to node_demotion[] first set the whole array such that > it will not induce cycles, like setting every member to > NUMA_NO_NODE. (the "disable" state) > 3. Writer calls synchronize_rcu(). After it returns, no readers can > observe the "before" values. > 4. Writer sets the actual values it wants. (the "after" state) > 5. Readers use rcu_read_lock() over any critical section where they > read the array. They are guaranteed to only see one of the two > adjacent states (before+disabled, or disabled+after), but never > before+after within one RCU read-side critical section. > 6. Readers use READ_ONCE() or some other compiler directive to ensure > the compiler does not reorder or combine reads from multiple, > adjacent RCU read-side critical sections. Makes sense to me. > > Although, after writing this, plain old locks are sounding awfully tempting.