diff mbox series

[RFC,05/13] mm/numa: automatically generate node migration order

Message ID 20210126003421.45897BF4@viggo.jf.intel.com (mailing list archive)
State New
Headers show
Series Migrate Pages in lieu of discard | expand

Commit Message

Dave Hansen Jan. 26, 2021, 12:34 a.m. UTC
From: Dave Hansen <dave.hansen@linux.intel.com>

When memory fills up on a node, memory contents can be
automatically migrated to another node.  The biggest problems are
knowing when to migrate and to where the migration should be
targeted.

The most straightforward way to generate the "to where" list
would be to follow the page allocator fallback lists.  Those
lists already tell us if memory is full where to look next.  It
would also be logical to move memory in that order.

But, the allocator fallback lists have a fatal flaw: most nodes
appear in all the lists.  This would potentially lead to
migration cycles (A->B, B->A, A->B, ...).

Instead of using the allocator fallback lists directly, keep a
separate node migration ordering.  But, reuse the same data used
to generate page allocator fallback in the first place:
find_next_best_node().

This means that the firmware data used to populate node distances
essentially dictates the ordering for now.  It should also be
architecture-neutral since all NUMA architectures have a working
find_next_best_node().

The protocol for node_demotion[] access and writing is not
standard.  It has no specific locking and is intended to be read
locklessly.  Readers must take care to avoid observing changes
that appear incoherent.  This was done so that node_demotion[]
locking has no chance of becoming a bottleneck on large systems
with lots of CPUs in direct reclaim.

This code is unused for now.  It will be called later in the
series.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: osalvador <osalvador@suse.de>
---

 b/mm/internal.h   |    5 +
 b/mm/migrate.c    |  137 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 b/mm/page_alloc.c |    2 
 3 files changed, 142 insertions(+), 2 deletions(-)

Comments

Yang Shi Jan. 29, 2021, 8:46 p.m. UTC | #1
On Mon, Jan 25, 2021 at 4:41 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> When memory fills up on a node, memory contents can be
> automatically migrated to another node.  The biggest problems are
> knowing when to migrate and to where the migration should be
> targeted.
>
> The most straightforward way to generate the "to where" list
> would be to follow the page allocator fallback lists.  Those
> lists already tell us if memory is full where to look next.  It
> would also be logical to move memory in that order.
>
> But, the allocator fallback lists have a fatal flaw: most nodes
> appear in all the lists.  This would potentially lead to
> migration cycles (A->B, B->A, A->B, ...).
>
> Instead of using the allocator fallback lists directly, keep a
> separate node migration ordering.  But, reuse the same data used
> to generate page allocator fallback in the first place:
> find_next_best_node().
>
> This means that the firmware data used to populate node distances
> essentially dictates the ordering for now.  It should also be
> architecture-neutral since all NUMA architectures have a working
> find_next_best_node().
>
> The protocol for node_demotion[] access and writing is not
> standard.  It has no specific locking and is intended to be read
> locklessly.  Readers must take care to avoid observing changes
> that appear incoherent.  This was done so that node_demotion[]
> locking has no chance of becoming a bottleneck on large systems
> with lots of CPUs in direct reclaim.
>
> This code is unused for now.  It will be called later in the
> series.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: osalvador <osalvador@suse.de>
> ---
>
>  b/mm/internal.h   |    5 +
>  b/mm/migrate.c    |  137 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  b/mm/page_alloc.c |    2
>  3 files changed, 142 insertions(+), 2 deletions(-)
>
> diff -puN mm/internal.h~auto-setup-default-migration-path-from-firmware mm/internal.h
> --- a/mm/internal.h~auto-setup-default-migration-path-from-firmware     2021-01-25 16:23:10.607866706 -0800
> +++ b/mm/internal.h     2021-01-25 16:23:10.616866706 -0800
> @@ -515,12 +515,17 @@ static inline void mminit_validate_memmo
>
>  #ifdef CONFIG_NUMA
>  extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
> +extern int find_next_best_node(int node, nodemask_t *used_node_mask);
>  #else
>  static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
>                                 unsigned int order)
>  {
>         return NODE_RECLAIM_NOSCAN;
>  }
> +static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
> +{
> +       return NUMA_NO_NODE;
> +}
>  #endif
>
>  extern int hwpoison_filter(struct page *p);
> diff -puN mm/migrate.c~auto-setup-default-migration-path-from-firmware mm/migrate.c
> --- a/mm/migrate.c~auto-setup-default-migration-path-from-firmware      2021-01-25 16:23:10.609866706 -0800
> +++ b/mm/migrate.c      2021-01-25 16:23:10.617866706 -0800
> @@ -1161,6 +1161,10 @@ out:
>         return rc;
>  }
>
> +/*
> + * Writes to this array occur without locking.  READ_ONCE()
> + * is recommended for readers to ensure consistent reads.
> + */
>  static int node_demotion[MAX_NUMNODES] = {[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
>
>  /**
> @@ -1174,7 +1178,13 @@ static int node_demotion[MAX_NUMNODES] =
>   */
>  int next_demotion_node(int node)
>  {
> -       return node_demotion[node];
> +       /*
> +        * node_demotion[] is updated without excluding
> +        * this function from running.  READ_ONCE() avoids
> +        * reading multiple, inconsistent 'node' values
> +        * during an update.
> +        */

Don't we need a smp_rmb() here? The single write barrier might be not
enough in migration target set. Typically a write barrier should be
used in pairs with a read barrier.

> +       return READ_ONCE(node_demotion[node]);

Why not consolidate the patch #4 in this patch? The patch #4 just add
the definition of node_demotion and the function, then the function is
changed in this patch, and the function is not used by anyone between
the adding and changing.

>  }
>
>  /*
> @@ -3124,3 +3134,128 @@ void migrate_vma_finalize(struct migrate
>  }
>  EXPORT_SYMBOL(migrate_vma_finalize);
>  #endif /* CONFIG_DEVICE_PRIVATE */
> +
> +/* Disable reclaim-based migration. */
> +static void disable_all_migrate_targets(void)
> +{
> +       int node;
> +
> +       for_each_online_node(node)
> +               node_demotion[node] = NUMA_NO_NODE;
> +}
> +
> +/*
> + * Find an automatic demotion target for 'node'.
> + * Failing here is OK.  It might just indicate
> + * being at the end of a chain.
> + */
> +static int establish_migrate_target(int node, nodemask_t *used)
> +{
> +       int migration_target;
> +
> +       /*
> +        * Can not set a migration target on a
> +        * node with it already set.
> +        *
> +        * No need for READ_ONCE() here since this
> +        * in the write path for node_demotion[].
> +        * This should be the only thread writing.
> +        */
> +       if (node_demotion[node] != NUMA_NO_NODE)
> +               return NUMA_NO_NODE;
> +
> +       migration_target = find_next_best_node(node, used);
> +       if (migration_target == NUMA_NO_NODE)
> +               return NUMA_NO_NODE;
> +
> +       node_demotion[node] = migration_target;
> +
> +       return migration_target;
> +}
> +
> +/*
> + * When memory fills up on a node, memory contents can be
> + * automatically migrated to another node instead of
> + * discarded at reclaim.
> + *
> + * Establish a "migration path" which will start at nodes
> + * with CPUs and will follow the priorities used to build the
> + * page allocator zonelists.
> + *
> + * The difference here is that cycles must be avoided.  If
> + * node0 migrates to node1, then neither node1, nor anything
> + * node1 migrates to can migrate to node0.
> + *
> + * This function can run simultaneously with readers of
> + * node_demotion[].  However, it can not run simultaneously
> + * with itself.  Exclusion is provided by memory hotplug events
> + * being single-threaded.

Maybe an example diagram for the physical topology and how the
migration target is generated in the comment seems helpful to
understand the code.

> + */
> +void __set_migration_target_nodes(void)
> +{
> +       nodemask_t next_pass    = NODE_MASK_NONE;
> +       nodemask_t this_pass    = NODE_MASK_NONE;
> +       nodemask_t used_targets = NODE_MASK_NONE;
> +       int node;
> +
> +       /*
> +        * Avoid any oddities like cycles that could occur
> +        * from changes in the topology.  This will leave
> +        * a momentary gap when migration is disabled.
> +        */
> +       disable_all_migrate_targets();
> +
> +       /*
> +        * Ensure that the "disable" is visible across the system.
> +        * Readers will see either a combination of before+disable
> +        * state or disable+after.  They will never see before and
> +        * after state together.
> +        *
> +        * The before+after state together might have cycles and
> +        * could cause readers to do things like loop until this
> +        * function finishes.  This ensures they can only see a
> +        * single "bad" read and would, for instance, only loop
> +        * once.
> +        */
> +       smp_wmb();
> +
> +       /*
> +        * Allocations go close to CPUs, first.  Assume that
> +        * the migration path starts at the nodes with CPUs.
> +        */
> +       next_pass = node_states[N_CPU];
> +again:
> +       this_pass = next_pass;
> +       next_pass = NODE_MASK_NONE;
> +       /*
> +        * To avoid cycles in the migration "graph", ensure
> +        * that migration sources are not future targets by
> +        * setting them in 'used_targets'.  Do this only
> +        * once per pass so that multiple source nodes can
> +        * share a target node.
> +        *
> +        * 'used_targets' will become unavailable in future
> +        * passes.  This limits some opportunities for
> +        * multiple source nodes to share a desintation.

s/desination/destination

> +        */
> +       nodes_or(used_targets, used_targets, this_pass);
> +       for_each_node_mask(node, this_pass) {
> +               int target_node = establish_migrate_target(node, &used_targets);
> +
> +               if (target_node == NUMA_NO_NODE)
> +                       continue;
> +
> +               /* Visit targets from this pass in the next pass: */
> +               node_set(target_node, next_pass);
> +       }
> +       /* Is another pass necessary? */
> +       if (!nodes_empty(next_pass))
> +               goto again;
> +}
> +
> +void set_migration_target_nodes(void)

It seems this function is not called outside migrate.c, so it should be static.

> +{
> +       get_online_mems();
> +       __set_migration_target_nodes();
> +       put_online_mems();
> +}
> diff -puN mm/page_alloc.c~auto-setup-default-migration-path-from-firmware mm/page_alloc.c
> --- a/mm/page_alloc.c~auto-setup-default-migration-path-from-firmware   2021-01-25 16:23:10.612866706 -0800
> +++ b/mm/page_alloc.c   2021-01-25 16:23:10.619866706 -0800
> @@ -5704,7 +5704,7 @@ static int node_load[MAX_NUMNODES];
>   *
>   * Return: node id of the found node or %NUMA_NO_NODE if no node is found.
>   */
> -static int find_next_best_node(int node, nodemask_t *used_node_mask)
> +int find_next_best_node(int node, nodemask_t *used_node_mask)
>  {
>         int n, val;
>         int min_val = INT_MAX;
> _
>
Dave Hansen Feb. 1, 2021, 7:13 p.m. UTC | #2
On 1/29/21 12:46 PM, Yang Shi wrote:
...
>>  int next_demotion_node(int node)
>>  {
>> -       return node_demotion[node];
>> +       /*
>> +        * node_demotion[] is updated without excluding
>> +        * this function from running.  READ_ONCE() avoids
>> +        * reading multiple, inconsistent 'node' values
>> +        * during an update.
>> +        */
> 
> Don't we need a smp_rmb() here? The single write barrier might be not
> enough in migration target set. Typically a write barrier should be
> used in pairs with a read barrier.

I don't think we need one, practically.

Since there is no locking against node_demotion[] updates, although a
smp_rmb() would ensure that this read is up-to-date, it could change
freely after the smp_rmb().

In other words, smp_rmb() would shrink the window where a "stale" read
could occur but would not eliminate it.

>> +       return READ_ONCE(node_demotion[node]);
> 
> Why not consolidate the patch #4 in this patch? The patch #4 just add
> the definition of node_demotion and the function, then the function is
> changed in this patch, and the function is not used by anyone between
> the adding and changing.

I really wanted to highlight that the locking scheme and the READ_ONCE()
(or lack thereof) was specifically due to how node_demotion[] was being
updated.

The READ_ONCE() is not, for instance, inherent to the data structure.

...
>> +/*
>> + * When memory fills up on a node, memory contents can be
>> + * automatically migrated to another node instead of
>> + * discarded at reclaim.
>> + *
>> + * Establish a "migration path" which will start at nodes
>> + * with CPUs and will follow the priorities used to build the
>> + * page allocator zonelists.
>> + *
>> + * The difference here is that cycles must be avoided.  If
>> + * node0 migrates to node1, then neither node1, nor anything
>> + * node1 migrates to can migrate to node0.
>> + *
>> + * This function can run simultaneously with readers of
>> + * node_demotion[].  However, it can not run simultaneously
>> + * with itself.  Exclusion is provided by memory hotplug events
>> + * being single-threaded.
> 
> Maybe an example diagram for the physical topology and how the
> migration target is generated in the comment seems helpful to
> understand the code.

Sure.  Were you thinking of a code comment, or enhanced changelog?

Let's say there's a system with two sockets each with the same three
classes of memory: fast, medium and slow.  Each memory class is placed
in its own NUMA node and the CPUs are attached to the fast memory.  That
leaves 6 NUMA nodes (0-5):

	Socket A: 0, 1, 2
	Socket B: 3, 4, 5

The migration path for this configuration path would start on the nodes
with the processors and fast memory, progress through medium and end
with the slow memory:

	0 -> 1 -> 2 -> stop
	3 -> 4 -> 5 -> stop

This is represented in the node_demotion[] like this:

	{  1, // Node 0 migrates to 1
	   2, // Node 1 migrates to 2
	  -1, // Node 2 does not migrate
	   4, // Node 3 migrates to 1
	   5, // Node 4 migrates to 2
	  -1} // Node 5 does not migrate

Is that what you were thinking of?

...
>> +again:
>> +       this_pass = next_pass;
>> +       next_pass = NODE_MASK_NONE;
>> +       /*
>> +        * To avoid cycles in the migration "graph", ensure
>> +        * that migration sources are not future targets by
>> +        * setting them in 'used_targets'.  Do this only
>> +        * once per pass so that multiple source nodes can
>> +        * share a target node.
>> +        *
>> +        * 'used_targets' will become unavailable in future
>> +        * passes.  This limits some opportunities for
>> +        * multiple source nodes to share a desintation.
> 
> s/desination/destination

Fixed, thanks.

>> +        */
>> +       nodes_or(used_targets, used_targets, this_pass);
>> +       for_each_node_mask(node, this_pass) {
>> +               int target_node = establish_migrate_target(node, &used_targets);
>> +
>> +               if (target_node == NUMA_NO_NODE)
>> +                       continue;
>> +
>> +               /* Visit targets from this pass in the next pass: */
>> +               node_set(target_node, next_pass);
>> +       }
>> +       /* Is another pass necessary? */
>> +       if (!nodes_empty(next_pass))
>> +               goto again;
>> +}
>> +
>> +void set_migration_target_nodes(void)
> 
> It seems this function is not called outside migrate.c, so it should be static.

Fixed, thanks.
Oscar Salvador Feb. 2, 2021, 11:43 a.m. UTC | #3
On Mon, Feb 01, 2021 at 11:13:14AM -0800, Dave Hansen wrote:
> Sure.  Were you thinking of a code comment, or enhanced changelog?
> 
> Let's say there's a system with two sockets each with the same three
> classes of memory: fast, medium and slow.  Each memory class is placed
> in its own NUMA node and the CPUs are attached to the fast memory.  That
> leaves 6 NUMA nodes (0-5):
> 
> 	Socket A: 0, 1, 2
> 	Socket B: 3, 4, 5
> 
> The migration path for this configuration path would start on the nodes
> with the processors and fast memory, progress through medium and end
> with the slow memory:
> 
> 	0 -> 1 -> 2 -> stop
> 	3 -> 4 -> 5 -> stop
> 
> This is represented in the node_demotion[] like this:
> 
> 	{  1, // Node 0 migrates to 1
> 	   2, // Node 1 migrates to 2
> 	  -1, // Node 2 does not migrate
> 	   4, // Node 3 migrates to 1
> 	   5, // Node 4 migrates to 2
> 	  -1} // Node 5 does not migrate
> 
> Is that what you were thinking of?

I would not mind to have the above explanation in a comment somewhere
in the code.
Yang Shi Feb. 2, 2021, 5:46 p.m. UTC | #4
On Mon, Feb 1, 2021 at 11:13 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/29/21 12:46 PM, Yang Shi wrote:
> ...
> >>  int next_demotion_node(int node)
> >>  {
> >> -       return node_demotion[node];
> >> +       /*
> >> +        * node_demotion[] is updated without excluding
> >> +        * this function from running.  READ_ONCE() avoids
> >> +        * reading multiple, inconsistent 'node' values
> >> +        * during an update.
> >> +        */
> >
> > Don't we need a smp_rmb() here? The single write barrier might be not
> > enough in migration target set. Typically a write barrier should be
> > used in pairs with a read barrier.
>
> I don't think we need one, practically.
>
> Since there is no locking against node_demotion[] updates, although a
> smp_rmb() would ensure that this read is up-to-date, it could change
> freely after the smp_rmb().

Yes, but this should be able to guarantee we see "disable + after"
state. Isn't it more preferred?

>
> In other words, smp_rmb() would shrink the window where a "stale" read
> could occur but would not eliminate it.
>
> >> +       return READ_ONCE(node_demotion[node]);
> >
> > Why not consolidate the patch #4 in this patch? The patch #4 just add
> > the definition of node_demotion and the function, then the function is
> > changed in this patch, and the function is not used by anyone between
> > the adding and changing.
>
> I really wanted to highlight that the locking scheme and the READ_ONCE()
> (or lack thereof) was specifically due to how node_demotion[] was being
> updated.
>
> The READ_ONCE() is not, for instance, inherent to the data structure.
>
> ...
> >> +/*
> >> + * When memory fills up on a node, memory contents can be
> >> + * automatically migrated to another node instead of
> >> + * discarded at reclaim.
> >> + *
> >> + * Establish a "migration path" which will start at nodes
> >> + * with CPUs and will follow the priorities used to build the
> >> + * page allocator zonelists.
> >> + *
> >> + * The difference here is that cycles must be avoided.  If
> >> + * node0 migrates to node1, then neither node1, nor anything
> >> + * node1 migrates to can migrate to node0.
> >> + *
> >> + * This function can run simultaneously with readers of
> >> + * node_demotion[].  However, it can not run simultaneously
> >> + * with itself.  Exclusion is provided by memory hotplug events
> >> + * being single-threaded.
> >
> > Maybe an example diagram for the physical topology and how the
> > migration target is generated in the comment seems helpful to
> > understand the code.
>
> Sure.  Were you thinking of a code comment, or enhanced changelog?

I'd prefer a code comment.

>
> Let's say there's a system with two sockets each with the same three
> classes of memory: fast, medium and slow.  Each memory class is placed
> in its own NUMA node and the CPUs are attached to the fast memory.  That
> leaves 6 NUMA nodes (0-5):
>
>         Socket A: 0, 1, 2
>         Socket B: 3, 4, 5
>
> The migration path for this configuration path would start on the nodes
> with the processors and fast memory, progress through medium and end
> with the slow memory:
>
>         0 -> 1 -> 2 -> stop
>         3 -> 4 -> 5 -> stop
>
> This is represented in the node_demotion[] like this:
>
>         {  1, // Node 0 migrates to 1
>            2, // Node 1 migrates to 2
>           -1, // Node 2 does not migrate
>            4, // Node 3 migrates to 1
>            5, // Node 4 migrates to 2
>           -1} // Node 5 does not migrate
>
> Is that what you were thinking of?

Perfect.

>
> ...
> >> +again:
> >> +       this_pass = next_pass;
> >> +       next_pass = NODE_MASK_NONE;
> >> +       /*
> >> +        * To avoid cycles in the migration "graph", ensure
> >> +        * that migration sources are not future targets by
> >> +        * setting them in 'used_targets'.  Do this only
> >> +        * once per pass so that multiple source nodes can
> >> +        * share a target node.
> >> +        *
> >> +        * 'used_targets' will become unavailable in future
> >> +        * passes.  This limits some opportunities for
> >> +        * multiple source nodes to share a desintation.
> >
> > s/desination/destination
>
> Fixed, thanks.
>
> >> +        */
> >> +       nodes_or(used_targets, used_targets, this_pass);
> >> +       for_each_node_mask(node, this_pass) {
> >> +               int target_node = establish_migrate_target(node, &used_targets);
> >> +
> >> +               if (target_node == NUMA_NO_NODE)
> >> +                       continue;
> >> +
> >> +               /* Visit targets from this pass in the next pass: */
> >> +               node_set(target_node, next_pass);
> >> +       }
> >> +       /* Is another pass necessary? */
> >> +       if (!nodes_empty(next_pass))
> >> +               goto again;
> >> +}
> >> +
> >> +void set_migration_target_nodes(void)
> >
> > It seems this function is not called outside migrate.c, so it should be static.
>
> Fixed, thanks.
Dave Hansen Feb. 3, 2021, 12:43 a.m. UTC | #5
On 2/2/21 9:46 AM, Yang Shi wrote:
> On Mon, Feb 1, 2021 at 11:13 AM Dave Hansen <dave.hansen@intel.com> wrote:
>> On 1/29/21 12:46 PM, Yang Shi wrote:
>> ...
>>>>  int next_demotion_node(int node)
>>>>  {
>>>> -       return node_demotion[node];
>>>> +       /*
>>>> +        * node_demotion[] is updated without excluding
>>>> +        * this function from running.  READ_ONCE() avoids
>>>> +        * reading multiple, inconsistent 'node' values
>>>> +        * during an update.
>>>> +        */
>>> Don't we need a smp_rmb() here? The single write barrier might be not
>>> enough in migration target set. Typically a write barrier should be
>>> used in pairs with a read barrier.
>> I don't think we need one, practically.
>>
>> Since there is no locking against node_demotion[] updates, although a
>> smp_rmb() would ensure that this read is up-to-date, it could change
>> freely after the smp_rmb().
> Yes, but this should be able to guarantee we see "disable + after"
> state. Isn't it more preferred?

I'm debating how much of this is theoretical versus actually applicable
to what we have in the kernel.  But, I'm generally worried about code
like this that *looks* innocuous:

	int terminal_node = start_node;
	int next_node = next_demotion_node(start_node);
        while (next_node != NUMA_NO_NODE) {
		next_node = terminal_node;
                terminal_node = next_demotion_node(terminal_node);
        }

That could loop forever if it doesn't go out to memory during each loop.

However, if node_demotion[] *is* read on every trip through the loop, it
will eventually terminate.  READ_ONCE() can guarantee that, as could
compiler barriers like smp_rmb().

But, after staring at it for a while, I think RCU may be the most
clearly correct way to solve the problem.  Or, maybe just throw in the
towel and do a spinlock like a normal human being. :)

Anyway, here's what I was thinking I'd do with RCU:

 1. node_demotion[] starts off in a "before" state
 2. Writers to node_demotion[] first set the whole array such that
    it will not induce cycles, like setting every member to
    NUMA_NO_NODE. (the "disable" state)
 3. Writer calls synchronize_rcu().  After it returns, no readers can
    observe the "before" values.
 4. Writer sets the actual values it wants.  (the "after" state)
 5. Readers use rcu_read_lock() over any critical section where they
    read the array.  They are guaranteed to only see one of the two
    adjacent states (before+disabled, or disabled+after), but never
    before+after within one RCU read-side critical section.
 6. Readers use READ_ONCE() or some other compiler directive to ensure
    the compiler does not reorder or combine reads from multiple,
    adjacent RCU read-side critical sections.

Although, after writing this, plain old locks are sounding awfully tempting.
Yang Shi Feb. 4, 2021, 12:26 a.m. UTC | #6
On Tue, Feb 2, 2021 at 4:43 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 2/2/21 9:46 AM, Yang Shi wrote:
> > On Mon, Feb 1, 2021 at 11:13 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >> On 1/29/21 12:46 PM, Yang Shi wrote:
> >> ...
> >>>>  int next_demotion_node(int node)
> >>>>  {
> >>>> -       return node_demotion[node];
> >>>> +       /*
> >>>> +        * node_demotion[] is updated without excluding
> >>>> +        * this function from running.  READ_ONCE() avoids
> >>>> +        * reading multiple, inconsistent 'node' values
> >>>> +        * during an update.
> >>>> +        */
> >>> Don't we need a smp_rmb() here? The single write barrier might be not
> >>> enough in migration target set. Typically a write barrier should be
> >>> used in pairs with a read barrier.
> >> I don't think we need one, practically.
> >>
> >> Since there is no locking against node_demotion[] updates, although a
> >> smp_rmb() would ensure that this read is up-to-date, it could change
> >> freely after the smp_rmb().
> > Yes, but this should be able to guarantee we see "disable + after"
> > state. Isn't it more preferred?
>
> I'm debating how much of this is theoretical versus actually applicable
> to what we have in the kernel.  But, I'm generally worried about code
> like this that *looks* innocuous:
>
>         int terminal_node = start_node;
>         int next_node = next_demotion_node(start_node);
>         while (next_node != NUMA_NO_NODE) {
>                 next_node = terminal_node;
>                 terminal_node = next_demotion_node(terminal_node);
>         }
>
> That could loop forever if it doesn't go out to memory during each loop.
>
> However, if node_demotion[] *is* read on every trip through the loop, it
> will eventually terminate.  READ_ONCE() can guarantee that, as could
> compiler barriers like smp_rmb().
>
> But, after staring at it for a while, I think RCU may be the most
> clearly correct way to solve the problem.  Or, maybe just throw in the
> towel and do a spinlock like a normal human being. :)
>
> Anyway, here's what I was thinking I'd do with RCU:
>
>  1. node_demotion[] starts off in a "before" state
>  2. Writers to node_demotion[] first set the whole array such that
>     it will not induce cycles, like setting every member to
>     NUMA_NO_NODE. (the "disable" state)
>  3. Writer calls synchronize_rcu().  After it returns, no readers can
>     observe the "before" values.
>  4. Writer sets the actual values it wants.  (the "after" state)
>  5. Readers use rcu_read_lock() over any critical section where they
>     read the array.  They are guaranteed to only see one of the two
>     adjacent states (before+disabled, or disabled+after), but never
>     before+after within one RCU read-side critical section.
>  6. Readers use READ_ONCE() or some other compiler directive to ensure
>     the compiler does not reorder or combine reads from multiple,
>     adjacent RCU read-side critical sections.

Makes sense to me.

>
> Although, after writing this, plain old locks are sounding awfully tempting.
diff mbox series

Patch

diff -puN mm/internal.h~auto-setup-default-migration-path-from-firmware mm/internal.h
--- a/mm/internal.h~auto-setup-default-migration-path-from-firmware	2021-01-25 16:23:10.607866706 -0800
+++ b/mm/internal.h	2021-01-25 16:23:10.616866706 -0800
@@ -515,12 +515,17 @@  static inline void mminit_validate_memmo
 
 #ifdef CONFIG_NUMA
 extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
+extern int find_next_best_node(int node, nodemask_t *used_node_mask);
 #else
 static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
 				unsigned int order)
 {
 	return NODE_RECLAIM_NOSCAN;
 }
+static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
+{
+	return NUMA_NO_NODE;
+}
 #endif
 
 extern int hwpoison_filter(struct page *p);
diff -puN mm/migrate.c~auto-setup-default-migration-path-from-firmware mm/migrate.c
--- a/mm/migrate.c~auto-setup-default-migration-path-from-firmware	2021-01-25 16:23:10.609866706 -0800
+++ b/mm/migrate.c	2021-01-25 16:23:10.617866706 -0800
@@ -1161,6 +1161,10 @@  out:
 	return rc;
 }
 
+/*
+ * Writes to this array occur without locking.  READ_ONCE()
+ * is recommended for readers to ensure consistent reads.
+ */
 static int node_demotion[MAX_NUMNODES] = {[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
 
 /**
@@ -1174,7 +1178,13 @@  static int node_demotion[MAX_NUMNODES] =
  */
 int next_demotion_node(int node)
 {
-	return node_demotion[node];
+	/*
+	 * node_demotion[] is updated without excluding
+	 * this function from running.  READ_ONCE() avoids
+	 * reading multiple, inconsistent 'node' values
+	 * during an update.
+	 */
+	return READ_ONCE(node_demotion[node]);
 }
 
 /*
@@ -3124,3 +3134,128 @@  void migrate_vma_finalize(struct migrate
 }
 EXPORT_SYMBOL(migrate_vma_finalize);
 #endif /* CONFIG_DEVICE_PRIVATE */
+
+/* Disable reclaim-based migration. */
+static void disable_all_migrate_targets(void)
+{
+	int node;
+
+	for_each_online_node(node)
+		node_demotion[node] = NUMA_NO_NODE;
+}
+
+/*
+ * Find an automatic demotion target for 'node'.
+ * Failing here is OK.  It might just indicate
+ * being at the end of a chain.
+ */
+static int establish_migrate_target(int node, nodemask_t *used)
+{
+	int migration_target;
+
+	/*
+	 * Can not set a migration target on a
+	 * node with it already set.
+	 *
+	 * No need for READ_ONCE() here since this
+	 * in the write path for node_demotion[].
+	 * This should be the only thread writing.
+	 */
+	if (node_demotion[node] != NUMA_NO_NODE)
+		return NUMA_NO_NODE;
+
+	migration_target = find_next_best_node(node, used);
+	if (migration_target == NUMA_NO_NODE)
+		return NUMA_NO_NODE;
+
+	node_demotion[node] = migration_target;
+
+	return migration_target;
+}
+
+/*
+ * When memory fills up on a node, memory contents can be
+ * automatically migrated to another node instead of
+ * discarded at reclaim.
+ *
+ * Establish a "migration path" which will start at nodes
+ * with CPUs and will follow the priorities used to build the
+ * page allocator zonelists.
+ *
+ * The difference here is that cycles must be avoided.  If
+ * node0 migrates to node1, then neither node1, nor anything
+ * node1 migrates to can migrate to node0.
+ *
+ * This function can run simultaneously with readers of
+ * node_demotion[].  However, it can not run simultaneously
+ * with itself.  Exclusion is provided by memory hotplug events
+ * being single-threaded.
+ */
+void __set_migration_target_nodes(void)
+{
+	nodemask_t next_pass	= NODE_MASK_NONE;
+	nodemask_t this_pass	= NODE_MASK_NONE;
+	nodemask_t used_targets = NODE_MASK_NONE;
+	int node;
+
+	/*
+	 * Avoid any oddities like cycles that could occur
+	 * from changes in the topology.  This will leave
+	 * a momentary gap when migration is disabled.
+	 */
+	disable_all_migrate_targets();
+
+	/*
+	 * Ensure that the "disable" is visible across the system.
+	 * Readers will see either a combination of before+disable
+	 * state or disable+after.  They will never see before and
+	 * after state together.
+	 *
+	 * The before+after state together might have cycles and
+	 * could cause readers to do things like loop until this
+	 * function finishes.  This ensures they can only see a
+	 * single "bad" read and would, for instance, only loop
+	 * once.
+	 */
+	smp_wmb();
+
+	/*
+	 * Allocations go close to CPUs, first.  Assume that
+	 * the migration path starts at the nodes with CPUs.
+	 */
+	next_pass = node_states[N_CPU];
+again:
+	this_pass = next_pass;
+	next_pass = NODE_MASK_NONE;
+	/*
+	 * To avoid cycles in the migration "graph", ensure
+	 * that migration sources are not future targets by
+	 * setting them in 'used_targets'.  Do this only
+	 * once per pass so that multiple source nodes can
+	 * share a target node.
+	 *
+	 * 'used_targets' will become unavailable in future
+	 * passes.  This limits some opportunities for
+	 * multiple source nodes to share a desintation.
+	 */
+	nodes_or(used_targets, used_targets, this_pass);
+	for_each_node_mask(node, this_pass) {
+		int target_node = establish_migrate_target(node, &used_targets);
+
+		if (target_node == NUMA_NO_NODE)
+			continue;
+
+		/* Visit targets from this pass in the next pass: */
+		node_set(target_node, next_pass);
+	}
+	/* Is another pass necessary? */
+	if (!nodes_empty(next_pass))
+		goto again;
+}
+
+void set_migration_target_nodes(void)
+{
+	get_online_mems();
+	__set_migration_target_nodes();
+	put_online_mems();
+}
diff -puN mm/page_alloc.c~auto-setup-default-migration-path-from-firmware mm/page_alloc.c
--- a/mm/page_alloc.c~auto-setup-default-migration-path-from-firmware	2021-01-25 16:23:10.612866706 -0800
+++ b/mm/page_alloc.c	2021-01-25 16:23:10.619866706 -0800
@@ -5704,7 +5704,7 @@  static int node_load[MAX_NUMNODES];
  *
  * Return: node id of the found node or %NUMA_NO_NODE if no node is found.
  */
-static int find_next_best_node(int node, nodemask_t *used_node_mask)
+int find_next_best_node(int node, nodemask_t *used_node_mask)
 {
 	int n, val;
 	int min_val = INT_MAX;