diff mbox series

[03/10] mm/migrate: update node demotion order during on hotplug events

Message ID 20210401183221.977831DE@viggo.jf.intel.com (mailing list archive)
State New, archived
Headers show
Series Migrate Pages in lieu of discard | expand

Commit Message

Dave Hansen April 1, 2021, 6:32 p.m. UTC
From: Dave Hansen <dave.hansen@linux.intel.com>

Reclaim-based migration is attempting to optimize data placement in
memory based on the system topology.  If the system changes, so must
the migration ordering.

The implementation is conceptually simple and entirely unoptimized.
On any memory or CPU hotplug events, assume that a node was added or
removed and recalculate all migration targets.  This ensures that the
node_demotion[] array is always ready to be used in case the new
reclaim mode is enabled.

This recalculation is far from optimal, most glaringly that it does
not even attempt to figure out the hotplug event would have some
*actual* effect on the demotion order.  But, given the expected
paucity of hotplug events, this should be fine.

=== What does RCU provide? ===

Imaginge a simple loop which walks down the demotion path looking
for the last node:

        terminal_node = start_node;
        while (node_demotion[terminal_node] != NUMA_NO_NODE) {
                terminal_node = node_demotion[terminal_node];
        }

The initial values are:

        node_demotion[0] = 1;
        node_demotion[1] = NUMA_NO_NODE;

and are updated to:

        node_demotion[0] = NUMA_NO_NODE;
        node_demotion[1] = 0;

What guarantees that the loop did not observe:

        node_demotion[0] = 1;
        node_demotion[1] = 0;

and would loop forever?

With RCU, a rcu_read_lock/unlock() can be placed around the
loop.  Since the write side does a synchronize_rcu(), the loop
that observed the old contents is known to be complete after the
synchronize_rcu() has completed.

RCU, combined with disable_all_migrate_targets(), ensures that
the old migration state is not visible by the time
__set_migration_target_nodes() is called.

=== What does READ_ONCE() provide? ===

READ_ONCE() forbids the compiler from merging or reordering
successive reads of node_demotion[].  This ensures that any
updates are *eventually* observed.

Consider the above loop again.  The compiler could theoretically
read the entirety of node_demotion[] into local storage
(registers) and never go back to memory, and *permanently*
observe bad values for node_demotion[].

Note: RCU does not provide any universal compiler-ordering
guarantees:

	https://lore.kernel.org/lkml/20150921204327.GH4029@linux.vnet.ibm.com/

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: osalvador <osalvador@suse.de>

--

Changes since 20210302:
 * remove duplicate synchronize_rcu()
---

 b/mm/migrate.c |  152 ++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 129 insertions(+), 23 deletions(-)

Comments

Oscar Salvador April 8, 2021, 9:52 a.m. UTC | #1
On Thu, Apr 01, 2021 at 11:32:21AM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Reclaim-based migration is attempting to optimize data placement in
> memory based on the system topology.  If the system changes, so must
> the migration ordering.
> 
> The implementation is conceptually simple and entirely unoptimized.
> On any memory or CPU hotplug events, assume that a node was added or
> removed and recalculate all migration targets.  This ensures that the
> node_demotion[] array is always ready to be used in case the new
> reclaim mode is enabled.
> 
> This recalculation is far from optimal, most glaringly that it does
> not even attempt to figure out the hotplug event would have some
> *actual* effect on the demotion order.  But, given the expected
> paucity of hotplug events, this should be fine.
> 
> === What does RCU provide? ===
> 
> Imaginge a simple loop which walks down the demotion path looking
> for the last node:
> 
>         terminal_node = start_node;
>         while (node_demotion[terminal_node] != NUMA_NO_NODE) {
>                 terminal_node = node_demotion[terminal_node];
>         }
> 
> The initial values are:
> 
>         node_demotion[0] = 1;
>         node_demotion[1] = NUMA_NO_NODE;
> 
> and are updated to:
> 
>         node_demotion[0] = NUMA_NO_NODE;
>         node_demotion[1] = 0;
> 
> What guarantees that the loop did not observe:
> 
>         node_demotion[0] = 1;
>         node_demotion[1] = 0;
> 
> and would loop forever?
> 
> With RCU, a rcu_read_lock/unlock() can be placed around the
> loop.  Since the write side does a synchronize_rcu(), the loop
> that observed the old contents is known to be complete after the
> synchronize_rcu() has completed.
> 
> RCU, combined with disable_all_migrate_targets(), ensures that
> the old migration state is not visible by the time
> __set_migration_target_nodes() is called.
> 
> === What does READ_ONCE() provide? ===
> 
> READ_ONCE() forbids the compiler from merging or reordering
> successive reads of node_demotion[].  This ensures that any
> updates are *eventually* observed.
> 
> Consider the above loop again.  The compiler could theoretically
> read the entirety of node_demotion[] into local storage
> (registers) and never go back to memory, and *permanently*
> observe bad values for node_demotion[].
> 
> Note: RCU does not provide any universal compiler-ordering
> guarantees:
> 
> 	https://lore.kernel.org/lkml/20150921204327.GH4029@linux.vnet.ibm.com/
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Yang Shi <shy828301@gmail.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: osalvador <osalvador@suse.de>
> 

...
  
> +#if defined(CONFIG_MEMORY_HOTPLUG)

I am not really into PMEM, and I ignore whether we need
CONFIG_MEMORY_HOTPLUG in order to have such memory on the system.
If so, the following can be partly ignored.

I think that you either want to check CONFIG_MEMORY_HOTPLUG +
CONFIG_CPU_HOTPLUG, or just do not put it under any conf dependency.

The thing is that migrate_on_reclaim_init() will only be called if
we have CONFIG_MEMORY_HOTPLUG, and when we do not have that (but we do have
CONFIG_CPU_HOTPLUG) the calls to set_migration_target_nodes() wont't be
made when the system brings up the CPUs during the boot phase,
which means node_demotion[] list won't be initialized.

But this brings me to the next point.

From a conceptual point of view, I think you want to build the
node_demotion[] list, being orthogonal to it whether we support CPU Or
MEMORY hotplug.

Now, in case we support CPU or MEMORY hotplug, we do want to be able to re-build
the list for .e.g: in case NUMA nodes become cpu-less or memory-less.

On x86_64, CPU_HOTPLUG is enabled by default if SMP, the same for
MEMORY_HOTPLUG, but I am not sure about other archs.
Can we have !CPU_HOTPLUG && MEMORY_HOTPLUG, !MEMORY_HOTPLUG &&
CPU_HOTPLUG? I do now really know, but I think you should be careful
about that.

If this was my call, I would:

- Do not place the burden to initialize node_demotion[] list in CPU
  hotplug boot phase (or if so, be carefull because if I disable
  MEMORY_HOTPLUG, I end up with no demotion_list[])
- Diferentiate between migration_{online,offline}_cpu and
  migrate_on_reclaim_callback() and place them under their respective
  configs-dependency.

But I might be missing some details so I might be off somewhere.

Another thing that caught my eye is that we are calling
set_migration_target_nodes() for every CPU the system brings up at boot
phase. I know systems with *lots* of CPUs.
I am not sure whether we have a mechanism to delay that until all CPUs
that are meant to be online are online? (after boot?)
That's probably happening in wonderland, but was just speaking out loud.

(Of course the same happen with memory_hotplug acpi operations.
All it takes is some qemu-handling)
Oscar Salvador April 9, 2021, 10:14 a.m. UTC | #2
On Thu, Apr 08, 2021 at 11:52:51AM +0200, Oscar Salvador wrote:
> I am not really into PMEM, and I ignore whether we need
> CONFIG_MEMORY_HOTPLUG in order to have such memory on the system.
> If so, the following can be partly ignored.

Ok, I refreshed by memory with [1].
From that, it seems that in order to use PMEM as RAM we need CONFIG_MEMORY_HOTPLUG.
But is that always the case? Can happen that in some scenario PMEM comes ready
to use and we do not need the hotplug trick?

Anyway, I would still like to clarify the state of the HOTPLUG_CPU.
On x86_64, HOTPLUG_CPU and MEMORY_HOTPLUG are tied by SPM means, but on arm64
one can have MEMORY_HOTPLUG while not having picked HOTPLUG_CPU.

My point is that we might want to put the callback functions and the callback
registration for cpu-hotplug guarded by its own HOTPLUG_CPU instead of guarding it
in the same MEMORY_HOTPLUG block to make it more clear?
Oscar Salvador April 9, 2021, 10:15 a.m. UTC | #3
On Fri, Apr 09, 2021 at 12:14:04PM +0200, Oscar Salvador wrote:
> On Thu, Apr 08, 2021 at 11:52:51AM +0200, Oscar Salvador wrote:
> > I am not really into PMEM, and I ignore whether we need
> > CONFIG_MEMORY_HOTPLUG in order to have such memory on the system.
> > If so, the following can be partly ignored.
> 
> Ok, I refreshed by memory with [1].

Bleh, being [1] https://lore.kernel.org/linux-mm/20190124231448.E102D18E@viggo.jf.intel.com/
David Hildenbrand April 9, 2021, 6:59 p.m. UTC | #4
On 09.04.21 12:14, Oscar Salvador wrote:
> On Thu, Apr 08, 2021 at 11:52:51AM +0200, Oscar Salvador wrote:
>> I am not really into PMEM, and I ignore whether we need
>> CONFIG_MEMORY_HOTPLUG in order to have such memory on the system.
>> If so, the following can be partly ignored.
> 
> Ok, I refreshed by memory with [1].
>  From that, it seems that in order to use PMEM as RAM we need CONFIG_MEMORY_HOTPLUG.
> But is that always the case? Can happen that in some scenario PMEM comes ready
> to use and we do not need the hotplug trick?

The only way to add more System RAM is via add_memory() and friends like 
add_memory_driver_managed(). These all require CONFIG_MEMORY_HOTPLUG.

Memory ballooning is a different case, but there we're only adjusting 
the managed page counters.
Oscar Salvador April 12, 2021, 7:19 a.m. UTC | #5
On Fri, Apr 09, 2021 at 08:59:21PM +0200, David Hildenbrand wrote:
 
> The only way to add more System RAM is via add_memory() and friends like
> add_memory_driver_managed(). These all require CONFIG_MEMORY_HOTPLUG.

Yeah, my point was more towards whether PMEM can come in a way that it does
not have to be hotplugged, but come functional by default (as RAM).
But after having read all papers out there, I do not think that it is possible.
David Hildenbrand April 12, 2021, 9:19 a.m. UTC | #6
On 12.04.21 09:19, Oscar Salvador wrote:
> On Fri, Apr 09, 2021 at 08:59:21PM +0200, David Hildenbrand wrote:
>   
>> The only way to add more System RAM is via add_memory() and friends like
>> add_memory_driver_managed(). These all require CONFIG_MEMORY_HOTPLUG.
> 
> Yeah, my point was more towards whether PMEM can come in a way that it does
> not have to be hotplugged, but come functional by default (as RAM).
> But after having read all papers out there, I do not think that it is possible.
> 

You mean e.g., configuring in the BIOS/firmware how an NVDIMM will get 
exposed to the OS (pmem vs. RAM). I once heard something about that, not 
sure if it's real. But from Linux' perspective, it would simply be 
System RAM and it would get treated like that.
diff mbox series

Patch

diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
--- a/mm/migrate.c~enable-numa-demotion	2021-03-31 15:17:13.056000258 -0700
+++ b/mm/migrate.c	2021-03-31 15:17:13.062000258 -0700
@@ -49,6 +49,7 @@ 
 #include <linux/sched/mm.h>
 #include <linux/ptrace.h>
 #include <linux/oom.h>
+#include <linux/memory.h>
 
 #include <asm/tlbflush.h>
 
@@ -1198,8 +1199,12 @@  out:
  */
 
 /*
- * Writes to this array occur without locking.  READ_ONCE()
- * is recommended for readers to ensure consistent reads.
+ * Writes to this array occur without locking.  Cycles are
+ * not allowed: Node X demotes to Y which demotes to X...
+ *
+ * If multiple reads are performed, a single rcu_read_lock()
+ * must be held over all reads to ensure that no cycles are
+ * observed.
  */
 static int node_demotion[MAX_NUMNODES] __read_mostly =
 	{[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
@@ -1215,13 +1220,22 @@  static int node_demotion[MAX_NUMNODES] _
  */
 int next_demotion_node(int node)
 {
+	int target;
+
 	/*
-	 * node_demotion[] is updated without excluding
-	 * this function from running.  READ_ONCE() avoids
-	 * reading multiple, inconsistent 'node' values
-	 * during an update.
+	 * node_demotion[] is updated without excluding this
+	 * function from running.  RCU doesn't provide any
+	 * compiler barriers, so the READ_ONCE() is required
+	 * to avoid compiler reordering or read merging.
+	 *
+	 * Make sure to use RCU over entire code blocks if
+	 * node_demotion[] reads need to be consistent.
 	 */
-	return READ_ONCE(node_demotion[node]);
+	rcu_read_lock();
+	target = READ_ONCE(node_demotion[node]);
+	rcu_read_unlock();
+
+	return target;
 }
 
 /*
@@ -3226,8 +3240,9 @@  void migrate_vma_finalize(struct migrate
 EXPORT_SYMBOL(migrate_vma_finalize);
 #endif /* CONFIG_DEVICE_PRIVATE */
 
+#if defined(CONFIG_MEMORY_HOTPLUG)
 /* Disable reclaim-based migration. */
-static void disable_all_migrate_targets(void)
+static void __disable_all_migrate_targets(void)
 {
 	int node;
 
@@ -3235,6 +3250,25 @@  static void disable_all_migrate_targets(
 		node_demotion[node] = NUMA_NO_NODE;
 }
 
+static void disable_all_migrate_targets(void)
+{
+	__disable_all_migrate_targets();
+
+	/*
+	 * Ensure that the "disable" is visible across the system.
+	 * Readers will see either a combination of before+disable
+	 * state or disable+after.  They will never see before and
+	 * after state together.
+	 *
+	 * The before+after state together might have cycles and
+	 * could cause readers to do things like loop until this
+	 * function finishes.  This ensures they can only see a
+	 * single "bad" read and would, for instance, only loop
+	 * once.
+	 */
+	synchronize_rcu();
+}
+
 /*
  * Find an automatic demotion target for 'node'.
  * Failing here is OK.  It might just indicate
@@ -3297,20 +3331,6 @@  static void __set_migration_target_nodes
 	disable_all_migrate_targets();
 
 	/*
-	 * Ensure that the "disable" is visible across the system.
-	 * Readers will see either a combination of before+disable
-	 * state or disable+after.  They will never see before and
-	 * after state together.
-	 *
-	 * The before+after state together might have cycles and
-	 * could cause readers to do things like loop until this
-	 * function finishes.  This ensures they can only see a
-	 * single "bad" read and would, for instance, only loop
-	 * once.
-	 */
-	smp_wmb();
-
-	/*
 	 * Allocations go close to CPUs, first.  Assume that
 	 * the migration path starts at the nodes with CPUs.
 	 */
@@ -3347,10 +3367,96 @@  again:
 /*
  * For callers that do not hold get_online_mems() already.
  */
-__maybe_unused // <- temporay to prevent warnings during bisects
 static void set_migration_target_nodes(void)
 {
 	get_online_mems();
 	__set_migration_target_nodes();
 	put_online_mems();
 }
+
+/*
+ * React to hotplug events that might affect the migration targets
+ * like events that online or offline NUMA nodes.
+ *
+ * The ordering is also currently dependent on which nodes have
+ * CPUs.  That means we need CPU on/offline notification too.
+ */
+static int migration_online_cpu(unsigned int cpu)
+{
+	set_migration_target_nodes();
+	return 0;
+}
+
+static int migration_offline_cpu(unsigned int cpu)
+{
+	set_migration_target_nodes();
+	return 0;
+}
+
+/*
+ * This leaves migrate-on-reclaim transiently disabled between
+ * the MEM_GOING_OFFLINE and MEM_OFFLINE events.  This runs
+ * whether reclaim-based migration is enabled or not, which
+ * ensures that the user can turn reclaim-based migration at
+ * any time without needing to recalculate migration targets.
+ *
+ * These callbacks already hold get_online_mems().  That is why
+ * __set_migration_target_nodes() can be used as opposed to
+ * set_migration_target_nodes().
+ */
+static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
+						 unsigned long action, void *arg)
+{
+	switch (action) {
+	case MEM_GOING_OFFLINE:
+		/*
+		 * Make sure there are not transient states where
+		 * an offline node is a migration target.  This
+		 * will leave migration disabled until the offline
+		 * completes and the MEM_OFFLINE case below runs.
+		 */
+		disable_all_migrate_targets();
+		break;
+	case MEM_OFFLINE:
+	case MEM_ONLINE:
+		/*
+		 * Recalculate the target nodes once the node
+		 * reaches its final state (online or offline).
+		 */
+		__set_migration_target_nodes();
+		break;
+	case MEM_CANCEL_OFFLINE:
+		/*
+		 * MEM_GOING_OFFLINE disabled all the migration
+		 * targets.  Reenable them.
+		 */
+		__set_migration_target_nodes();
+		break;
+	case MEM_GOING_ONLINE:
+	case MEM_CANCEL_ONLINE:
+		break;
+	}
+
+	return notifier_from_errno(0);
+}
+
+static int __init migrate_on_reclaim_init(void)
+{
+	int ret;
+
+	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "migrate on reclaim",
+				migration_online_cpu,
+				migration_offline_cpu);
+	/*
+	 * In the unlikely case that this fails, the automatic
+	 * migration targets may become suboptimal for nodes
+	 * where N_CPU changes.  With such a small impact in a
+	 * rare case, do not bother trying to do anything special.
+	 */
+	WARN_ON(ret < 0);
+
+	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
+	return 0;
+}
+late_initcall(migrate_on_reclaim_init);
+#endif /* CONFIG_MEMORY_HOTPLUG */