[01/10] mm/numa: node demotion data structure and lookup

Message ID	20210401183218.E7C9CE24@viggo.jf.intel.com (mailing list archive)
State	New, archived
Headers
Return-Path: <SRS0=d9cH=I6=kvack.org=owner-linux-mm@kernel.org>
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 7542A60FE6
IronPort-SDR: 
 7FRYJNwEQF7E3+7A2duDiNgjNXDWDe2KKsmKDaQqz1hcf+GEvQq0sUD2wrsoD88kZSLtzEntPB
 SVSyMRw3Kerg==
IronPort-SDR: 
 JioM0Sj4VyscMKESqMCXUZ967wamiVnr5qKHzeazQa098ewbw65ZWDvck0Ud2p9YnZwZE1uz2l
 KP6i84s581TA==
Subject: [PATCH 01/10] mm/numa: node demotion data structure and lookup
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org,Dave Hansen
 <dave.hansen@linux.intel.com>,shy828301@gmail.com,weixugc@google.com,rientjes@google.com,ying.huang@intel.com,dan.j.williams@intel.com,david@redhat.com,osalvador@suse.de
From: Dave Hansen <dave.hansen@linux.intel.com>
Date: Thu, 01 Apr 2021 11:32:18 -0700
References: <20210401183216.443C4443@viggo.jf.intel.com>
In-Reply-To: <20210401183216.443C4443@viggo.jf.intel.com>
Message-Id: <20210401183218.E7C9CE24@viggo.jf.intel.com>
Received-SPF: none (linux.intel.com>: No applicable sender policy available)
 receiver=imf23; identity=mailfrom;
 envelope-from="<dave.hansen@linux.intel.com>"; helo=mga11.intel.com;
 client-ip=192.55.52.93
Sender: owner-linux-mm@kvack.org
Precedence: bulk
Series	Migrate Pages in lieu of discard
[00/10,v7,RESEND] Migrate Pages in lieu of discard
[01/10] mm/numa: node demotion data structure and lookup
[02/10] mm/numa: automatically generate node migration order
[03/10] mm/migrate: update node demotion order during on hotplug events
[04/10] mm/migrate: make migrate_pages() return nr_succeeded
[05/10] mm/migrate: demote pages during reclaim
[06/10] mm/vmscan: add page demotion counter
[07/10] mm/vmscan: add helper for querying ability to age anonymous pages
[08/10] mm/vmscan: Consider anonymous pages without swap
[09/10] mm/vmscan: never demote for memcg reclaim
[10/10] mm/migrate: new zone_reclaim_mode to enable reclaim migration

Commit Message

Dave Hansen April 1, 2021, 6:32 p.m. UTC

From: Dave Hansen <dave.hansen@linux.intel.com>

Prepare for the kernel to auto-migrate pages to other memory nodes
with a user-defined node migration table. This allows creating a single
migration target for each NUMA node to enable the kernel to do NUMA
page migrations instead of simply reclaiming colder pages. A node
with no target is a "terminal node", so reclaim acts normally there.
The migration target does not fundamentally _need_ to be a single node,
but this implementation starts there to limit complexity.

If you consider the migration path as a graph, cycles (loops) in the
graph are disallowed.  This avoids wasting resources by constantly
migrating (A->B, B->A, A->B ...).  The expectation is that cycles will
never be allowed.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: osalvador <osalvador@suse.de>

--

changes since 20200122:
 * Make node_demotion[] __read_mostly

changes in July 2020:
 - Remove loop from next_demotion_node() and get_online_mems().
   This means that the node returned by next_demotion_node()
   might now be offline, but the worst case is that the
   allocation fails.  That's fine since it is transient.
---

 b/mm/migrate.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)
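
The commit message above describes a per-node table with a single demotion target, "terminal" nodes with no target, and a ban on cycles. The following is a minimal userspace sketch of those three ideas, not the kernel code. MAX_NUMNODES is set to a small value for illustration, and demotion_table_init(), demotion_path_has_cycle() and set_demotion_target() are hypothetical helpers that do not appear in this patch; only node_demotion[] and next_demotion_node() mirror names from the series.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NUMNODES 8		/* assumption: small value for illustration */
#define NUMA_NO_NODE (-1)

/* One demotion target per node; NUMA_NO_NODE marks a terminal node. */
static int node_demotion[MAX_NUMNODES];

/* Hypothetical helper: mark every node terminal. */
static void demotion_table_init(void)
{
	for (int i = 0; i < MAX_NUMNODES; i++)
		node_demotion[i] = NUMA_NO_NODE;
}

/* Same lookup as in the patch, plus a bounds check for this sketch. */
static int next_demotion_node(int node)
{
	assert(node >= 0 && node < MAX_NUMNODES);
	return node_demotion[node];
}

/*
 * Hypothetical helper: an acyclic path can visit at most MAX_NUMNODES
 * nodes, so walking further than that means some node was revisited.
 */
static bool demotion_path_has_cycle(int start)
{
	int steps = 0;

	for (int node = start; node != NUMA_NO_NODE;
	     node = next_demotion_node(node)) {
		if (++steps > MAX_NUMNODES)
			return true;
	}
	return false;
}

/*
 * Hypothetical helper: install from -> to, but roll back and return
 * false if the new edge would close a loop such as A->B, B->A.
 */
static bool set_demotion_target(int from, int to)
{
	assert(from >= 0 && from < MAX_NUMNODES);
	assert(to == NUMA_NO_NODE || (to >= 0 && to < MAX_NUMNODES));

	int old = node_demotion[from];

	node_demotion[from] = to;
	if (demotion_path_has_cycle(from)) {
		node_demotion[from] = old;
		return false;
	}
	return true;
}
```

With the table initialized, set_demotion_target(0, 2) followed by set_demotion_target(2, 3) builds the path 0->2->3, after which set_demotion_target(3, 0) is refused and node 3 stays terminal. Checking only the path that starts at the modified node is enough here because, if the table was acyclic before, any new cycle must pass through the new edge.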

Comments

Oscar Salvador April 8, 2021, 8:03 a.m. UTC | #1

On Thu, Apr 01, 2021 at 11:32:18AM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Prepare for the kernel to auto-migrate pages to other memory nodes
> with a user-defined node migration table. This allows creating a single
> migration target for each NUMA node to enable the kernel to do NUMA
> page migrations instead of simply reclaiming colder pages. A node
> with no target is a "terminal node", so reclaim acts normally there.
> The migration target does not fundamentally _need_ to be a single node,
> but this implementation starts there to limit complexity.
> 
> If you consider the migration path as a graph, cycles (loops) in the
> graph are disallowed.  This avoids wasting resources by constantly
> migrating (A->B, B->A, A->B ...).  The expectation is that cycles will
> never be allowed.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Yang Shi <shy828301@gmail.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: osalvador <osalvador@suse.de>

I think this patch and patch#2 could be squashed

Reviewed-by: Oscar Salvador <osalvador@suse.de>

Dave Hansen April 8, 2021, 9:29 p.m. UTC | #2

On 4/8/21 1:03 AM, Oscar Salvador wrote:
> I think this patch and patch#2 could be squashed
> 
> Reviewed-by: Oscar Salvador <osalvador@suse.de>

Yeah, that makes a lot of sense.  I'll do that for the next version.

Wei Xu April 9, 2021, 5:32 a.m. UTC | #3

It makes sense to start with a simple node tiering model like this
change, which looks good to me.

I do want to mention a likely usage scenario that motivates the need
for a list of nodes as the demotion target of a source node.

Access to a cross-socket DRAM node is still fast enough.  So to minimize
memory stranding, job processes can be allowed to fall back to
allocating pages from a remote DRAM node.

But cross-socket access to PMEM nodes (the slower tier) can be slow,
especially for random writes.  It is then desirable not to demote the
pages of a process to a remote PMEM node, even when the pages are on
a remote DRAM node, which has the remote PMEM node as its demotion
target.  At the same time, it is also desirable to still be able to
demote such pages when they become cold so that the more precious
DRAM occupied by these pages can be used for more active data.

To support such use cases, we need to be able to specify a list of
demotion target nodes for the remote DRAM node, which should include
the PMEM node closer to the process.  Certainly, we will also need an
ability to limit the demotion target nodes of a process (or a cgroup)
to ensure that only local PMEM nodes are eligible as the actual
demotion target.

Note that demoting a page to a remote PMEM node is more acceptable
than having a process access the same remote PMEM node, because demotion
is a one-time, sequential access and can also use non-temporal stores
to reduce the access overheads and bypass caches.
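
The scenario above can be sketched in userspace C as a preference-ordered target list per source node, filtered by a per-process (or per-cgroup) mask of allowed nodes. This is only an illustration of the idea, not code from this series: struct demotion_targets, node_demotion_list[], MAX_TARGETS, pick_demotion_node() and the bitmask representation are all hypothetical.

```c
#define MAX_NUMNODES 8		/* assumption: small value for illustration */
#define NUMA_NO_NODE (-1)
#define MAX_TARGETS  4		/* assumption: cap on targets per node */

/* Hypothetical: demotion targets of one source node, best first. */
struct demotion_targets {
	int nr;
	int nodes[MAX_TARGETS];
};

/* Zero-initialized, so every node starts with no targets (terminal). */
static struct demotion_targets node_demotion_list[MAX_NUMNODES];

/*
 * Hypothetical lookup: return the first target of @node that the
 * caller allows.  Bit n of @allowed set means node n may be used as a
 * demotion target by this process or cgroup.
 */
static int pick_demotion_node(int node, unsigned long allowed)
{
	const struct demotion_targets *t = &node_demotion_list[node];

	for (int i = 0; i < t->nr; i++) {
		int target = t->nodes[i];

		if (allowed & (1UL << target))
			return target;
	}
	return NUMA_NO_NODE;
}
```

As a usage example, assume socket 0 holds DRAM node 0 and PMEM node 2, and socket 1 holds DRAM node 1 and PMEM node 3. If node 1 lists targets {3, 2} and a process running on socket 0 only allows node 2, then pick_demotion_node(1, 1UL << 2) returns 2, so its cold pages on remote DRAM are demoted to the PMEM node local to the process.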

Reviewed-by: Wei Xu <weixugc@google.com>

On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> Prepare for the kernel to auto-migrate pages to other memory nodes
> with a user-defined node migration table. This allows creating a single
> migration target for each NUMA node to enable the kernel to do NUMA
> page migrations instead of simply reclaiming colder pages. A node
> with no target is a "terminal node", so reclaim acts normally there.
> The migration target does not fundamentally _need_ to be a single node,
> but this implementation starts there to limit complexity.
>
> If you consider the migration path as a graph, cycles (loops) in the
> graph are disallowed.  This avoids wasting resources by constantly
> migrating (A->B, B->A, A->B ...).  The expectation is that cycles will
> never be allowed.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Yang Shi <shy828301@gmail.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: osalvador <osalvador@suse.de>
>
> --
>
> changes since 20200122:
>  * Make node_demotion[] __read_mostly
>
> changes in July 2020:
>  - Remove loop from next_demotion_node() and get_online_mems().
>    This means that the node returned by next_demotion_node()
>    might now be offline, but the worst case is that the
>    allocation fails.  That's fine since it is transient.
> ---
>
>  b/mm/migrate.c |   17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff -puN mm/migrate.c~0006-node-Define-and-export-memory-migration-path mm/migrate.c
> --- a/mm/migrate.c~0006-node-Define-and-export-memory-migration-path    2021-03-31 15:17:10.734000264 -0700
> +++ b/mm/migrate.c      2021-03-31 15:17:10.742000264 -0700
> @@ -1163,6 +1163,23 @@ out:
>         return rc;
>  }
>
> +static int node_demotion[MAX_NUMNODES] __read_mostly =
> +       {[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
> +
> +/**
> + * next_demotion_node() - Get the next node in the demotion path
> + * @node: The starting node to lookup the next node
> + *
> + * @returns: node id for next memory node in the demotion path hierarchy
> + * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
> + * @node online or guarantee that it *continues* to be the next demotion
> + * target.
> + */
> +int next_demotion_node(int node)
> +{
> +       return node_demotion[node];
> +}
> +
>  /*
>   * Obtain the lock on page, remove all ptes and migrate the page
>   * to the newly allocated page in newpage.
> _

diff -puN mm/migrate.c~0006-node-Define-and-export-memory-migration-path mm/migrate.c
--- a/mm/migrate.c~0006-node-Define-and-export-memory-migration-path	2021-03-31 15:17:10.734000264 -0700
+++ b/mm/migrate.c	2021-03-31 15:17:10.742000264 -0700
@@ -1163,6 +1163,23 @@  out:
 	return rc;
 }
 
+static int node_demotion[MAX_NUMNODES] __read_mostly =
+	{[0 ...  MAX_NUMNODES - 1] = NUMA_NO_NODE};
+
+/**
+ * next_demotion_node() - Get the next node in the demotion path
+ * @node: The starting node to lookup the next node
+ *
+ * @returns: node id for next memory node in the demotion path hierarchy
+ * from @node; NUMA_NO_NODE if @node is terminal.  This does not keep
+ * @node online or guarantee that it *continues* to be the next demotion
+ * target.
+ */
+int next_demotion_node(int node)
+{
+	return node_demotion[node];
+}
+
 /*
  * Obtain the lock on page, remove all ptes and migrate the page
  * to the newly allocated page in newpage.
