[-V9,1/9] mm/numa: add node demotion data structure

From: Dave Hansen <dave.hansen@linux.intel.com>

Prepare for the kernel to auto-migrate pages to other memory nodes
with a node migration table. This allows creating a single migration
target for each NUMA node, so the kernel can migrate colder pages to
another node instead of simply discarding them. A node with no
target is a "terminal node", so reclaim acts normally there.  The
migration target does not fundamentally _need_ to be a single node,
but this implementation starts there to limit complexity.
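
As a rough picture of the single-target idea, the table can be thought
of as one array slot per node.  The sketch below is userspace C, not
the patch's code: MAX_NODES, NO_NODE, demotion_target[] and both
helpers are hypothetical stand-ins (the names used in this series are
node_demotion[] and NUMA_NO_NODE).

```c
#include <stdbool.h>

#define MAX_NODES 4     /* hypothetical size, chosen only for this sketch */
#define NO_NODE   (-1)  /* stand-in for the kernel's NUMA_NO_NODE */

/* One demotion target per node: 0 -> 1, 2 -> 3; nodes 1 and 3 are terminal. */
static int demotion_target[MAX_NODES] = { 1, NO_NODE, 3, NO_NODE };

/* Return the single node that 'node' demotes to, or NO_NODE. */
static int next_target(int node)
{
	if (node < 0 || node >= MAX_NODES)
		return NO_NODE;
	return demotion_target[node];
}

/* A terminal node has no target, so reclaim there behaves as it always has. */
static bool is_terminal(int node)
{
	return next_target(node) == NO_NODE;
}
```

With this layout, next_target(0) yields 1, while is_terminal(1) is
true because node 1 has nowhere further to demote to.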

When memory fills up on a node, memory contents can be
automatically migrated to another node.  The biggest problems are
knowing when to migrate and to where the migration should be
targeted.

The most straightforward way to generate the "to where" list would
be to follow the page allocator fallback lists.  Those lists
already tell us where to look next if memory is full.  It would
also be logical to move memory in that order.

But, the allocator fallback lists have a fatal flaw: most nodes
appear in all the lists.  This would potentially lead to migration
cycles (A->B, B->A, A->B, ...).
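
The cycle is easy to reproduce in a toy model.  The sketch below is
userspace C with hypothetical names (first_fallback[], demotion_order[]
and walk_has_cycle() are not kernel symbols).  It treats each node's
first fallback as its migration target and follows the chain for at
most one hop per node in the system.

```c
#include <stdbool.h>

#define NR_NODES 2
#define NO_NODE  (-1)

/* Two-node system: each node's first allocator fallback is the other node. */
static const int first_fallback[NR_NODES] = { 1, 0 };

/* A cycle-free alternative: node 0 demotes to node 1, node 1 is terminal. */
static const int demotion_order[NR_NODES] = { 1, NO_NODE };

/*
 * Follow 'next[]' from 'start'.  A chain over nr distinct nodes has at
 * most nr - 1 hops, so completing nr hops without reaching NO_NODE
 * means some node was visited twice.
 */
static bool walk_has_cycle(const int *next, int nr, int start)
{
	int node = start;

	for (int hops = 0; hops < nr; hops++) {
		node = next[node];
		if (node == NO_NODE)
			return false;
	}
	return true;
}
```

walk_has_cycle(first_fallback, NR_NODES, 0) reports a cycle, while the
same walk over demotion_order[] terminates at node 1.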

Instead of using the allocator fallback lists directly, keep a
separate node migration ordering.  But, reuse the same data used
to generate page allocator fallback in the first place:
find_next_best_node().

This means that the firmware data used to populate node distances
essentially dictates the ordering for now.  It should also be
architecture-neutral since all NUMA architectures have a working
find_next_best_node().
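
To make the distance-driven ordering concrete, here is a simplified
userspace sketch.  It is not the patch's algorithm verbatim:
distance[][], has_cpu[], order[], pick_nearest_unused() and
build_demotion_order() are hypothetical, and pick_nearest_unused() only
stands in for the role find_next_best_node() plays.  Assumptions made
for the sketch: nodes with CPUs form the first tier, each tier demotes
to the nearest node not used by this or any earlier tier, and a node is
picked as a target by at most one source.

```c
#include <stdbool.h>
#include <string.h>

#define NR_NODES 4
#define NO_NODE  (-1)

/* Made-up distances: nodes 0 and 1 have CPUs, nodes 2 and 3 are slower memory. */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 17, 28 },
	{ 20, 10, 28, 17 },
	{ 17, 28, 10, 28 },
	{ 28, 17, 28, 10 },
};
static const bool has_cpu[NR_NODES] = { true, true, false, false };
static int order[NR_NODES];	/* resulting demotion target per node */

/* Nearest node to 'from' that is not yet marked used, or NO_NODE. */
static int pick_nearest_unused(int from, const bool *used)
{
	int best = NO_NODE;

	for (int n = 0; n < NR_NODES; n++) {
		if (used[n])
			continue;
		if (best == NO_NODE || distance[from][n] < distance[from][best])
			best = n;
	}
	return best;
}

static void build_demotion_order(void)
{
	bool used[NR_NODES] = { false };
	bool this_pass[NR_NODES];
	bool progress = true;

	for (int n = 0; n < NR_NODES; n++) {
		order[n] = NO_NODE;
		this_pass[n] = has_cpu[n];
	}

	while (progress) {
		bool next_pass[NR_NODES] = { false };

		progress = false;
		/* Sources of this tier may never become targets later. */
		for (int n = 0; n < NR_NODES; n++)
			used[n] = used[n] || this_pass[n];

		for (int n = 0; n < NR_NODES; n++) {
			int target;

			if (!this_pass[n])
				continue;
			target = pick_nearest_unused(n, used);
			if (target == NO_NODE)
				continue;	/* 'n' stays terminal */
			order[n] = target;
			used[target] = true;
			next_pass[target] = true;
			progress = true;
		}
		/* The targets just chosen become the sources of the next tier. */
		memcpy(this_pass, next_pass, sizeof(this_pass));
	}
}
```

Because a target is always taken from the unused set and every source
is marked used first, no node can end up pointing back at an earlier
tier, which is what rules out cycles in this model.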

RCU is used to allow lock-less reads of node_demotion[] and to prevent
demotion cycles from being observed.  If multiple reads of node_demotion[]
are performed, a single rcu_read_lock() must be held over all reads to
ensure no cycles are observed.  Details are as follows.

=== What does RCU provide? ===

Imagine a simple loop which walks down the demotion path looking
for the last node:

        terminal_node = start_node;
        while (node_demotion[terminal_node] != NUMA_NO_NODE) {
                terminal_node = node_demotion[terminal_node];
        }

The initial values are:

        node_demotion[0] = 1;
        node_demotion[1] = NUMA_NO_NODE;

and are updated to:

        node_demotion[0] = NUMA_NO_NODE;
        node_demotion[1] = 0;

What guarantees that a reader never observes this cycle:

        node_demotion[0] = 1;
        node_demotion[1] = 0;

which would make the loop run forever?

With RCU, an rcu_read_lock()/rcu_read_unlock() pair can be placed
around the loop.  Since the write side does a synchronize_rcu(), any
loop that observed the old contents is known to have finished by the
time synchronize_rcu() returns.

RCU, combined with disable_all_migrate_targets(), ensures that
the old migration state is not visible by the time
__set_migration_target_nodes() is called.
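
The write-side sequence described here can be sketched in three steps:
clear every target, wait for readers, then install the new targets.
The code below is a single-threaded userspace model, so it shows only
the ordering of the steps.  wait_for_readers() is an empty hypothetical
stand-in for synchronize_rcu(), and table[], disable_all_targets() and
update_targets() are illustrative names rather than the patch's
functions.

```c
#define NR_NODES 2
#define NO_NODE  (-1)

static int table[NR_NODES] = { 1, NO_NODE };	/* old state: 0 -> 1 */
static int grace_periods;			/* counts modelled grace periods */

/*
 * Stand-in for synchronize_rcu().  In the kernel that call blocks until
 * every reader that started before it has left its read-side section;
 * here it only records that the step happened.
 */
static void wait_for_readers(void)
{
	grace_periods++;
}

/* Step 1 and 2: clear all targets, then wait out the old readers. */
static void disable_all_targets(void)
{
	for (int n = 0; n < NR_NODES; n++)
		table[n] = NO_NODE;
	wait_for_readers();
}

/* Step 3: only now install the new targets. */
static void update_targets(const int *new_targets)
{
	disable_all_targets();
	for (int n = 0; n < NR_NODES; n++)
		table[n] = new_targets[n];
}
```

The intent of the middle step is that a reader sees either a mix of old
and cleared entries or a mix of cleared and new entries, but never old
and new entries in the same walk.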

=== What does READ_ONCE() provide? ===

READ_ONCE() forbids the compiler from merging or reordering
successive reads of node_demotion[].  This ensures that any
updates are *eventually* observed.

Consider the above loop again.  The compiler could theoretically
read the entirety of node_demotion[] into local storage
(registers) and never go back to memory, and *permanently*
observe bad values for node_demotion[].
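
Putting both pieces together, the read side could look roughly like the
following.  This is a userspace approximation so that it compiles
outside the kernel: the READ_ONCE() macro here is a simplified
volatile-cast stand-in for the kernel macro (it relies on the
__typeof__ extension of GCC and Clang), rcu_read_lock() and
rcu_read_unlock() are empty placeholders, and find_terminal_node() is a
hypothetical helper name.

```c
#define NR_NODES     2
#define NUMA_NO_NODE (-1)

/* Simplified stand-ins; the real kernel definitions are more involved. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define rcu_read_lock()    do { } while (0)
#define rcu_read_unlock()  do { } while (0)

static int node_demotion[NR_NODES] = { 1, NUMA_NO_NODE };

/* Walk the demotion path from 'start_node' and return the last node on it. */
static int find_terminal_node(int start_node)
{
	int terminal_node = start_node;
	int next;

	/* A single read-side section covers every read of node_demotion[]. */
	rcu_read_lock();
	/* One forced load per step; the compiler may not cache the array. */
	while ((next = READ_ONCE(node_demotion[terminal_node])) != NUMA_NO_NODE)
		terminal_node = next;
	rcu_read_unlock();

	return terminal_node;
}
```

Unlike the loop shown earlier, this version loads each entry once per
step and reuses the value, so the test and the assignment cannot
disagree about what was read.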

Note: RCU does not provide any universal compiler-ordering
guarantees:

	https://lore.kernel.org/lkml/20150921204327.GH4029@linux.vnet.ibm.com/

This code is unused for now.  It will be called later in the
series.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>

--

Changes from 20210618:
 * Merge patches for data structure definition and initialization
 * Move RCU usage from the next patch in series per Zi's comments

Changes from 20210302:
 * Fix typo in node_demotion[] comment

Changes since 20200122:
 * Make node_demotion[] __read_mostly
 * Add big node_demotion[] comment

Changes in July 2020:
 * Remove loop from next_demotion_node() and get_online_mems().
   This means that the node returned by next_demotion_node()
   might now be offline, but the worst case is that the
   allocation fails.  That's fine since it is transient.
---
 mm/internal.h   |   5 ++
 mm/migrate.c    | 216 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c |   2 +-
 3 files changed, 222 insertions(+), 1 deletion(-)

From: Huang Ying <ying.huang@intel.com>
To: linux-mm@kvack.org
Subject: [PATCH -V9 1/9] mm/numa: add node demotion data structure
Date: Fri, 25 Jun 2021 15:31:56 +0800
Message-Id: <20210625073204.1005986-2-ying.huang@intel.com>
