From patchwork Mon Jun 29 23:45:17 2020
X-Patchwork-Submitter: Dave Hansen
X-Patchwork-Id: 11632871
Subject: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Dave Hansen, yang.shi@linux.alibaba.com, rientjes@google.com, ying.huang@intel.com, dan.j.williams@intel.com
From: Dave Hansen
Date: Mon, 29 Jun 2020 16:45:17 -0700
References: <20200629234503.749E5340@viggo.jf.intel.com>
In-Reply-To: <20200629234503.749E5340@viggo.jf.intel.com>
Message-Id: <20200629234517.A7EC4BD3@viggo.jf.intel.com>

From: Dave Hansen

Some method is obviously needed to enable reclaim-based migration.

Just like traditional autonuma, there will be some workloads that
will benefit, such as workloads with more "static" configurations
where hot pages stay hot and cold pages stay cold. If pages come and
go from the hot and cold sets, the benefits of this approach will be
more limited.

The benefits are truly workload-based and *not* hardware-based.
We do not believe that there is a viable threshold where certain
hardware configurations should have this mechanism enabled while
others do not.

To be conservative, earlier work defaulted to disabling reclaim-based
migration and did not include a mechanism to enable it. This proposes
extending the existing "zone_reclaim_mode" (now really
node_reclaim_mode) as a method to enable it.

We are open to any alternative that allows end users to enable
this mechanism or disable it if workload harm is detected (just
like traditional autonuma).

The implementation here is pretty simple and entirely unoptimized.
On any memory hotplug event, assume that a node was added or
removed and recalculate all migration targets. This ensures that
the node_demotion[] array is always ready to be used in case the
new reclaim mode is enabled. This recalculation is far from
optimal, most glaringly in that it does not even attempt to figure
out if nodes are actually coming or going.

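For illustration only, here is a minimal user-space sketch of how the proposed bit would combine with the existing vm.zone_reclaim_mode bits. The bit values mirror the RECLAIM_* definitions in mm/vmscan.c as changed by this patch. The two helper functions are hypothetical and exist only in this example; they are not kernel interfaces.

```c
#include <stdbool.h>

/* Bit values mirror the RECLAIM_* definitions touched by this patch. */
#define RECLAIM_RSVD	(1u << 0)	/* (currently ignored/unused) */
#define RECLAIM_WRITE	(1u << 1)	/* Writeout pages during reclaim */
#define RECLAIM_UNMAP	(1u << 2)	/* Unmap pages during reclaim */
#define RECLAIM_MIGRATE	(1u << 3)	/* Migrate pages during reclaim */

/* Hypothetical helper: is the migrate bit set in a given mode value? */
static bool reclaim_migrate_enabled(unsigned int mode)
{
	return (mode & RECLAIM_MIGRATE) != 0;
}

/* Hypothetical helper: set the migrate bit, leaving the other bits alone. */
static unsigned int reclaim_mode_with_migrate(unsigned int mode)
{
	return mode | RECLAIM_MIGRATE;
}
```

Under these assumptions, a system already running with a mode value of 4 would use 12 to keep its current behavior and add migration, and clearing bit 8 again would be the way to back out if workload harm is detected.
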
Signed-off-by: Dave Hansen
Cc: Yang Shi
Cc: David Rientjes
Cc: Huang Ying
Cc: Dan Williams
---

 b/Documentation/admin-guide/sysctl/vm.rst |    9 ++++
 b/mm/migrate.c                            |   61 +++++++++++++++++++++++++++++-
 b/mm/vmscan.c                             |    7 +--
 3 files changed, 73 insertions(+), 4 deletions(-)

diff -puN Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion Documentation/admin-guide/sysctl/vm.rst
--- a/Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion	2020-06-29 16:35:01.012312549 -0700
+++ b/Documentation/admin-guide/sysctl/vm.rst	2020-06-29 16:35:01.021312549 -0700
@@ -941,6 +941,7 @@ This is value OR'ed together of
 1	(bit currently ignored)
 2	Zone reclaim writes dirty pages out
 4	Zone reclaim swaps pages
+8	Zone reclaim migrates pages
 =	===================================
 
 zone_reclaim_mode is disabled by default. For file servers or workloads
@@ -965,3 +966,11 @@ of other processes running on other node
 Allowing regular swap effectively restricts allocations to the local
 node unless explicitly overridden by memory policies or cpuset
 configurations.
+
+Page migration during reclaim is intended for systems with tiered memory
+configurations. These systems have multiple types of memory with varied
+performance characteristics instead of plain NUMA systems where the same
+kind of memory is found at varied distances. Allowing page migration
+during reclaim enables these systems to migrate pages from fast tiers to
+slow tiers when the fast tier is under pressure. This migration is
+performed before swap.
diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
--- a/mm/migrate.c~enable-numa-demotion	2020-06-29 16:35:01.015312549 -0700
+++ b/mm/migrate.c	2020-06-29 16:35:01.022312549 -0700
@@ -49,6 +49,7 @@
 #include
 #include
 #include
+#include
 
 #include
 
@@ -3165,6 +3166,10 @@ void set_migration_target_nodes(void)
 	 * Avoid any oddities like cycles that could occur
 	 * from changes in the topology. This will leave
 	 * a momentary gap when migration is disabled.
+	 *
+	 * This is superfluous for memory offlining since
+	 * MEM_GOING_OFFLINE does it independently, but it
+	 * does not hurt to do it a second time.
 	 */
 	disable_all_migrate_targets();
 
@@ -3211,6 +3216,60 @@ again:
 	/* Is another pass necessary? */
 	if (!nodes_empty(next_pass))
 		goto again;
+}
 
-	put_online_mems();
+/*
+ * React to hotplug events that might online or offline
+ * NUMA nodes.
+ *
+ * This leaves migrate-on-reclaim transiently disabled
+ * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
+ * This runs whether RECLAIM_MIGRATE is enabled or not.
+ * That ensures that the user can turn RECLAIM_MIGRATE on
+ * without needing to recalculate migration targets.
+ */
+#if defined(CONFIG_MEMORY_HOTPLUG)
+static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
+						 unsigned long action, void *arg)
+{
+	switch (action) {
+	case MEM_GOING_OFFLINE:
+		/*
+		 * Make sure there are not transient states where
+		 * an offline node is a migration target. This
+		 * will leave migration disabled until the offline
+		 * completes and the MEM_OFFLINE case below runs.
+		 */
+		disable_all_migrate_targets();
+		break;
+	case MEM_OFFLINE:
+	case MEM_ONLINE:
+		/*
+		 * Recalculate the target nodes once the node
+		 * reaches its final state (online or offline).
+		 */
+		set_migration_target_nodes();
+		break;
+	case MEM_CANCEL_OFFLINE:
+		/*
+		 * MEM_GOING_OFFLINE disabled all the migration
+		 * targets. Reenable them.
+		 */
+		set_migration_target_nodes();
+		break;
+	case MEM_GOING_ONLINE:
+	case MEM_CANCEL_ONLINE:
+		break;
+	}
+
+	return notifier_from_errno(0);
 }
+
+static int __init migrate_on_reclaim_init(void)
+{
+	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
+	return 0;
+}
+late_initcall(migrate_on_reclaim_init);
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
--- a/mm/vmscan.c~enable-numa-demotion	2020-06-29 16:35:01.017312549 -0700
+++ b/mm/vmscan.c	2020-06-29 16:35:01.023312549 -0700
@@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
  * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
  * ABI. New bits are OK, but existing bits can never change.
  */
-#define RECLAIM_RSVD  (1<<0)	/* (currently ignored/unused) */
-#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
-#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
+#define RECLAIM_RSVD	(1<<0)	/* (currently ignored/unused) */
+#define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
+#define RECLAIM_UNMAP	(1<<2)	/* Unmap pages during reclaim */
+#define RECLAIM_MIGRATE	(1<<3)	/* Migrate pages during reclaim */
 
 /*
  * Priority for NODE_RECLAIM. This determines the fraction of pages