From patchwork Wed Nov 18 05:19:52 2020
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 11913863
From: Huang Ying
To: Peter Zijlstra
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying,
    Andrew Morton, Ingo Molnar, Mel Gorman, Rik van Riel,
    Johannes Weiner, "Matthew Wilcox (Oracle)", Dave Hansen,
    Andi Kleen, Michal Hocko, David Rientjes
Subject: [RFC -V5] autonuma: Migrate on fault among multiple bound nodes
Date: Wed, 18 Nov 2020 13:19:52 +0800
Message-Id: <20201118051952.39097-1-ying.huang@intel.com>
X-Mailer: git-send-email 2.29.2

Now, AutoNUMA can only optimize page placement among the NUMA nodes if
the default memory policy is used, because a memory policy that is
specified explicitly should take precedence.  But this seems too strict
in some situations.  For example, on a system with 4 NUMA nodes, if the
memory of an application is bound to nodes 0 and 1, AutoNUMA could
migrate pages between nodes 0 and 1 to reduce cross-node accesses
without breaking the explicit memory binding policy.

So this patch adds the MPOL_F_AUTONUMA mode flag to set_mempolicy().
With the flag specified, AutoNUMA will be enabled for the thread to
optimize page placement within the constraints of the specified memory
binding policy.

With the newly added flag, the NUMA balancing control mechanism becomes:

- The sysctl knob numa_balancing can enable/disable NUMA balancing
  globally.

- Even if sysctl numa_balancing is enabled, NUMA balancing is still
  disabled by default for memory areas or applications with an explicit
  memory policy.

- MPOL_F_AUTONUMA can be used to enable NUMA balancing for an
  application even when an explicit memory policy is specified.

Various page placement optimizations based on NUMA balancing can be
done with these flags.  As a first step, in this patch, if the memory
of the application is bound to multiple nodes (MPOL_BIND) and the
accessing node is in the policy nodemask when the hint page fault
handler runs, we try to migrate the page to the accessing node to
reduce cross-node accesses.

In a previous version of this patch, we tried to reuse MPOL_MF_LAZY
for mbind().  But that flag is tied to MPOL_MF_MOVE.*, so it does not
seem to be a good API/ABI for the purpose of this patch.  And because
it is not clear whether it is necessary to enable AutoNUMA for a
specific memory area inside an application, we only add the flag at
the thread level (set_mempolicy()) instead of the memory area level
(mbind()).  We can do that when it becomes necessary.

Signed-off-by: "Huang, Ying"
Cc: Andrew Morton
Cc: Ingo Molnar
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Johannes Weiner
Cc: "Matthew Wilcox (Oracle)"
Cc: Dave Hansen
Cc: Andi Kleen
Cc: Michal Hocko
Cc: David Rientjes

Changes:

v5:
- Remove mbind() support, because it's not clear that it's necessary.

v4:
- Use a new flag instead of reusing MPOL_MF_LAZY.

v3:
- Rebased on latest upstream (v5.10-rc3)
- Revised the change log.

v2:
- Rebased on latest upstream (v5.10-rc1)
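For illustration only (this is not part of the patch), a minimal
userspace sketch of how a thread could opt in, assuming a kernel with
this patch applied and the set_mempolicy(2) wrapper from libnuma's
numaif.h (link with -lnuma); MPOL_F_AUTONUMA is defined locally in
case the installed UAPI headers predate the patch:

	#include <stdio.h>
	#include <stdlib.h>
	#include <numaif.h>	/* set_mempolicy(), MPOL_BIND */

	/* From this patch's uapi change; define locally if the
	 * installed headers do not have it yet. */
	#ifndef MPOL_F_AUTONUMA
	#define MPOL_F_AUTONUMA	(1 << 13)
	#endif

	int main(void)
	{
		/* Bits 0 and 1: bind to NUMA nodes 0 and 1. */
		unsigned long nodemask = (1UL << 0) | (1UL << 1);

		/* MPOL_BIND restricts allocations to the nodemask;
		 * MPOL_F_AUTONUMA additionally lets NUMA balancing
		 * migrate pages on hint faults, but only among the
		 * bound nodes. */
		if (set_mempolicy(MPOL_BIND | MPOL_F_AUTONUMA,
				  &nodemask, 8 * sizeof(nodemask))) {
			perror("set_mempolicy");
			return EXIT_FAILURE;
		}

		/* ... memory allocated from here on stays on nodes
		 * 0/1, with AutoNUMA free to move hot pages between
		 * them ... */
		return EXIT_SUCCESS;
	}

On an unpatched kernel the call fails with EINVAL, because the unknown
mode flag is not masked out by MPOL_MODE_FLAGS and the remaining mode
value exceeds MPOL_MAX.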
---
 include/uapi/linux/mempolicy.h | 4 +++-
 mm/mempolicy.c                 | 9 +++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 3354774af61e..adb49f13840e 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -28,12 +28,14 @@ enum {
 /* Flags for set_mempolicy */
 #define MPOL_F_STATIC_NODES	(1 << 15)
 #define MPOL_F_RELATIVE_NODES	(1 << 14)
+#define MPOL_F_AUTONUMA	(1 << 13) /* Optimize with AutoNUMA if possible */
 
 /*
  * MPOL_MODE_FLAGS is the union of all possible optional mode flags passed to
  * either set_mempolicy() or mbind().
  */
-#define MPOL_MODE_FLAGS (MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES)
+#define MPOL_MODE_FLAGS \
+	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES | MPOL_F_AUTONUMA)
 
 /* Flags for get_mempolicy */
 #define MPOL_F_NODE	(1<<0)	/* return next IL mode instead of node mask */

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3ca4898f3f24..dc77827e8c08 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -875,6 +875,9 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 		goto out;
 	}
 
+	if (new && new->mode == MPOL_BIND && (flags & MPOL_F_AUTONUMA))
+		new->flags |= (MPOL_F_MOF | MPOL_F_MORON);
+
 	ret = mpol_set_nodemask(new, nodes, scratch);
 	if (ret) {
 		mpol_put(new);
@@ -2490,6 +2493,12 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		break;
 
 	case MPOL_BIND:
+		/* Optimize placement among multiple nodes via NUMA balancing */
+		if (pol->flags & MPOL_F_MORON) {
+			if (node_isset(thisnid, pol->v.nodes))
+				break;
+			goto out;
+		}
 		/*
 		 * allows binding to multiple nodes.
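Taken together, the two hunks split the work: do_set_mempolicy() tags
an MPOL_BIND policy with MPOL_F_MOF | MPOL_F_MORON when userspace
passes MPOL_F_AUTONUMA, and mpol_misplaced() then treats a hint fault
from a node inside the bound nodemask as a migration candidate, while
a fault from an unbound node leaves the page where it is.  Below is a
standalone sketch of that placement decision (not kernel code; a plain
bitmask stands in for the kernel's nodemask_t, and the names are
illustrative):

	#include <stdbool.h>

	/*
	 * Stand-in for the MPOL_BIND branch above: with MPOL_F_MORON
	 * set, a NUMA hint fault from a node inside the bound
	 * nodemask makes the page a migration candidate; a fault
	 * from an unbound node never does, so the explicit binding
	 * is preserved.
	 */
	static bool migrate_candidate(unsigned long bound_nodes,
				      int accessing_nid, bool moron)
	{
		if (!moron)
			return false;	/* balancing not enabled */
		return (bound_nodes >> accessing_nid) & 1;
	}

In the surrounding function, "break" falls through to the common
migrate-on-fault handling, which may still veto the move (e.g. via the
two-reference filter), while "goto out" reports the page as not
misplaced.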