From patchwork Fri Dec 4 09:15:32 2020
From: Huang Ying
To: Peter Zijlstra, Mel Gorman
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying,
    Andrew Morton, Ingo Molnar, Rik van Riel, Johannes Weiner,
    "Matthew Wilcox (Oracle)", Dave Hansen, Andi Kleen, Michal Hocko,
    David Rientjes, linux-api@vger.kernel.org
Subject: [PATCH -V7 1/3] numa balancing: Migrate on fault among multiple bound nodes
Date: Fri, 4 Dec 2020 17:15:32 +0800
Message-Id: <20201204091534.72239-2-ying.huang@intel.com>
In-Reply-To: <20201204091534.72239-1-ying.huang@intel.com>
References: <20201204091534.72239-1-ying.huang@intel.com>

Now, NUMA balancing can only optimize the page placement among the
NUMA nodes if the default memory policy is used, because an
explicitly specified memory policy should take precedence.  But this
is too strict in some situations.  For example, on a system with 4
NUMA nodes, if the memory of an application is bound to nodes 0 and
1, NUMA balancing could still migrate pages between nodes 0 and 1 to
reduce cross-node access without breaking the explicit memory binding
policy.

So in this patch, we add the MPOL_F_NUMA_BALANCING mode flag to
set_mempolicy() for use when the mode is MPOL_BIND.  With the flag
specified, NUMA balancing is enabled within the thread to optimize
the page placement within the constraints of the specified memory
binding policy.

With the newly added flag, the NUMA balancing control mechanism
becomes:

- The sysctl knob numa_balancing enables/disables NUMA balancing
  globally.

- Even if sysctl numa_balancing is enabled, NUMA balancing is
  disabled by default for memory areas or applications with an
  explicit memory policy.

- MPOL_F_NUMA_BALANCING enables NUMA balancing for an application
  that specifies an explicit memory policy (MPOL_BIND).

Various page placement optimizations based on NUMA balancing can be
implemented behind these flags.  As the first step, in this patch, if
the memory of the application is bound to multiple nodes (MPOL_BIND)
and the accessing node is in the policy nodemask when the hint page
fault is handled, we try to migrate the page to the accessing node to
reduce cross-node access.

If an application specifies the newly added MPOL_F_NUMA_BALANCING
flag on an old kernel version without support for it, set_mempolicy()
returns -1 and errno is set to EINVAL.  An application can rely on
this behavior to run on both old and new kernel versions.  And if the
MPOL_F_NUMA_BALANCING flag is specified with a mode other than
MPOL_BIND, set_mempolicy() returns -1 and errno is set to EINVAL as
before, because we don't support NUMA-balancing-based optimization
for the other modes yet.

In a previous version of this patch, we tried to reuse MPOL_MF_LAZY
for mbind().  But that flag is tied to MPOL_MF_MOVE.*, so it isn't a
good API/ABI for the purpose of this patch.  And because it isn't
clear whether it is necessary to enable NUMA balancing for a specific
memory area inside an application, we only add the flag at the thread
level (set_mempolicy()) instead of the memory area level (mbind()).
That can be done later, when it becomes necessary.

To test the patch, we ran the following test case on a 4-node machine
with 192 GB memory (48 GB per node).

1. Change the pmbench memory accessing benchmark to call
   set_mempolicy() to bind its memory to nodes 1 and 3 and enable
   NUMA balancing.
   Some related code snippets are as follows:

	#include <stdio.h>
	#include <stdlib.h>
	#include <errno.h>
	#include <numaif.h>
	#include <numa.h>

	struct bitmask *bmp;
	int ret;

	bmp = numa_parse_nodestring("1,3");
	ret = set_mempolicy(MPOL_BIND | MPOL_F_NUMA_BALANCING,
			    bmp->maskp, bmp->size + 1);
	/* If MPOL_F_NUMA_BALANCING isn't supported, fall back to MPOL_BIND */
	if (ret < 0 && errno == EINVAL)
		ret = set_mempolicy(MPOL_BIND, bmp->maskp, bmp->size + 1);
	if (ret < 0) {
		perror("Failed to call set_mempolicy");
		exit(-1);
	}

2. Run a memory eater on node 3 to use 40 GB memory before running
   pmbench.

3. Run pmbench with 64 processes; the working-set size of each
   process is 640 MB, so the total working-set size is 64 * 640 MB =
   40 GB.  The CPUs and the memory (as in step 1) of all pmbench
   processes are bound to nodes 1 and 3.  So, after CPU usage is
   balanced, some pmbench processes running on the CPUs of node 3
   will access the memory of node 1.

4. After the pmbench processes have run for 100 seconds, kill the
   memory eater.  Now it is possible for some pmbench processes to
   migrate their pages from node 1 to node 3 to reduce cross-node
   access.

Test results show that, with the patch, the pages can be migrated
from node 1 to node 3 after killing the memory eater, and the pmbench
score increases by about 17.5%.

Signed-off-by: "Huang, Ying"
Acked-by: Mel Gorman
Cc: Andrew Morton
Cc: Ingo Molnar
Cc: Rik van Riel
Cc: Johannes Weiner
Cc: "Matthew Wilcox (Oracle)"
Cc: Dave Hansen
Cc: Andi Kleen
Cc: Michal Hocko
Cc: David Rientjes
Cc: linux-api@vger.kernel.org
---
 include/uapi/linux/mempolicy.h |  4 +++-
 mm/mempolicy.c                 | 16 ++++++++++++++++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 3354774af61e..8948467b3992 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -28,12 +28,14 @@ enum {
 /* Flags for set_mempolicy */
 #define MPOL_F_STATIC_NODES	(1 << 15)
 #define MPOL_F_RELATIVE_NODES	(1 << 14)
+#define MPOL_F_NUMA_BALANCING	(1 << 13) /* Optimize with NUMA balancing if possible */
 
 /*
  * MPOL_MODE_FLAGS is the union of all possible optional mode flags passed to
  * either set_mempolicy() or mbind().
  */
-#define MPOL_MODE_FLAGS	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES)
+#define MPOL_MODE_FLAGS							\
+	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES | MPOL_F_NUMA_BALANCING)
 
 /* Flags for get_mempolicy */
 #define MPOL_F_NODE	(1<<0)	/* return next IL mode instead of node mask */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3ca4898f3f24..c3f70d8e17d6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -875,6 +875,16 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 		goto out;
 	}
 
+	if (flags & MPOL_F_NUMA_BALANCING) {
+		if (new && new->mode == MPOL_BIND) {
+			new->flags |= (MPOL_F_MOF | MPOL_F_MORON);
+		} else {
+			ret = -EINVAL;
+			mpol_put(new);
+			goto out;
+		}
+	}
+
 	ret = mpol_set_nodemask(new, nodes, scratch);
 	if (ret) {
 		mpol_put(new);
@@ -2490,6 +2500,12 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		break;
 
 	case MPOL_BIND:
+		/* Optimize placement among multiple nodes via NUMA balancing */
+		if (pol->flags & MPOL_F_MORON) {
+			if (node_isset(thisnid, pol->v.nodes))
+				break;
+			goto out;
+		}
 		/*
 		 * allows binding to multiple nodes.
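
Whether the hint page faults actually moved the pages can also be
checked from user space with move_pages() in query mode (nodes ==
NULL).  A minimal sketch, not part of the patch: print_page_nodes()
is a hypothetical helper, and buf/len come from the calling
application.

	/*
	 * Query which NUMA node each page of buf currently resides on,
	 * using move_pages() in query mode.  Assumes buf is page-aligned
	 * and len is a multiple of the page size.
	 * Build with: cc query_nodes.c -lnuma
	 */
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <numaif.h>

	static void print_page_nodes(void *buf, size_t len)
	{
		long page_size = sysconf(_SC_PAGESIZE);
		unsigned long count = len / page_size;
		void **pages = malloc(count * sizeof(void *));
		int *status = malloc(count * sizeof(int));
		unsigned long i;

		for (i = 0; i < count; i++)
			pages[i] = (char *)buf + i * page_size;
		/* nodes == NULL: don't move pages, only report locations */
		if (move_pages(0 /* self */, count, pages, NULL, status, 0)) {
			perror("move_pages");
		} else {
			for (i = 0; i < count; i++)
				/* status[i] is a node ID, or a negative errno
				   (e.g. -ENOENT for a page not faulted in) */
				printf("page %lu: %d\n", i, status[i]);
		}
		free(pages);
		free(status);
	}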
From patchwork Fri Dec 4 09:15:33 2020
From: Huang Ying
To: Peter Zijlstra, Mel Gorman
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying,
    "Matthew Wilcox (Oracle)", Rafael Aquini, Andrew Morton, Ingo Molnar,
    Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen, Michal Hocko,
    David Rientjes, linux-api@vger.kernel.org
Subject: [PATCH -V7 2/3] NOT kernel/man2/set_mempolicy.2: Add mode flag
 MPOL_F_NUMA_BALANCING
Date: Fri, 4 Dec 2020 17:15:33 +0800
Message-Id: <20201204091534.72239-3-ying.huang@intel.com>
In-Reply-To: <20201204091534.72239-1-ying.huang@intel.com>
References: <20201204091534.72239-1-ying.huang@intel.com>

Signed-off-by: "Huang, Ying"
---
 man2/set_mempolicy.2 | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/man2/set_mempolicy.2 b/man2/set_mempolicy.2
index 68011eecb..fb2e6fd96 100644
--- a/man2/set_mempolicy.2
+++ b/man2/set_mempolicy.2
@@ -113,6 +113,15 @@ A nonempty
 .I nodemask
 specifies node IDs that are relative to the set
 of node IDs allowed by the process's current cpuset.
+.TP
+.BR MPOL_F_NUMA_BALANCING " (since Linux 5.11)"
+When
+.I mode
+is MPOL_BIND, enable Linux kernel NUMA balancing for the task if it
+is supported by the kernel.
+If the flag isn't supported by the Linux kernel, or is used with a
+.I mode
+other than MPOL_BIND, return -1 and set errno to EINVAL.
 .PP
 .I nodemask
 points to a bit mask of node IDs that contains up to
@@ -293,6 +302,11 @@ argument specified both
 .B MPOL_F_STATIC_NODES
 and
 .BR MPOL_F_RELATIVE_NODES .
+Or, the
+.B MPOL_F_NUMA_BALANCING
+flag isn't supported by the Linux kernel, or is used with a
+.I mode
+other than MPOL_BIND.
 .TP
 .B ENOMEM
 Insufficient kernel memory was available.

From patchwork Fri Dec 4 09:15:34 2020
From: Huang Ying
To: Peter Zijlstra, Mel Gorman
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying,
    "Matthew Wilcox (Oracle)", Rafael Aquini, Andrew Morton, Ingo Molnar,
    Rik van Riel, Johannes Weiner, Dave Hansen, Andi Kleen, Michal Hocko,
    David Rientjes, linux-api@vger.kernel.org
Subject: [PATCH -V7 3/3] NOT kernel/numactl: Support to enable Linux kernel NUMA balancing
Date: Fri, 4 Dec 2020 17:15:34 +0800
Message-Id: <20201204091534.72239-4-ying.huang@intel.com>
In-Reply-To: <20201204091534.72239-1-ying.huang@intel.com>
References: <20201204091534.72239-1-ying.huang@intel.com>

A new API, numa_set_membind_balancing(), is added to libnuma.  It is
the same as numa_set_membind() except that Linux kernel NUMA
balancing will be enabled for the task if the feature is supported by
the kernel.

At the same time, a new option, --balancing (-b), is added to
numactl.  It can be used before the --membind/-m memory policy in the
command line.  With it, Linux kernel NUMA balancing will be enabled
for the process if --membind/-m is used and the feature is supported
by the kernel.
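
For illustration, a minimal sketch of how an application might call
the new API (assuming a libnuma built with this patch; the node
string "1,3" is an arbitrary example):

	/* Build with: cc demo.c -lnuma */
	#include <stdio.h>
	#include <stdlib.h>
	#include <numa.h>

	int main(void)
	{
		struct bitmask *bmp;

		if (numa_available() < 0) {
			fprintf(stderr, "NUMA is not available\n");
			exit(1);
		}
		bmp = numa_parse_nodestring("1,3");
		if (!bmp) {
			fprintf(stderr, "failed to parse node string\n");
			exit(1);
		}
		/*
		 * Bind memory to nodes 1 and 3 with NUMA balancing enabled;
		 * internally this falls back to plain numa_set_membind()
		 * semantics if the kernel doesn't support
		 * MPOL_F_NUMA_BALANCING (EINVAL).
		 */
		numa_set_membind_balancing(bmp);
		numa_bitmask_free(bmp);
		/* ... run the memory-intensive workload here ... */
		return 0;
	}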
Signed-off-by: "Huang, Ying"
---
 libnuma.c         | 14 ++++++++++++++
 numa.3            | 15 +++++++++++++++
 numa.h            |  4 ++++
 numactl.8         | 12 ++++++++++++
 numactl.c         | 17 ++++++++++++++---
 numaif.h          |  3 +++
 versions.ldscript |  8 ++++++++
 7 files changed, 70 insertions(+), 3 deletions(-)

diff --git a/libnuma.c b/libnuma.c
index 88f479b..f073c50 100644
--- a/libnuma.c
+++ b/libnuma.c
@@ -1064,6 +1064,20 @@ numa_set_membind_v2(struct bitmask *bmp)
 
 make_internal_alias(numa_set_membind_v2);
 
+void
+numa_set_membind_balancing(struct bitmask *bmp)
+{
+	/* MPOL_F_NUMA_BALANCING: ignore if unsupported */
+	if (set_mempolicy(MPOL_BIND | MPOL_F_NUMA_BALANCING,
+			  bmp->maskp, bmp->size + 1) < 0) {
+		if (errno == EINVAL) {
+			errno = 0;
+			numa_set_membind_v2(bmp);
+		} else
+			numa_error("set_mempolicy");
+	}
+}
+
 /*
  * copy a bitmask map body to a numa.h nodemask_t structure
  */
diff --git a/numa.3 b/numa.3
index 3e18098..af01c8f 100644
--- a/numa.3
+++ b/numa.3
@@ -80,6 +80,8 @@ numa \- NUMA policy library
 .br
 .BI "void numa_set_membind(struct bitmask *" nodemask );
 .br
+.BI "void numa_set_membind_balancing(struct bitmask *" nodemask );
+.br
 .B struct bitmask *numa_get_membind(void);
 .sp
 .BI "void *numa_alloc_onnode(size_t " size ", int " node );
@@ -538,6 +540,19 @@ that contains nodes other than those in the mask returned
 by
 .IR numa_get_mems_allowed ()
 will result in an error.
+.BR numa_set_membind_balancing ()
+sets the memory allocation mask and enables Linux kernel NUMA
+balancing for the task if the feature is supported by the kernel.
+The task will only allocate memory from the nodes set in
+.IR nodemask .
+Passing an empty
+.I nodemask
+or a
+.I nodemask
+that contains nodes other than those in the mask returned by
+.IR numa_get_mems_allowed ()
+will result in an error.
+
 .BR numa_get_membind ()
 returns the mask of nodes from which memory can currently be
 allocated.
diff --git a/numa.h b/numa.h
index bd1d676..5d8543a 100644
--- a/numa.h
+++ b/numa.h
@@ -192,6 +192,10 @@ void numa_set_localalloc(void);
 /* Only allocate memory from the nodes set in mask. 0 to turn off */
 void numa_set_membind(struct bitmask *nodemask);
 
+/* Only allocate memory from the nodes set in mask. Optimize page
+   placement with Linux kernel NUMA balancing if possible. 0 to turn off */
+void numa_set_membind_balancing(struct bitmask *bmp);
+
 /* Return current membind */
 struct bitmask *numa_get_membind(void);
 
diff --git a/numactl.8 b/numactl.8
index f3bb22b..7d52688 100644
--- a/numactl.8
+++ b/numactl.8
@@ -25,6 +25,8 @@ numactl \- Control NUMA policy for processes or shared memory
 [
 .B \-\-all
 ] [
+.B \-\-balancing
+] [
 .B \-\-interleave nodes
 ] [
 .B \-\-preferred node
@@ -168,6 +170,12 @@ but if memory cannot be allocated there fall back to other nodes.
 This option takes only a single node number.
 Relative notation may be used.
 .TP
+.B \-\-balancing, \-b
+Enable Linux kernel NUMA balancing for the process if it is supported by the kernel.
+This should only be used with
+.IR \-\-membind / \-m ;
+it is otherwise ignored.
+.TP
 .B \-\-show, \-s
 Show NUMA policy settings of the current process.
 .TP
@@ -278,6 +286,10 @@ numactl \-\-cpunodebind=0 \-\-membind=0,1 -- process -l
 Run process as above, but with an option (-l) that would be confused with
 a numactl option.
 
+numactl \-\-cpunodebind=0 \-\-balancing \-\-membind=0,1 process
+Run process on node 0 with memory allocated on nodes 0 and 1. Optimize the
+page placement with the Linux kernel NUMA balancing mechanism if possible.
+
 numactl \-\-cpunodebind=netdev:eth0 \-\-membind=netdev:eth0 network-server
 Run network-server on the node of network device eth0 with its memory also
 in the same node.
diff --git a/numactl.c b/numactl.c
index df9dbcb..5a9d2df 100644
--- a/numactl.c
+++ b/numactl.c
@@ -45,6 +45,7 @@ struct option opts[] = {
 	{"membind", 1, 0, 'm'},
 	{"show", 0, 0, 's' },
 	{"localalloc", 0,0, 'l'},
+	{"balancing", 0, 0, 'b'},
 	{"hardware", 0,0,'H' },
 
 	{"shm", 1, 0, 'S'},
@@ -65,9 +66,10 @@ struct option opts[] = {
 void usage(void)
 {
 	fprintf(stderr,
-		"usage: numactl [--all | -a] [--interleave= | -i <nodes>] [--preferred= | -p <node>]\n"
-		"               [--physcpubind= | -C <cpus>] [--cpunodebind= | -N <nodes>]\n"
-		"               [--membind= | -m <nodes>] [--localalloc | -l] command args ...\n"
+		"usage: numactl [--all | -a] [--balancing | -b] [--interleave= | -i <nodes>]\n"
+		"               [--preferred= | -p <node>] [--physcpubind= | -C <cpus>]\n"
+		"               [--cpunodebind= | -N <nodes>] [--membind= | -m <nodes>]\n"
+		"               [--localalloc | -l] command args ...\n"
 		"       numactl [--show | -s]\n"
 		"       numactl [--hardware | -H]\n"
 		"       numactl [--length | -l <length>] [--offset | -o <offset>] [--shmmode | -M <mode>]\n"
@@ -90,6 +92,8 @@ void usage(void)
 		"all numbers and ranges can be made cpuset-relative with +\n"
 		"the old --cpubind argument is deprecated.\n"
 		"use --cpunodebind or --physcpubind instead\n"
+		"use --balancing | -b to enable Linux kernel NUMA balancing\n"
+		"for the process if it is supported by kernel\n"
 		"<length> can have g (GB), m (MB) or k (KB) suffixes\n");
 	exit(1);
 }
@@ -338,6 +342,7 @@ int do_dump = 0;
 int shmattached = 0;
 int did_node_cpu_parse = 0;
 int parse_all = 0;
+int numa_balancing = 0;
 char *shmoption;
 
 void check_cpubind(int flag)
@@ -431,6 +436,10 @@ int main(int ac, char **av)
 			nopolicy();
 			hardware();
 			exit(0);
+		case 'b': /* --balancing */
+			nopolicy();
+			numa_balancing = 1;
+			break;
 		case 'i': /* --interleave */
 			checknuma();
 			if (parse_all)
@@ -507,6 +516,8 @@ int main(int ac, char **av)
 		numa_set_bind_policy(1);
 		if (shmfd >= 0) {
 			numa_tonodemask_memory(shmptr, shmlen, mask);
+		} else if (numa_balancing) {
+			numa_set_membind_balancing(mask);
 		} else {
 			numa_set_membind(mask);
 		}
diff --git a/numaif.h b/numaif.h
index 91aa230..32c12c3 100644
--- a/numaif.h
+++ b/numaif.h
@@ -29,6 +29,9 @@ extern long move_pages(int pid, unsigned long count,
 #define MPOL_LOCAL     4
 #define MPOL_MAX       5
 
+/* Flags for set_mempolicy, specified in mode */
+#define MPOL_F_NUMA_BALANCING	(1 << 13) /* Optimize with NUMA balancing if possible */
+
 /* Flags for get_mem_policy */
 #define MPOL_F_NODE    (1<<0)   /* return next il node or node of address */
 				/* Warning: MPOL_F_NODE is unsupported and
diff --git a/versions.ldscript b/versions.ldscript
index 23074a0..358eeeb 100644
--- a/versions.ldscript
+++ b/versions.ldscript
@@ -146,3 +146,11 @@ libnuma_1.4 {
   local:
     *;
 } libnuma_1.3;
+
+# New interface for membind with NUMA balancing optimization
+libnuma_1.5 {
+  global:
+    numa_set_membind_balancing;
+  local:
+    *;
+} libnuma_1.4;
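
To check what memory policy a process spawned by numactl actually
inherited, the task policy can be queried from inside the program.  A
small sketch, not part of this series (show_membind.c is a
hypothetical helper; note that MPOL_F_NUMA_BALANCING is translated to
internal kernel flags by patch 1/3 and, as of that patch, is not
echoed back by get_mempolicy(), so only the bind nodemask is visible
here):

	/*
	 * show_membind.c: print the nodes this task may allocate from.
	 * Build with: cc show_membind.c -o show_membind -lnuma
	 * Run as:     numactl --balancing --membind=0,1 ./show_membind
	 */
	#include <stdio.h>
	#include <numa.h>

	int main(void)
	{
		struct bitmask *mask;
		int node;

		if (numa_available() < 0) {
			fprintf(stderr, "NUMA is not available\n");
			return 1;
		}
		mask = numa_get_membind();	/* current membind policy */
		printf("memory bound to nodes:");
		for (node = 0; node <= numa_max_node(); node++)
			if (numa_bitmask_isbitset(mask, node))
				printf(" %d", node);
		printf("\n");
		numa_bitmask_free(mask);
		return 0;
	}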