[RFC,-V2,6/8] autonuma, memory tiering: Select hotter pages to promote to fast memory node

From: Huang Ying <ying.huang@intel.com>

In a memory tiering system, to maximize overall system performance,
the hot pages should be placed in the fast memory node and the cold
pages in the slow memory node.  The original memory tiering autonuma
implementation tries to promote almost all recently accessed pages,
and relies on the LRU algorithm in page reclaim to keep the hot pages
in the fast memory node and demote the cold pages to the slow memory
node.  The problem with this approach is that cold pages with a low
access frequency may be promoted and then demoted again, which wastes
memory bandwidth.  And because migration is rate-limited, the hot
pages have to compete with the cold pages for the limited migration
bandwidth.

If we could select the hotter pages to promote to the fast memory node
in the first place, then the wasted migration bandwidth would be
reduced and the hot pages would be promoted more quickly.

The patch "autonuma, memory tiering: Only promote page if accessed
twice" in the series prevents the really cold pages that were not
accessed in the last scan period from being promoted.  But the scan
period can be as long as tens of seconds, so that check alone is not
selective enough to pick out the hotter pages.

To identify the hotter pages, this patch implements a method
suggested by Jianshi and Fengguang.  It is based on autonuma page
table scanning and hint page faults, as follows:

- When a range of the page table is scanned in autonuma, the timestamp
  and the address range are recorded in a ring buffer in struct
  mm_struct.  So we have information about the most recent N scans.

- When the autonuma hint page fault occurs, the fault address is
  searched in the ring buffer to get its scanning timestamp.  The hint
  page fault latency is defined as

    hint page fault timestamp - scan timestamp

  The more frequently a page is accessed, the more likely its hint
  page fault latency is to be short.  So the hint page fault latency
  is a good estimate of how hot the page is.
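
The two steps above can be sketched in user-space C as follows.  This
is only an illustration, not the code in the patch: the names
scan_record, scan_history, scan_history_record() and
hint_fault_latency_ms() are hypothetical, timestamps are plain
milliseconds, and returning -1 for an address that is not in the
history is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define SCAN_HISTORY_SIZE 16	/* hypothetical name; 16 entries as in the text */

/* One record per scanned range: [start, end) and when it was scanned. */
struct scan_record {
	uintptr_t start;
	uintptr_t end;
	uint64_t scan_ms;
};

/* Hypothetical stand-in for the ring buffer kept in struct mm_struct. */
struct scan_history {
	struct scan_record rec[SCAN_HISTORY_SIZE];
	unsigned int next;	/* slot the next record overwrites */
	unsigned int count;	/* number of valid records */
};

/* Record one scanned range, overwriting the oldest record when full. */
void scan_history_record(struct scan_history *h, uintptr_t start,
			 uintptr_t end, uint64_t now_ms)
{
	assert(start < end);
	h->rec[h->next] = (struct scan_record){ start, end, now_ms };
	h->next = (h->next + 1) % SCAN_HISTORY_SIZE;
	if (h->count < SCAN_HISTORY_SIZE)
		h->count++;
}

/*
 * Return "fault timestamp - scan timestamp" in ms for addr, searching
 * from the newest record to the oldest, or -1 if addr is not covered
 * by any record in the history (assumption of this sketch).
 */
int64_t hint_fault_latency_ms(const struct scan_history *h,
			      uintptr_t addr, uint64_t fault_ms)
{
	for (unsigned int i = 1; i <= h->count; i++) {
		unsigned int idx = (h->next + SCAN_HISTORY_SIZE - i) %
				   SCAN_HISTORY_SIZE;
		const struct scan_record *r = &h->rec[idx];

		if (addr >= r->start && addr < r->end)
			return (int64_t)(fault_ms - r->scan_ms);
	}
	return -1;
}
```

Searching from the newest record first means that a range scanned
more than once within the history yields the latency relative to its
most recent scan.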

The ring buffer should be large enough to record a reasonably long
NUMA scanning history.  From task_scan_min(), the minimal interval
between task_numa_work() runs is about 100 ms by default.  So with a
size of 16 we keep about 1600 ms of history by default.  If the user
chooses a smaller sysctl_numa_balancing_scan_size, then we can only
keep a shorter history.  In general, we want to keep no less than
1000 ms of history.  So 16 seems a reasonable choice.
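
The sizing argument is simple arithmetic and can be written down as a
small sketch.  The helper names scan_history_ms() and
scan_records_needed() are hypothetical, and the sketch assumes one
ring buffer entry per task_numa_work() run.

```c
#include <assert.h>

/* History covered by the ring buffer: entries times the scan interval. */
unsigned int scan_history_ms(unsigned int nr_records,
			     unsigned int scan_interval_ms)
{
	return nr_records * scan_interval_ms;
}

/* Smallest number of records that covers at least want_ms of history. */
unsigned int scan_records_needed(unsigned int want_ms,
				 unsigned int scan_interval_ms)
{
	assert(scan_interval_ms > 0);
	return (want_ms + scan_interval_ms - 1) / scan_interval_ms;
}
```

With the default 100 ms interval, 16 records cover 1600 ms, while
1000 ms of history needs 10 records.  The extra 6 entries leave some
margin for a shorter scan interval.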

The remaining problem is how to determine the hot threshold.  That is
not easy to do automatically, so we provide a sysctl knob:
kernel.numa_balancing_hot_threshold_ms.  All pages with hint page
fault latency < the threshold are considered hot.  The system
administrator can determine the hot threshold from various
information, such as the PMEM bandwidth limit, the average number of
pages that pass the hot threshold, etc.  The default hot threshold is
1 second, which works well in our performance tests.

The patch improves the score of the pmbench memory accessing benchmark
with 80:20 read/write ratio and normal access address distribution by
9.2% with 50.3% fewer NUMA page migrations on a 2 socket Intel server
with Optane DC Persistent Memory.  That is, the cost of autonuma page
migration is reduced considerably.

The downside of the patch is that the response to a change in the
workload's hot spot may be much slower.  For example,

- A previously cold memory area becomes hot.

- The hint page fault will be triggered.  But the hint page fault
  latency may not be shorter than the hot threshold.  So the pages may
  not be promoted.

- When the memory area is scanned again, maybe after a scan period,
  the hint page fault latency measured will be shorter than the hot
  threshold and the pages will be promoted.

To mitigate this,

- If there is enough free space in the fast memory node (> high
  watermark + 2 * promotion rate limit), the hot threshold is not
  used and all pages are promoted upon hint page fault for fast
  response.

- If fast response is more important for system performance, the
  administrator can set a higher hot threshold.
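
Combined with the hot threshold check, the first mitigation gives a
promotion decision along the following lines.  This is a minimal
sketch under stated assumptions, not the code in the patch: the
struct and function names are hypothetical, and all sizes (including
the promotion rate limit) are expressed in pages for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All names here are hypothetical; sizes are in pages. */
struct fast_node_state {
	uint64_t free_pages;
	uint64_t high_wmark_pages;
	uint64_t promote_rate_limit_pages;
};

/* True when free space > high watermark + 2 * promotion rate limit. */
bool fast_node_has_room(const struct fast_node_state *n)
{
	return n->free_pages >
	       n->high_wmark_pages + 2 * n->promote_rate_limit_pages;
}

/* Decide whether to promote the page that took a hint page fault. */
bool should_promote(const struct fast_node_state *n, uint64_t latency_ms,
		    uint64_t hot_threshold_ms)
{
	assert(n);
	if (fast_node_has_room(n))
		return true;	/* plenty of room: skip the threshold */
	/* Otherwise only pages with latency < threshold are hot. */
	return latency_ms < hot_threshold_ms;
}
```

Raising hot_threshold_ms, the second mitigation, widens the set of
pages that pass the last comparison when the fast node is tight on
space.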

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: "Zhou, Jianshi" <jianshi.zhou@intel.com>
Suggested-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>

Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/mempolicy.h            |  5 +-
 include/linux/mm_types.h             |  9 ++++
 include/linux/sched/numa_balancing.h |  8 ++-
 include/linux/sched/sysctl.h         |  1 +
 kernel/sched/fair.c                  | 78 +++++++++++++++++++++++++---
 kernel/sysctl.c                      |  7 +++
 mm/huge_memory.c                     |  6 +--
 mm/memory.c                          |  7 ++-
 mm/mempolicy.c                       |  7 ++-
 9 files changed, 108 insertions(+), 20 deletions(-)

Message ID	20200218082634.1596727-7-ying.huang@intel.com (mailing list archive)
State	New, archived
Date	Tue, 18 Feb 2020 16:26:32 +0800
In-Reply-To	<20200218082634.1596727-1-ying.huang@intel.com>
Series	autonuma: Optimize memory placement in memory tiering system
  [RFC,-V2,0/8] autonuma: Optimize memory placement in memory tiering system
  [RFC,-V2,1/8] autonuma: Add NUMA_BALANCING_MEMORY_TIERING mode
  [RFC,-V2,2/8] autonuma, memory tiering: Rate limit NUMA migration throughput
  [RFC,-V2,3/8] autonuma, memory tiering: Use kswapd to demote cold pages to PMEM
  [RFC,-V2,4/8] autonuma, memory tiering: Skip to scan fastest memory
  [RFC,-V2,5/8] autonuma, memory tiering: Only promote page if accessed twice
  [RFC,-V2,6/8] autonuma, memory tiering: Select hotter pages to promote to fast memory node
  [RFC,-V2,7/8] autonuma, memory tiering: Double hot threshold for write hint page fault
  [RFC,-V2,8/8] autonuma, memory tiering: Adjust hot threshold automatically
