[-V8,4/6] memory tiering: hot page selection with hint page fault latency

To optimize page placement in a memory tiering system with NUMA
balancing, the hot pages in the slow memory node need to be
identified.  Essentially, the original NUMA balancing implementation
selects the most recently accessed (MRU) pages as the hot pages.
But recency alone is a poor indicator of how frequently a page is
accessed, so MRU isn't a very good algorithm to identify the hot
pages.

So this patch implements a better hot page selection algorithm,
based on NUMA balancing page table scanning and hint page faults, as
follows:

- When the page tables of the processes are scanned to change PTE/PMD
  to be PROT_NONE, the current time is recorded in struct page as scan
  time.

- When the page is accessed, a hint page fault occurs.  The scan
  time is read from the struct page, and the hint page fault
  latency is defined as

    hint page fault time - scan time

The shorter the hint page fault latency of a page, the more likely it
is that the page is accessed frequently.  So the hint page fault
latency is a good estimate of how hot or cold a page is.
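
As a minimal user-space sketch of the definition above (not the code
in this patch), the latency can be computed with an unsigned
subtraction so that the result stays correct when the timestamp
counter wraps.  The helper name and the millisecond unit are
assumptions made for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical helper: both timestamps are assumed to be in
 * milliseconds and taken from the same free-running 32-bit counter.
 */
static inline uint32_t hint_fault_latency_ms(uint32_t scan_time_ms,
					     uint32_t fault_time_ms)
{
	/* Unsigned subtraction is well defined across counter wraparound. */
	return fault_time_ms - scan_time_ms;
}
```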

But it's hard to find some extra space in struct page to hold the scan
time.  Fortunately, we can reuse some bits used by the original NUMA
balancing.

NUMA balancing uses some bits in struct page to store the CPU and PID
that last accessed the page (see page_cpupid_xchg_last()).  This
information is used by the multi-stage node selection algorithm to
avoid migrating pages that are accessed from multiple NUMA nodes back
and forth.  But pages in the slow memory node need to be promoted to
the fast memory node as long as they are hot, even if they are
accessed from multiple NUMA nodes.  So the accessing CPU and PID
information is unnecessary for the slow memory pages, and we can reuse
these bits in struct page to record their scan time.  For the
fast memory pages, these bits are used as before.
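
The bit reuse can be pictured with the following user-space sketch.
It is an illustration under stated assumptions, not the layout used
by the kernel: the 16-bit width, the struct fake_page stand-in, and
both helper names are hypothetical.  Because only the low bits of the
time fit, the latency is computed modulo the field width, so
latencies longer than the field can represent will alias to shorter
ones.

```c
#include <stdint.h>

/* Assumption: number of reused bits; the real width is configuration dependent. */
#define SCAN_TIME_BITS	16u
#define SCAN_TIME_MASK	((1u << SCAN_TIME_BITS) - 1u)

/* Stand-in for the reused bits in struct page (slow memory pages only). */
struct fake_page {
	uint32_t scan_time_bits;
};

/* Called when the page table scan makes the mapping PROT_NONE. */
static inline void page_set_scan_time(struct fake_page *page, uint32_t now_ms)
{
	page->scan_time_bits = now_ms & SCAN_TIME_MASK;
}

/* Called from the hint page fault; the result is modulo 2^SCAN_TIME_BITS. */
static inline uint32_t page_fault_latency_ms(const struct fake_page *page,
					     uint32_t now_ms)
{
	return (now_ms - page->scan_time_bits) & SCAN_TIME_MASK;
}
```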

The remaining problem is how to determine the hot threshold.  This
is not easy to do automatically, so we provide a sysctl knob:
kernel.numa_balancing_hot_threshold_ms.  All pages with a hint page
fault latency below the threshold are considered hot.  The system
administrator can choose the hot threshold based on information
such as the PMEM bandwidth limit, the average number of pages that
pass the hot threshold, etc.  The default hot threshold is 1 second,
which works well in our performance test.
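
The threshold test itself is a single comparison.  The sketch below
mirrors the rule and the 1 second default described above; the
variable and function names are hypothetical and only stand in for
the value behind the sysctl knob.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for kernel.numa_balancing_hot_threshold_ms. */
static uint32_t hot_threshold_ms = 1000;

/* A page is considered hot if its latency is strictly below the threshold. */
static bool page_is_hot(uint32_t latency_ms)
{
	return latency_ms < hot_threshold_ms;
}
```

With the default, a page faulting 999 ms after the scan counts as
hot, while one faulting exactly 1000 ms after the scan does not.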

The patch improves the score of the pmbench memory accessing benchmark
with an 80:20 read/write ratio and normal access address distribution
by 16.8%, with 41.1% fewer pages promoted (that is, less overhead), on
a 2-socket Intel server with Optane DC Persistent Memory.

The downside of the patch is that the response time to a change of the
workload hot spot may be much longer.  For example,

- A previously cold memory area becomes hot.

- The hint page fault is triggered, but the hint page fault
  latency isn't shorter than the hot threshold, so the pages are
  not promoted.

- When the memory area is scanned again, maybe after a scan period,
  the hint page fault latency measured will be shorter than the hot
  threshold and the pages will be promoted.

To mitigate this,

- If there is enough free space in the fast memory node, the hot
  threshold is not used; all pages are promoted upon the hint
  page fault for fast response.

- If fast response is more important for system performance, the
  administrator can set a higher hot threshold.
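
Putting the threshold and the free-space mitigation together, the
promotion decision could look roughly like the sketch below.  This is
an assumption-laden illustration rather than the code in this patch:
struct promote_ctx, its fields, and the way "enough free space" is
expressed as a watermark are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* All names here are hypothetical. */
struct promote_ctx {
	uint64_t fast_free_pages;	/* free pages in the fast memory node */
	uint64_t fast_free_watermark;	/* assumed bound for "enough free space" */
	uint32_t hot_threshold_ms;	/* value of the hot threshold knob */
};

static bool should_promote(const struct promote_ctx *ctx, uint32_t latency_ms)
{
	/* Plenty of room in the fast node: promote on any hint page fault. */
	if (ctx->fast_free_pages > ctx->fast_free_watermark)
		return true;

	/* Otherwise only promote pages that pass the hot threshold. */
	return latency_ms < ctx->hot_threshold_ms;
}
```

Raising hot_threshold_ms in this sketch widens the set of latencies
that qualify, which is the trade-off the second mitigation describes.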

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Wei Xu <weixugc@google.com>
Cc: osalvador <osalvador@suse.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/mm.h           | 29 ++++++++++++++++
 include/linux/sched/sysctl.h |  1 +
 kernel/sched/fair.c          | 67 ++++++++++++++++++++++++++++++++++++
 kernel/sysctl.c              |  7 ++++
 mm/huge_memory.c             | 13 +++++--
 mm/memory.c                  | 11 +++++-
 mm/migrate.c                 | 12 +++++++
 mm/mmzone.c                  | 17 +++++++++
 mm/mprotect.c                |  8 ++++-
 9 files changed, 160 insertions(+), 5 deletions(-)

Message ID	20210914013701.344956-5-ying.huang@intel.com (mailing list archive)
State	New
From	Huang Ying <ying.huang@intel.com>
To	linux-kernel@vger.kernel.org
Date	Tue, 14 Sep 2021 09:36:59 +0800
Series	NUMA balancing: optimize memory placement for memory tiering system
  [-V8,0/6] NUMA balancing: optimize memory placement for memory tiering system
  [-V8,1/6] NUMA balancing: optimize page placement for memory tiering system
  [-V8,2/6] memory tiering: add page promotion counter
  [-V8,3/6] memory tiering: skip to scan fast memory
  [-V8,4/6] memory tiering: hot page selection with hint page fault latency
  [-V8,5/6] memory tiering: rate limit NUMA migration throughput
  [-V8,6/6] memory tiering: adjust hot threshold automatically
