[-V9,4/6] memory tiering: hot page selection with hint page fault latency

To optimize page placement in a memory tiering system with NUMA
balancing, the hot pages in the slow memory node need to be
identified.  Essentially, the original NUMA balancing implementation
selects the most recently accessed (MRU) pages as the hot pages.
But recency alone isn't a very good indicator of hot pages, because a
recently accessed page isn't necessarily a frequently accessed one.

So, this patch implements a better hot page selection algorithm,
based on NUMA balancing page table scanning and hint page faults, as
follows:

- When the page tables of the processes are scanned to change the
  PTE/PMD to PROT_NONE, the current time is recorded in struct page
  as the scan time.

- When the page is accessed, a hint page fault occurs.  The scan
  time is read from the struct page, and the hint page fault latency
  is defined as

    hint page fault time - scan time

The shorter the hint page fault latency of a page, the more likely it
is that the page is accessed frequently.  So the hint page fault
latency is a good estimate of whether a page is hot or cold.
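
As a rough illustration of the definition above, the following
userspace C sketch computes the hint page fault latency from two
millisecond timestamps.  The function and parameter names are
hypothetical and are not taken from the patch; unsigned subtraction is
used so that the result stays correct if the 32-bit clock wraps once
between the scan and the fault.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel implementation: the latency
 * between the page table scan and the hint page fault, both given in
 * milliseconds.  Unsigned arithmetic is modulo 2^32, so a single
 * wrap of the clock between the two events is handled.
 */
static uint32_t hint_fault_latency_ms(uint32_t scan_time_ms,
				      uint32_t fault_time_ms)
{
	return fault_time_ms - scan_time_ms;
}
```

For example, a page scanned at 1000 ms and faulted at 1250 ms has a
latency of 250 ms.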

But it's hard to find some extra space in struct page to hold the scan
time.  Fortunately, we can reuse some bits used by the original NUMA
balancing.

NUMA balancing uses some bits in struct page to store the CPU and PID
that last accessed the page (see page_cpupid_xchg_last()).  These
are used by the multi-stage node selection algorithm to avoid
migrating pages that are accessed by multiple NUMA nodes back and
forth.  But pages in the slow memory node need to be promoted to the
fast memory node as long as they are hot, even if they are accessed
by multiple NUMA nodes.  So the accessing CPU and PID information is
unnecessary for the slow memory pages, and we can reuse these bits in
struct page to record their scan time.  For the fast memory pages,
these bits are used as before.
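
To make the bit reuse more concrete, here is a minimal userspace
sketch of storing a truncated scan time in a small field and
recovering the latency from it.  The field width (16 bits), the
struct, and the function names are all assumptions made for
illustration; the real width depends on how many bits the kernel
configuration leaves for the CPU/PID field, which this description
does not state.

```c
#include <stdint.h>

/* Assumed width of the reused field; illustrative only. */
#define SCAN_TIME_BITS 16u
#define SCAN_TIME_MASK ((1u << SCAN_TIME_BITS) - 1u)

/* Hypothetical stand-in for the reused bits in struct page. */
struct fake_page {
	uint32_t cpupid_or_scan_time;
};

/* Keep only the low SCAN_TIME_BITS bits of the current time. */
static void record_scan_time(struct fake_page *page, uint32_t now_ms)
{
	page->cpupid_or_scan_time = now_ms & SCAN_TIME_MASK;
}

/*
 * Latency modulo 2^SCAN_TIME_BITS.  The result is only exact while
 * the real latency is shorter than 2^SCAN_TIME_BITS milliseconds.
 */
static uint32_t fault_latency_ms(const struct fake_page *page,
				 uint32_t now_ms)
{
	return (now_ms - page->cpupid_or_scan_time) & SCAN_TIME_MASK;
}
```

With a truncated timestamp, a latency longer than the field can
represent aliases to a shorter one, so in this sketch the field must
be wide enough to cover latencies well beyond the hot threshold.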

The remaining problem is how to determine the hot threshold.  This is
not easy to do automatically, so we provide a sysctl knob:
kernel.numa_balancing_hot_threshold_ms.  All pages with a hint page
fault latency below the threshold are considered hot.  The system
administrator can choose the hot threshold based on various
information, such as the PMEM bandwidth limit, the average number of
pages that pass the hot threshold, etc.  The default hot threshold is
1 second, which works well in our performance tests.
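
The threshold test itself is a single comparison.  The sketch below
shows it in userspace C; the macro and function names are
hypothetical, and only the strict "less than" comparison and the
1 second default come from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Default from the description above: 1 second. */
#define DEFAULT_HOT_THRESHOLD_MS 1000u

/*
 * Hypothetical helper: a page counts as hot only if its hint page
 * fault latency is strictly below the threshold.
 */
static bool page_is_hot(uint32_t latency_ms, uint32_t hot_threshold_ms)
{
	return latency_ms < hot_threshold_ms;
}
```

Under the default, a page that faults 999 ms after the scan is hot,
while one that faults exactly 1000 ms after the scan is not.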

The downside of the patch is that the response time to changes in the
workload hot spot may be much longer.  For example,

- A previously cold memory area becomes hot.

- The hint page fault will be triggered.  But the hint page fault
  latency isn't shorter than the hot threshold.  So the pages will
  not be promoted.

- When the memory area is scanned again, maybe after a scan period,
  the hint page fault latency measured will be shorter than the hot
  threshold and the pages will be promoted.
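
The sequence above can be worked through with a small model.  The
sketch below is a simplified userspace simulation, not kernel code:
it assumes that scans happen at exact multiples of a fixed scan
period starting at time 0, that the page is accessed at a fixed
interval once it becomes hot, and that only the first access after
each scan causes a hint page fault.  All names and numbers are
hypothetical.

```c
#include <stdint.h>

/*
 * Return the time (ms) of the first hint page fault whose latency is
 * below threshold_ms, or -1 if none occurs for scans up to limit_ms.
 * Requires scan_period_ms > 0 and access_interval_ms > 0.
 */
static int64_t first_promotion_ms(int64_t scan_period_ms,
				  int64_t hot_since_ms,
				  int64_t access_interval_ms,
				  int64_t threshold_ms,
				  int64_t limit_ms)
{
	int64_t scan;

	for (scan = 0; scan <= limit_ms; scan += scan_period_ms) {
		/* First access at or after this scan. */
		int64_t access = hot_since_ms;

		if (access < scan) {
			int64_t steps = (scan - hot_since_ms +
					 access_interval_ms - 1) /
					access_interval_ms;
			access = hot_since_ms + steps * access_interval_ms;
		}
		/* An access after the next scan belongs to that scan. */
		if (access >= scan + scan_period_ms)
			continue;
		/* Hint page fault latency for this scan. */
		if (access - scan < threshold_ms)
			return access;
	}
	return -1;
}
```

With an assumed 60 s scan period, an area that becomes hot at 30.1 s
and is then accessed every 200 ms first faults with a latency of
30.1 s, which is above the 1 s threshold.  It is promoted only at
60.1 s, after the next scan, when the latency is 100 ms.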

To mitigate this,

- If there is enough free space in the fast memory node, the hot
  threshold will not be used; all pages will be promoted upon the hint
  page fault for fast response.

- If fast response is more important for system performance, the
  administrator can set a higher hot threshold.
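
The first mitigation can be sketched as a promotion decision that
bypasses the threshold when the fast memory node has room.  This is
an illustrative userspace sketch only: the function name and the
free_watermark_pages parameter are hypothetical, since the
description does not say how "enough free space" is measured.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical promotion decision for a slow memory page that has
 * just taken a hint page fault.
 */
static bool should_promote(uint64_t fast_free_pages,
			   uint64_t free_watermark_pages,
			   uint32_t latency_ms,
			   uint32_t hot_threshold_ms)
{
	/* Enough free space: promote regardless of latency. */
	if (fast_free_pages > free_watermark_pages)
		return true;

	/* Otherwise only promote pages that look hot. */
	return latency_ms < hot_threshold_ms;
}
```

In this sketch a page with a 5 s latency is still promoted while free
space is above the assumed watermark, and is rejected under a 1 s
threshold once free space falls below it.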

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Wei Xu <weixugc@google.com>
Cc: osalvador <osalvador@suse.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/mm.h           | 29 ++++++++++++++++
 include/linux/sched/sysctl.h |  1 +
 kernel/sched/fair.c          | 67 ++++++++++++++++++++++++++++++++++++
 kernel/sysctl.c              |  7 ++++
 mm/huge_memory.c             | 13 +++++--
 mm/memory.c                  | 11 +++++-
 mm/migrate.c                 | 12 +++++++
 mm/mmzone.c                  | 17 +++++++++
 mm/mprotect.c                |  8 ++++-
 9 files changed, 160 insertions(+), 5 deletions(-)

Message ID	20211008083938.1702663-5-ying.huang@intel.com (mailing list archive)
State	New
Headers	From: Huang Ying <ying.huang@intel.com>
	To: linux-kernel@vger.kernel.org
	Subject: [PATCH -V9 4/6] memory tiering: hot page selection with hint page fault latency
	Date: Fri, 8 Oct 2021 16:39:36 +0800
	Message-Id: <20211008083938.1702663-5-ying.huang@intel.com>
	In-Reply-To: <20211008083938.1702663-1-ying.huang@intel.com>
	References: <20211008083938.1702663-1-ying.huang@intel.com>
	X-Mailer: git-send-email 2.30.2
	List-ID: <linux-mm.kvack.org>
Series	NUMA balancing: optimize memory placement for memory tiering system
	[-V9,0/6] NUMA balancing: optimize memory placement for memory tiering system
	[-V9,1/6] NUMA Balancing: add page promotion counter
	[-V9,2/6] NUMA balancing: optimize page placement for memory tiering system
	[-V9,3/6] memory tiering: skip to scan fast memory
	[-V9,4/6] memory tiering: hot page selection with hint page fault latency
	[-V9,5/6] memory tiering: rate limit NUMA migration throughput
	[-V9,6/6] memory tiering: adjust hot threshold automatically
