From patchwork Tue Sep 26 06:09:07 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Huang, Ying" <ying.huang@intel.com>
X-Patchwork-Id: 13398715
From: Huang Ying <ying.huang@intel.com>
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Arjan Van De Ven, Huang Ying, Mel Gorman, Vlastimil Babka,
	David Hildenbrand, Johannes Weiner, Dave Hansen, Michal Hocko,
	Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: [PATCH -V2 06/10] mm: add framework for PCP high auto-tuning
Date: Tue, 26 Sep 2023 14:09:07 +0800
Message-Id: <20230926060911.266511-7-ying.huang@intel.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20230926060911.266511-1-ying.huang@intel.com>
References: <20230926060911.266511-1-ying.huang@intel.com>

The page allocation performance requirements of different workloads
are usually different.  So, we need to tune PCP (per-CPU pageset) high
to optimize the workload's page allocation performance.  Now we have a
system-wide sysctl knob (percpu_pagelist_high_fraction) to tune PCP
high by hand.  But it's hard to find the best value by hand, and one
global configuration may not work well for all the different workloads
that run on the same system.
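To make the knob's arithmetic concrete, here is a minimal stand-alone
sketch (illustration only, not part of this patch; example_pcp_high()
and the 4 GiB / 8-CPU figures are invented) of how a fraction value
translates into a per-CPU high limit, mirroring the zone_highsize()
logic changed below:

#include <stdio.h>

/*
 * Stand-alone illustration, not kernel code: the fraction knob maps to
 * a per-CPU high value as managed_pages / fraction, split across the
 * CPUs local to the zone (see zone_highsize() in the diff below).
 */
static long example_pcp_high(long managed_pages, int fraction, int nr_cpus)
{
	long total_pages = managed_pages / fraction;	/* zone-wide budget */

	if (nr_cpus < 1)		/* guard against a CPU-less node */
		nr_cpus = 1;
	return total_pages / nr_cpus;	/* equal share per local CPU */
}

int main(void)
{
	long managed = 1L << 20;	/* example: 4 GiB zone in 4 KiB pages */

	/* the minimal fraction (8) yields the largest per-CPU high */
	printf("pcp->high: %ld pages\n", example_pcp_high(managed, 8, 8));
	return 0;
}

This prints "pcp->high: 16384 pages" for the example figures.  Finding
a good fraction like this by hand, per workload, is exactly the
problem described above.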
One solution to these issues is to tune PCP high of each CPU
automatically.  This patch adds a framework for PCP high auto-tuning.
With it, pcp->high of each CPU will be changed automatically by the
tuning algorithm at runtime.  The minimal high (pcp->high_min) is the
original PCP high value, calculated based on the low watermark pages.
The maximal high (pcp->high_max) is the PCP high value when the
percpu_pagelist_high_fraction sysctl knob is set to
MIN_PERCPU_PAGELIST_HIGH_FRACTION, that is, the largest pcp->high that
can be set by hand via the sysctl knob.

It's possible that PCP high auto-tuning doesn't work well for some
workloads.  So, when PCP high is tuned by hand via the sysctl knob,
auto-tuning will be disabled and the PCP high set by hand will be used
instead.

This patch only adds the framework, so pcp->high will always be set to
pcp->high_min (the original default).  The actual auto-tuning
algorithm will be added in following patches in this series.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton
Cc: Mel Gorman
Cc: Vlastimil Babka
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Michal Hocko
Cc: Pavel Tatashin
Cc: Matthew Wilcox
Cc: Christoph Lameter
---
 Documentation/admin-guide/sysctl/vm.rst | 12 +++--
 include/linux/mmzone.h                  |  5 +-
 mm/page_alloc.c                         | 71 ++++++++++++++++---------
 3 files changed, 58 insertions(+), 30 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 45ba1f4dc004..7386366fe114 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -843,10 +843,14 @@ each zone between per-cpu lists.
 The batch value of each per-cpu page list remains the same regardless of
 the value of the high fraction so allocation latencies are unaffected.
 
-The initial value is zero. Kernel uses this value to set the high pcp->high
-mark based on the low watermark for the zone and the number of local
-online CPUs.  If the user writes '0' to this sysctl, it will revert to
-this default behavior.
+The initial value is zero.  With this value, the kernel will tune
+pcp->high automatically according to workload requirements.  The
+lower limit of the tuning is based on the low watermark for the zone
+and the number of local online CPUs.  The upper limit is the high
+value that results when the sysctl is set to its minimal value (8).
+If the user writes '0' to this sysctl, it will revert to this
+default behavior.  In other words, any other value disables
+auto-tuning and the user-specified pcp->high will be used.
 
 
 stat_interval
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4f7420e35fbb..d6cfb5023f3e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -683,6 +683,8 @@ struct per_cpu_pages {
 	spinlock_t lock;	/* Protects lists field */
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
+	int high_min;		/* min high watermark */
+	int high_max;		/* max high watermark */
 	int batch;		/* chunk size for buddy add/remove */
 	u8 flags;		/* protected by pcp->lock */
 	u8 alloc_factor;	/* batch scaling factor during allocate */
@@ -842,7 +844,8 @@ struct zone {
 	 * the high and batch values are copied to individual pagesets for
 	 * faster access
 	 */
-	int pageset_high;
+	int pageset_high_min;
+	int pageset_high_max;
 	int pageset_batch;
 
 #ifndef CONFIG_SPARSEMEM
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b9226845abf7..df07580dbd53 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2353,7 +2353,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
 static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 		       bool free_high)
 {
-	int high = READ_ONCE(pcp->high);
+	int high = READ_ONCE(pcp->high_min);
 
 	if (unlikely(!high || free_high))
 		return 0;
@@ -2692,7 +2692,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, int order)
 {
 	int high, batch, max_nr_alloc;
 
-	high = READ_ONCE(pcp->high);
+	high = READ_ONCE(pcp->high_min);
 	batch = READ_ONCE(pcp->batch);
 
 	/* Check for PCP disabled or boot pageset */
@@ -5298,14 +5298,15 @@ static int zone_batchsize(struct zone *zone)
 }
 
 static int percpu_pagelist_high_fraction;
-static int zone_highsize(struct zone *zone, int batch, int cpu_online)
+static int zone_highsize(struct zone *zone, int batch, int cpu_online,
+			 int high_fraction)
 {
 #ifdef CONFIG_MMU
 	int high;
 	int nr_split_cpus;
 	unsigned long total_pages;
 
-	if (!percpu_pagelist_high_fraction) {
+	if (!high_fraction) {
 		/*
 		 * By default, the high value of the pcp is based on the zone
 		 * low watermark so that if they are full then background
@@ -5318,15 +5319,15 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
 		 * value is based on a fraction of the managed pages in the
 		 * zone.
 		 */
-		total_pages = zone_managed_pages(zone) / percpu_pagelist_high_fraction;
+		total_pages = zone_managed_pages(zone) / high_fraction;
 	}
 
 	/*
 	 * Split the high value across all online CPUs local to the zone. Note
 	 * that early in boot that CPUs may not be online yet and that during
 	 * CPU hotplug that the cpumask is not yet updated when a CPU is being
-	 * onlined. For memory nodes that have no CPUs, split pcp->high across
-	 * all online CPUs to mitigate the risk that reclaim is triggered
+	 * onlined. For memory nodes that have no CPUs, split the high value
+	 * across all online CPUs to mitigate the risk that reclaim is triggered
 	 * prematurely due to pages stored on pcp lists.
 	 */
 	nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) + cpu_online;
@@ -5354,19 +5355,21 @@
  * However, guaranteeing these relations at all times would require e.g. write
  * barriers here but also careful usage of read barriers at the read side, and
  * thus be prone to error and bad for performance. Thus the update only prevents
- * store tearing. Any new users of pcp->batch and pcp->high should ensure they
- * can cope with those fields changing asynchronously, and fully trust only the
- * pcp->count field on the local CPU with interrupts disabled.
+ * store tearing. Any new users of pcp->batch, pcp->high_min and pcp->high_max
+ * should ensure they can cope with those fields changing asynchronously, and
+ * fully trust only the pcp->count field on the local CPU with interrupts
+ * disabled.
 *
 * mutex_is_locked(&pcp_batch_high_lock) required when calling this function
 * outside of boot time (or some other assurance that no concurrent updaters
 * exist).
 */
-static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
-		unsigned long batch)
+static void pageset_update(struct per_cpu_pages *pcp, unsigned long high_min,
+			   unsigned long high_max, unsigned long batch)
 {
 	WRITE_ONCE(pcp->batch, batch);
-	WRITE_ONCE(pcp->high, high);
+	WRITE_ONCE(pcp->high_min, high_min);
+	WRITE_ONCE(pcp->high_max, high_max);
 }
 
 static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonestat *pzstats)
@@ -5386,20 +5389,21 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
 	 * need to be as careful as pageset_update() as nobody can access the
 	 * pageset yet.
 	 */
-	pcp->high = BOOT_PAGESET_HIGH;
+	pcp->high_min = BOOT_PAGESET_HIGH;
+	pcp->high_max = BOOT_PAGESET_HIGH;
 	pcp->batch = BOOT_PAGESET_BATCH;
 	pcp->free_factor = 0;
 }
 
-static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
-		unsigned long batch)
+static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high_min,
+					      unsigned long high_max, unsigned long batch)
 {
 	struct per_cpu_pages *pcp;
 	int cpu;
 
 	for_each_possible_cpu(cpu) {
 		pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
-		pageset_update(pcp, high, batch);
+		pageset_update(pcp, high_min, high_max, batch);
 	}
 }
 
@@ -5409,19 +5413,34 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
 */
static void zone_set_pageset_high_and_batch(struct zone *zone, int cpu_online)
{
-	int new_high, new_batch;
+	int new_high_min, new_high_max, new_batch;
 
 	new_batch = max(1, zone_batchsize(zone));
-	new_high = zone_highsize(zone, new_batch, cpu_online);
+	if (percpu_pagelist_high_fraction) {
+		new_high_min = zone_highsize(zone, new_batch, cpu_online,
+					     percpu_pagelist_high_fraction);
+		/*
+		 * PCP high is tuned manually; disable auto-tuning by
+		 * setting high_min and high_max to the manual value.
+		 */
+		new_high_max = new_high_min;
+	} else {
+		new_high_min = zone_highsize(zone, new_batch, cpu_online, 0);
+		new_high_max = zone_highsize(zone, new_batch, cpu_online,
+					     MIN_PERCPU_PAGELIST_HIGH_FRACTION);
+	}
 
-	if (zone->pageset_high == new_high &&
+	if (zone->pageset_high_min == new_high_min &&
+	    zone->pageset_high_max == new_high_max &&
 	    zone->pageset_batch == new_batch)
 		return;
 
-	zone->pageset_high = new_high;
+	zone->pageset_high_min = new_high_min;
+	zone->pageset_high_max = new_high_max;
 	zone->pageset_batch = new_batch;
-	__zone_set_pageset_high_and_batch(zone, new_high, new_batch);
+	__zone_set_pageset_high_and_batch(zone, new_high_min, new_high_max,
+					  new_batch);
 }
 
 void __meminit setup_zone_pageset(struct zone *zone)
@@ -5529,7 +5548,8 @@ __meminit void zone_pcp_init(struct zone *zone)
 	 */
 	zone->per_cpu_pageset = &boot_pageset;
 	zone->per_cpu_zonestats = &boot_zonestats;
-	zone->pageset_high = BOOT_PAGESET_HIGH;
+	zone->pageset_high_min = BOOT_PAGESET_HIGH;
+	zone->pageset_high_max = BOOT_PAGESET_HIGH;
 	zone->pageset_batch = BOOT_PAGESET_BATCH;
 
 	if (populated_zone(zone))
@@ -6431,13 +6451,14 @@ EXPORT_SYMBOL(free_contig_range);
 void zone_pcp_disable(struct zone *zone)
 {
 	mutex_lock(&pcp_batch_high_lock);
-	__zone_set_pageset_high_and_batch(zone, 0, 1);
+	__zone_set_pageset_high_and_batch(zone, 0, 0, 1);
 	__drain_all_pages(zone, true);
 }
 
 void zone_pcp_enable(struct zone *zone)
 {
-	__zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch);
+	__zone_set_pageset_high_and_batch(zone, zone->pageset_high_min,
+					  zone->pageset_high_max, zone->pageset_batch);
 	mutex_unlock(&pcp_batch_high_lock);
 }
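
As a footnote for readers following the series: the stand-alone sketch
below (illustration only, not kernel code from this or later patches;
clamp_high(), struct pcp_example and all numbers are invented) shows
the invariant the new fields establish.  A future auto-tuner may move
the effective high anywhere within [high_min, high_max], while manual
tuning via the sysctl collapses that range to a single value, leaving
the tuner no room to move:

#include <stdio.h>

struct pcp_example {
	int high;	/* effective high, set by the (future) tuner */
	int high_min;	/* default high, from the zone low watermark */
	int high_max;	/* high when the sysctl fraction is minimal (8) */
};

/* Keep a proposed high inside [high_min, high_max]. */
static int clamp_high(const struct pcp_example *pcp, int wanted)
{
	if (wanted < pcp->high_min)
		return pcp->high_min;
	if (wanted > pcp->high_max)
		return pcp->high_max;
	return wanted;
}

int main(void)
{
	struct pcp_example pcp = { .high_min = 186, .high_max = 16384 };

	/* Auto-tuning: a tuner proposal inside the range is kept. */
	pcp.high = clamp_high(&pcp, 4000);
	printf("auto-tuned high: %d pages\n", pcp.high);

	/* Manual tuning: high_min == high_max pins the value. */
	pcp.high_min = pcp.high_max = 1024;
	pcp.high = clamp_high(&pcp, 4000);
	printf("manually pinned high: %d pages\n", pcp.high);
	return 0;
}

In this patch itself no tuner exists yet, so the kernel effectively
stays at high_min everywhere (nr_pcp_high() and nr_pcp_alloc() read
pcp->high_min above); the range only becomes meaningful once the
auto-tuning algorithm lands in the following patches.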