From patchwork Wed Oct 12 18:56:06 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Alex Zhu (Kernel)" <alexlzhu@fb.com>
X-Patchwork-Id: 13005364
From: <alexlzhu@fb.com>
To: <linux-mm@kvack.org>, <kernel-team@fb.com>
CC: <willy@infradead.org>, <hannes@cmpxchg.org>, <riel@surriel.com>,
        Alexander
 Zhu <alexlzhu@fb.com>
Subject: [PATCH v2 1/3] mm: add thp_utilization metrics to debugfs
Date: Wed, 12 Oct 2022 11:56:06 -0700
Message-ID: 
 <1546999e21a418c2510b3ed02b2b1f76b2b0f5b7.1665600372.git.alexlzhu@fb.com>
X-Mailer: git-send-email 2.30.2
In-Reply-To: <cover.1665600372.git.alexlzhu@fb.com>
References: <cover.1665600372.git.alexlzhu@fb.com>
MIME-Version: 1.0
List-ID: <linux-mm.kvack.org>

From: Alexander Zhu <alexlzhu@fb.com>

This change introduces a tool that scans through all of physical
memory for anonymous THPs and groups them into buckets based
on utilization. The results are exposed through debugfs at
/sys/kernel/debug/thp_utilization.

Sample Output:

Utilized[0-50]: 1331 680884
Utilized[51-101]: 9 3983
Utilized[102-152]: 3 1187
Utilized[153-203]: 0 0
Utilized[204-255]: 2 539
Utilized[256-306]: 5 1135
Utilized[307-357]: 1 192
Utilized[358-408]: 0 0
Utilized[409-459]: 1 57
Utilized[460-512]: 400 13
Last Scan Time: 223.98s
Last Scan Duration: 70.65s

Each bucket line shows the number of THPs in that bucket followed by
the total number of zero pages across those THPs. Here, 1331 THPs have
between 0 and 50 utilized (non-zero) pages, and together they contain
680884 zero pages. THPs in the [0-50] bucket make up 76% of all THPs
and account for 99% of all zero pages across all THPs. In other words,
the least utilized THPs are responsible for almost all of the memory
waste when THP is set to always. Similar results have been observed
across production workloads.
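
The two percentages can be re-derived from the sample output. The
following is a rough userspace check, not part of this patch; the
array is copied from the sample output and the helper name is made
up for illustration:

```c
#include <stddef.h>

/* One line of the sample output: THP count and zero-page count. */
struct bucket_sample {
	unsigned long nr_thps;
	unsigned long nr_zero_pages;
};

/* Values copied from the sample output, lowest bucket first. */
static const struct bucket_sample sample[] = {
	{ 1331, 680884 }, { 9, 3983 }, { 3, 1187 }, { 0, 0 }, { 2, 539 },
	{ 5, 1135 }, { 1, 192 }, { 0, 0 }, { 1, 57 }, { 400, 13 },
};

/*
 * Hypothetical helper: share of the lowest bucket, in percent rounded to
 * the nearest integer. Counts THPs when want_zero_pages is 0 and zero
 * pages otherwise.
 */
static unsigned long first_bucket_share(int want_zero_pages)
{
	unsigned long total = 0, first;
	size_t i;

	for (i = 0; i < sizeof(sample) / sizeof(sample[0]); i++)
		total += want_zero_pages ? sample[i].nr_zero_pages
					 : sample[i].nr_thps;
	first = want_zero_pages ? sample[0].nr_zero_pages
				: sample[0].nr_thps;
	/* Adding total / 2 before dividing rounds to the nearest percent. */
	return (first * 100 + total / 2) / total;
}
```

With the sample data the totals are 1752 THPs and 687990 zero pages,
which gives the 76% and 99% quoted above.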

The last two lines indicate the timestamp and duration of the most recent
scan through all of physical memory. Here we see that the last scan
occurred 223.98 seconds after boot time and took 70.65 seconds.

Utilization of a THP is defined as the fraction of its pages that
are non-zero. A delayed work item periodically scans a chunk of
physical memory for anonymous THPs and groups them into buckets
based on utilization. Once every zone has been scanned, the
results are published through debugfs under
/sys/kernel/debug/thp_utilization.
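
As an illustration of the definition above, here is a minimal
userspace sketch of the counting and bucketing steps. It is not the
kernel code: the SKETCH_* constants assume 4KB base pages and 512
base pages per THP, and all names are hypothetical.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed base page size */
#define SKETCH_THP_PAGES 512	/* assumed base pages per THP */
#define SKETCH_BUCKET_NR 10	/* same bucket count as the patch */

/* Return 1 if the page at addr contains only zero bytes, 0 otherwise. */
static int page_is_zero_filled(const unsigned char *addr)
{
	size_t i;

	for (i = 0; i < SKETCH_PAGE_SIZE; i++)
		if (addr[i])
			return 0;
	return 1;
}

/* Count base pages in a THP-sized buffer that hold any nonzero byte. */
static int count_utilized_pages(const unsigned char *thp)
{
	int i, utilized = 0;

	for (i = 0; i < SKETCH_THP_PAGES; i++)
		if (!page_is_zero_filled(thp + (size_t)i * SKETCH_PAGE_SIZE))
			utilized++;
	return utilized;
}

/* Map a utilized-page count to a bucket index, or -1 if out of range. */
static int utilization_bucket(int utilized)
{
	int bucket;

	if (utilized < 0 || utilized > SKETCH_THP_PAGES)
		return -1;
	bucket = utilized * SKETCH_BUCKET_NR / SKETCH_THP_PAGES;
	/* A fully utilized THP would compute SKETCH_BUCKET_NR; clamp it. */
	return bucket < SKETCH_BUCKET_NR - 1 ? bucket : SKETCH_BUCKET_NR - 1;
}
```

The clamp keeps a fully utilized THP in the last bucket instead of
indexing one past the end of the bucket array.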

Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
---
v1 to v2
-reversed ordering of is_transparent_hugepage and PageAnon in is_anon_transparent_hugepage, page->mapping is only meaningful for user pages

RFC to v1
-Refactored out the code to obtain the thp_utilization_bucket, as that now has to be used in multiple places.

 Documentation/admin-guide/mm/transhuge.rst |   9 +
 include/linux/huge_mm.h                    |   3 +
 mm/huge_memory.c                           | 202 +++++++++++++++++++++
 3 files changed, 214 insertions(+)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 8ee78ec232eb..21d86303c97e 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -304,6 +304,15 @@ To identify what applications are mapping file transparent huge pages, it
 is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
 for each mapping.
 
+The utilization of transparent hugepages can be viewed by reading
+``/sys/kernel/debug/thp_utilization``. The utilization of a THP is defined
+as the ratio of non-zero-filled 4KB pages to the total number of pages in a
+THP. There is one line per utilization bucket, labelled with the range of
+utilized 4KB pages it covers. Each line contains the total number of
+THPs in that bucket and the total number of zero-filled 4KB pages summed
+over all THPs in that bucket. The last two lines show the timestamp and
+duration, respectively, of the most recent scan over all of physical memory.
+
 Note that reading the smaps file is expensive and reading it
 frequently will incur overhead.
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a1341fdcf666..13ac7b2f29ae 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -178,6 +178,9 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
 unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags);
 
+int thp_number_utilized_pages(struct page *page);
+int thp_utilization_bucket(int num_utilized_pages);
+
 void prep_transhuge_page(struct page *page);
 void free_transhuge_page(struct page *page);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1cc4a5f4791e..29e97df37c29 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -46,6 +46,16 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/thp.h>
 
+/*
+ * The number of utilization buckets THPs will be grouped in
+ * under /sys/kernel/debug/thp_utilization.
+ */
+#define THP_UTIL_BUCKET_NR 10
+/*
+ * The number of PFNs (and hence hugepages) to scan through on each periodic
+ * run of the scanner that generates /sys/kernel/debug/thp_utilization.
+ */
+#define THP_UTIL_SCAN_SIZE 256
 /*
  * By default, transparent hugepage support is disabled in order to avoid
  * risking an increased memory footprint for applications that are not
@@ -71,6 +81,25 @@ static atomic_t huge_zero_refcount;
 struct page *huge_zero_page __read_mostly;
 unsigned long huge_zero_pfn __read_mostly = ~0UL;
 
+static void thp_utilization_workfn(struct work_struct *work);
+static DECLARE_DELAYED_WORK(thp_utilization_work, thp_utilization_workfn);
+
+struct thp_scan_info_bucket {
+	int nr_thps;
+	int nr_zero_pages;
+};
+
+struct thp_scan_info {
+	struct thp_scan_info_bucket buckets[THP_UTIL_BUCKET_NR];
+	struct zone *scan_zone;
+	struct timespec64 last_scan_duration;
+	struct timespec64 last_scan_time;
+	unsigned long pfn;
+};
+
+static struct thp_scan_info thp_scan_debugfs;
+static struct thp_scan_info thp_scan;
+
 bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
 			bool smaps, bool in_pf, bool enforce_sysfs)
 {
@@ -485,6 +514,7 @@ static int __init hugepage_init(void)
 	if (err)
 		goto err_slab;
 
+	schedule_delayed_work(&thp_utilization_work, HZ);
 	err = register_shrinker(&huge_zero_page_shrinker, "thp-zero");
 	if (err)
 		goto err_hzp_shrinker;
@@ -599,6 +629,11 @@ static inline bool is_transparent_hugepage(struct page *page)
 	       page[1].compound_dtor == TRANSHUGE_PAGE_DTOR;
 }
 
+static inline bool is_anon_transparent_hugepage(struct page *page)
+{
+	return is_transparent_hugepage(page) && PageAnon(page);
+}
+
 static unsigned long __thp_get_unmapped_area(struct file *filp,
 		unsigned long addr, unsigned long len,
 		loff_t off, unsigned long flags, unsigned long size)
@@ -649,6 +684,49 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 }
 EXPORT_SYMBOL_GPL(thp_get_unmapped_area);
 
+int thp_number_utilized_pages(struct page *page)
+{
+	struct folio *folio;
+	unsigned long page_offset, value;
+	int thp_nr_utilized_pages = HPAGE_PMD_NR;
+	int step_size = sizeof(unsigned long);
+	bool is_all_zeroes;
+	void *kaddr;
+	int i;
+
+	if (!page || !is_anon_transparent_hugepage(page))
+		return -1;
+
+	folio = page_folio(page);
+	for (i = 0; i < folio_nr_pages(folio); i++) {
+		kaddr = kmap_local_folio(folio, i);
+		is_all_zeroes = true;
+		for (page_offset = 0; page_offset < PAGE_SIZE; page_offset += step_size) {
+			value = *(unsigned long *)(kaddr + page_offset);
+			if (value != 0) {
+				is_all_zeroes = false;
+				break;
+			}
+		}
+		if (is_all_zeroes)
+			thp_nr_utilized_pages--;
+
+		kunmap_local(kaddr);
+	}
+	return thp_nr_utilized_pages;
+}
+
+int thp_utilization_bucket(int num_utilized_pages)
+{
+	int bucket;
+
+	if (num_utilized_pages < 0 || num_utilized_pages > HPAGE_PMD_NR)
+		return -1;
+	/* Group THPs into utilization buckets */
+	bucket = num_utilized_pages * THP_UTIL_BUCKET_NR / HPAGE_PMD_NR;
+	return min(bucket, THP_UTIL_BUCKET_NR - 1);
+}
+
 static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 			struct page *page, gfp_t gfp)
 {
@@ -3174,6 +3252,42 @@ static int __init split_huge_pages_debugfs(void)
 	return 0;
 }
 late_initcall(split_huge_pages_debugfs);
+
+static int thp_utilization_show(struct seq_file *seqf, void *pos)
+{
+	int i;
+	int start;
+	int end;
+
+	for (i = 0; i < THP_UTIL_BUCKET_NR; i++) {
+		start = i * HPAGE_PMD_NR / THP_UTIL_BUCKET_NR;
+		end = (i + 1 == THP_UTIL_BUCKET_NR)
+			   ? HPAGE_PMD_NR
+			   : ((i + 1) * HPAGE_PMD_NR / THP_UTIL_BUCKET_NR - 1);
+		/* The last bucket's range ends at HPAGE_PMD_NR, inclusive */
+		seq_printf(seqf, "Utilized[%d-%d]: %d %d\n", start, end,
+			   thp_scan_debugfs.buckets[i].nr_thps,
+			   thp_scan_debugfs.buckets[i].nr_zero_pages);
+	}
+	seq_printf(seqf, "Last Scan Time: %lu.%02lus\n",
+		   (unsigned long)thp_scan_debugfs.last_scan_time.tv_sec,
+		   (thp_scan_debugfs.last_scan_time.tv_nsec / (NSEC_PER_SEC / 100)));
+
+	seq_printf(seqf, "Last Scan Duration: %lu.%02lus\n",
+		   (unsigned long)thp_scan_debugfs.last_scan_duration.tv_sec,
+		   (thp_scan_debugfs.last_scan_duration.tv_nsec / (NSEC_PER_SEC / 100)));
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(thp_utilization);
+
+static int __init thp_utilization_debugfs(void)
+{
+	debugfs_create_file("thp_utilization", 0400, NULL, NULL,
+			    &thp_utilization_fops);
+	return 0;
+}
+late_initcall(thp_utilization_debugfs);
 #endif
 
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
@@ -3269,3 +3383,91 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	trace_remove_migration_pmd(address, pmd_val(pmde));
 }
 #endif
+
+static void thp_scan_next_zone(void)
+{
+	struct timespec64 current_time;
+	int i;
+	bool update_debugfs;
+	/*
+	 * THP utilization worker thread has reached the end
+	 * of the memory zone. Proceed to the next zone.
+	 */
+	thp_scan.scan_zone = next_zone(thp_scan.scan_zone);
+	update_debugfs = !thp_scan.scan_zone;
+	thp_scan.scan_zone = update_debugfs ? (first_online_pgdat())->node_zones
+			: thp_scan.scan_zone;
+	thp_scan.pfn = (thp_scan.scan_zone->zone_start_pfn + HPAGE_PMD_NR - 1)
+			& ~((unsigned long)HPAGE_PMD_NR - 1);
+	if (!update_debugfs)
+		return;
+	/*
+	 * The worker has scanned through all of physical memory, so update
+	 * the information displayed in /sys/kernel/debug/thp_utilization.
+	 */
+	ktime_get_ts64(&current_time);
+	thp_scan_debugfs.last_scan_duration = timespec64_sub(current_time,
+							     thp_scan_debugfs.last_scan_time);
+	thp_scan_debugfs.last_scan_time = current_time;
+
+	for (i = 0; i < THP_UTIL_BUCKET_NR; i++) {
+		thp_scan_debugfs.buckets[i].nr_thps = thp_scan.buckets[i].nr_thps;
+		thp_scan_debugfs.buckets[i].nr_zero_pages = thp_scan.buckets[i].nr_zero_pages;
+		thp_scan.buckets[i].nr_thps = 0;
+		thp_scan.buckets[i].nr_zero_pages = 0;
+	}
+}
+
+static void thp_util_scan(unsigned long pfn_end)
+{
+	struct page *page = NULL;
+	unsigned long current_pfn;
+	int bucket, num_utilized_pages, i;
+	/*
+	 * Scan through each memory zone in chunks of THP_UTIL_SCAN_SIZE
+	 * PFNs every second looking for anonymous THPs.
+	 */
+	for (i = 0; i < THP_UTIL_SCAN_SIZE; i++) {
+		current_pfn = thp_scan.pfn;
+		thp_scan.pfn += HPAGE_PMD_NR;
+		if (current_pfn >= pfn_end)
+			return;
+
+		if (!pfn_valid(current_pfn))
+			continue;
+
+		page = pfn_to_page(current_pfn);
+		num_utilized_pages = thp_number_utilized_pages(page);
+		bucket = thp_utilization_bucket(num_utilized_pages);
+		if (bucket < 0)
+			continue;
+
+		thp_scan.buckets[bucket].nr_thps++;
+		thp_scan.buckets[bucket].nr_zero_pages += (HPAGE_PMD_NR - num_utilized_pages);
+	}
+}
+
+static void thp_utilization_workfn(struct work_struct *work)
+{
+	unsigned long pfn_end;
+
+	if (!thp_scan.scan_zone)
+		thp_scan.scan_zone = (first_online_pgdat())->node_zones;
+	/*
+	 * Worker function that scans through all of physical memory
+	 * for anonymous THPs.
+	 */
+	pfn_end = (thp_scan.scan_zone->zone_start_pfn +
+			thp_scan.scan_zone->spanned_pages + HPAGE_PMD_NR - 1)
+			& ~((unsigned long)HPAGE_PMD_NR - 1);
+	/*
+	 * Move on to the next zone if we have reached the end of this zone
+	 * or of physical memory. Otherwise, scan the next PFNs in this zone.
+	 */
+	if (!populated_zone(thp_scan.scan_zone) || thp_scan.pfn >= pfn_end)
+		thp_scan_next_zone();
+	else
+		thp_util_scan(pfn_end);
+
+	schedule_delayed_work(&thp_utilization_work, HZ);
+}

From patchwork Wed Oct 12 18:56:07 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Alex Zhu (Kernel)" <alexlzhu@fb.com>
X-Patchwork-Id: 13005365
From: <alexlzhu@fb.com>
To: <linux-mm@kvack.org>, <kernel-team@fb.com>
CC: <willy@infradead.org>, <hannes@cmpxchg.org>, <riel@surriel.com>,
        Alexander
 Zhu <alexlzhu@fb.com>
Subject: [PATCH v2 2/3] mm: changes to split_huge_page() to free zero filled
 tail pages
Date: Wed, 12 Oct 2022 11:56:07 -0700
Message-ID: 
 <8d286a7029079c1da0b062fbb045317508409187.1665600372.git.alexlzhu@fb.com>
X-Mailer: git-send-email 2.30.2
In-Reply-To: <cover.1665600372.git.alexlzhu@fb.com>
References: <cover.1665600372.git.alexlzhu@fb.com>
List-ID: <linux-mm.kvack.org>

From: Alexander Zhu <alexlzhu@fb.com>

When /sys/kernel/mm/transparent_hugepage/enabled is set to always,
a large number of transparent hugepages are almost entirely zero
filled. This has been noted in a number of previous patchsets,
including:
https://lore.kernel.org/all/20210731063938.1391602-1-yuzhao@google.com/
https://lore.kernel.org/all/1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com/

Currently, split_huge_page() has no way to identify zero filled
pages within the THP, so these zero pages get remapped and continue to
waste memory. In this patch, we identify and free tail pages that
are zero filled in split_huge_page(). This avoids mapping these
pages back into page table entries and frees up unused memory within
THPs. This is based on the previously mentioned patchset by Yu Zhao.
However, we chose to free anonymous zero tail pages whenever they are
encountered instead of only on reclaim or migration.

We also add self tests to verify the RssAnon value to make sure zero
pages are not remapped except in the case of userfaultfd. In the case
of userfaultfd we remap to the shared zero page, similar to what is
done by KSM.
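
The per-subpage decision described above can be summarized with a
small userspace sketch. This is a simplified model, not the kernel
implementation: the enum, the function names and the mlocked and
uffd_armed flags are hypothetical stand-ins for the checks made while
restoring the page table entries.

```c
#include <stddef.h>

/* Hypothetical outcomes for one subpage after the THP has been split. */
enum subpage_action {
	SUBPAGE_REMAP,		/* holds data or is mlocked: map it back */
	SUBPAGE_MAP_ZERO_PAGE,	/* zero filled, userfaultfd armed */
	SUBPAGE_FREE,		/* zero filled: leave unmapped, free it */
};

/* Return 1 if buf[0..len) contains any nonzero byte, 0 otherwise. */
static int has_nonzero_byte(const unsigned char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (buf[i])
			return 1;
	return 0;
}

/* Decide what to do with one subpage of a split anonymous THP. */
static enum subpage_action classify_subpage(const unsigned char *page,
					    size_t page_size,
					    int mlocked, int uffd_armed)
{
	/* Locked pages are always mapped back, whatever their contents. */
	if (mlocked)
		return SUBPAGE_REMAP;
	if (has_nonzero_byte(page, page_size))
		return SUBPAGE_REMAP;
	/* With userfaultfd the mapping must stay, so use the zero page. */
	if (uffd_armed)
		return SUBPAGE_MAP_ZERO_PAGE;
	return SUBPAGE_FREE;
}
```

Only the last outcome reduces RssAnon by leaving the address
unmapped; the userfaultfd case keeps a mapping but points it at the
shared zero page.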

Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
---
v1 to v2
-Modified split_huge_page self test based off more recent changes.

RFC to v1
-Added support to map to the read only zero page when splitting a THP registered with userfaultfd. Also added a self test to verify that this is working.
-Only trigger the unmap_clean/zap in split_huge_page on anonymous THPs. We cannot zap zero pages for file THPs.

 include/linux/rmap.h                          |   2 +-
 include/linux/vm_event_item.h                 |   3 +
 mm/huge_memory.c                              |  44 ++++++-
 mm/migrate.c                                  |  72 +++++++++--
 mm/migrate_device.c                           |   4 +-
 mm/vmstat.c                                   |   3 +
 .../selftests/vm/split_huge_page_test.c       | 113 +++++++++++++++++-
 tools/testing/selftests/vm/vm_util.c          |  23 ++++
 tools/testing/selftests/vm/vm_util.h          |   3 +
 9 files changed, 252 insertions(+), 15 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bd3504d11b15..3f83bbcf1333 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -428,7 +428,7 @@ int folio_mkclean(struct folio *);
 int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
 		      struct vm_area_struct *vma);
 
-void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked);
+void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked, bool unmap_clean);
 
 int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma);
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 3518dba1e02f..3618b10ddec9 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -111,6 +111,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 		THP_SPLIT_PUD,
 #endif
+		THP_SPLIT_FREE,
+		THP_SPLIT_UNMAP,
+		THP_SPLIT_REMAP_READONLY_ZERO_PAGE,
 		THP_ZERO_PAGE_ALLOC,
 		THP_ZERO_PAGE_ALLOC_FAILED,
 		THP_SWPOUT,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 29e97df37c29..a08885228cb2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2451,7 +2451,7 @@ static void unmap_folio(struct folio *folio)
 		try_to_unmap(folio, ttu_flags | TTU_IGNORE_MLOCK);
 }
 
-static void remap_page(struct folio *folio, unsigned long nr)
+static void remap_page(struct folio *folio, unsigned long nr, bool unmap_clean)
 {
 	int i = 0;
 
@@ -2459,7 +2459,7 @@ static void remap_page(struct folio *folio, unsigned long nr)
 	if (!folio_test_anon(folio))
 		return;
 	for (;;) {
-		remove_migration_ptes(folio, folio, true);
+		remove_migration_ptes(folio, folio, true, unmap_clean);
 		i += folio_nr_pages(folio);
 		if (i >= nr)
 			break;
@@ -2574,6 +2574,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
 	unsigned int nr = thp_nr_pages(head);
+	LIST_HEAD(pages_to_free);
+	int nr_pages_to_free = 0;
 	int i;
 
 	/* complete memcg works before add pages to LRU */
@@ -2636,7 +2638,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 	local_irq_enable();
 
-	remap_page(folio, nr);
+	remap_page(folio, nr, PageAnon(head));
 
 	if (PageSwapCache(head)) {
 		swp_entry_t entry = { .val = page_private(head) };
@@ -2650,6 +2652,33 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 			continue;
 		unlock_page(subpage);
 
+		/*
+		 * If a tail page has only two references left, one inherited
+		 * from the isolation of its head and the other from
+		 * lru_add_page_tail() which we are about to drop, it means this
+		 * tail page was concurrently zapped. Then we can safely free it
+		 * and save page reclaim or migration the trouble of trying it.
+		 */
+		if (list && page_ref_freeze(subpage, 2)) {
+			VM_BUG_ON_PAGE(PageLRU(subpage), subpage);
+			VM_BUG_ON_PAGE(PageCompound(subpage), subpage);
+			VM_BUG_ON_PAGE(page_mapped(subpage), subpage);
+
+			ClearPageActive(subpage);
+			ClearPageUnevictable(subpage);
+			list_move(&subpage->lru, &pages_to_free);
+			nr_pages_to_free++;
+			continue;
+		}
+		/*
+		 * If a tail page has only one reference left, it will be freed
+		 * by the call to free_page_and_swap_cache below. Since zero
+		 * subpages are no longer remapped, there will only be one
+		 * reference left in cases outside of reclaim or migration.
+		 */
+		if (page_ref_count(subpage) == 1)
+			nr_pages_to_free++;
+
 		/*
 		 * Subpages may be freed if there wasn't any mapping
 		 * like if add_to_swap() is running on a lru page that
@@ -2659,6 +2688,13 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		 */
 		free_page_and_swap_cache(subpage);
 	}
+
+	if (!nr_pages_to_free)
+		return;
+
+	mem_cgroup_uncharge_list(&pages_to_free);
+	free_unref_page_list(&pages_to_free);
+	count_vm_events(THP_SPLIT_FREE, nr_pages_to_free);
 }
 
 /* Racy check whether the huge page can be split */
@@ -2830,7 +2866,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		if (mapping)
 			xas_unlock(&xas);
 		local_irq_enable();
-		remap_page(folio, folio_nr_pages(folio));
+		remap_page(folio, folio_nr_pages(folio), false);
 		ret = -EBUSY;
 	}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index c228afba0963..504ea5d7fd43 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -168,13 +168,62 @@ void putback_movable_pages(struct list_head *l)
 	}
 }
 
+static bool try_to_unmap_clean(struct page_vma_mapped_walk *pvmw, struct page *page)
+{
+	void *addr;
+	bool dirty;
+	pte_t newpte;
+
+	VM_BUG_ON_PAGE(PageCompound(page), page);
+	VM_BUG_ON_PAGE(!PageAnon(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
+
+	if (PageMlocked(page) || (pvmw->vma->vm_flags & VM_LOCKED))
+		return false;
+
+	/*
+	 * The pmd entry mapping the old thp was flushed and the pte mapping
+	 * this subpage is no longer present. Therefore, this subpage is
+	 * inaccessible. We don't need to remap it if it contains only zeros.
+	 */
+	addr = kmap_local_page(page);
+	dirty = memchr_inv(addr, 0, PAGE_SIZE);
+	kunmap_local(addr);
+
+	if (dirty)
+		return false;
+
+	pte_clear_not_present_full(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, false);
+
+	if (userfaultfd_armed(pvmw->vma)) {
+		newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
+					       pvmw->vma->vm_page_prot));
+		ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
+		set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
+		dec_mm_counter(pvmw->vma->vm_mm, MM_ANONPAGES);
+		count_vm_event(THP_SPLIT_REMAP_READONLY_ZERO_PAGE);
+		return true;
+	}
+
+	dec_mm_counter(pvmw->vma->vm_mm, mm_counter(page));
+	count_vm_event(THP_SPLIT_UNMAP);
+	return true;
+}
+
+struct rmap_walk_arg {
+	struct folio *folio;
+	bool unmap_clean;
+};
+
 /*
  * Restore a potential migration pte to a working pte entry
  */
 static bool remove_migration_pte(struct folio *folio,
-		struct vm_area_struct *vma, unsigned long addr, void *old)
+		struct vm_area_struct *vma, unsigned long addr, void *arg)
 {
-	DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
+	struct rmap_walk_arg *rmap_walk_arg = arg;
+	DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->folio, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
 
 	while (page_vma_mapped_walk(&pvmw)) {
 		rmap_t rmap_flags = RMAP_NONE;
@@ -197,6 +246,8 @@ static bool remove_migration_pte(struct folio *folio,
 			continue;
 		}
 #endif
+		if (rmap_walk_arg->unmap_clean && try_to_unmap_clean(&pvmw, new))
+			continue;
 
 		folio_get(folio);
 		pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
@@ -272,13 +323,20 @@ static bool remove_migration_pte(struct folio *folio,
  * Get rid of all migration entries and replace them by
  * references to the indicated page.
  */
-void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked)
+void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked, bool unmap_clean)
 {
+	struct rmap_walk_arg rmap_walk_arg = {
+		.folio = src,
+		.unmap_clean = unmap_clean,
+	};
+
 	struct rmap_walk_control rwc = {
 		.rmap_one = remove_migration_pte,
-		.arg = src,
+		.arg = &rmap_walk_arg,
 	};
 
+	VM_BUG_ON_FOLIO(unmap_clean && src != dst, src);
+
 	if (locked)
 		rmap_walk_locked(dst, &rwc);
 	else
@@ -866,7 +924,7 @@ static int writeout(struct address_space *mapping, struct folio *folio)
 	 * At this point we know that the migration attempt cannot
 	 * be successful.
 	 */
-	remove_migration_ptes(folio, folio, false);
+	remove_migration_ptes(folio, folio, false, false);
 
 	rc = mapping->a_ops->writepage(&folio->page, &wbc);
 
@@ -1122,7 +1180,7 @@ static int __unmap_and_move(struct folio *src, struct folio *dst,
 
 	if (page_was_mapped)
 		remove_migration_ptes(src,
-			rc == MIGRATEPAGE_SUCCESS ? dst : src, false);
+			rc == MIGRATEPAGE_SUCCESS ? dst : src, false, false);
 
 out_unlock_both:
 	folio_unlock(dst);
@@ -1332,7 +1390,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 
 	if (page_was_mapped)
 		remove_migration_ptes(src,
-			rc == MIGRATEPAGE_SUCCESS ? dst : src, false);
+			rc == MIGRATEPAGE_SUCCESS ? dst : src, false, false);
 
 unlock_put_anon:
 	folio_unlock(dst);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 5ab6ab9d2ed8..b5160c7ee229 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -426,7 +426,7 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
 			continue;
 
 		folio = page_folio(page);
-		remove_migration_ptes(folio, folio, false);
+		remove_migration_ptes(folio, folio, false, false);
 
 		migrate->src[i] = 0;
 		folio_unlock(folio);
@@ -802,7 +802,7 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
 
 		src = page_folio(page);
 		dst = page_folio(newpage);
-		remove_migration_ptes(src, dst, false);
+		remove_migration_ptes(src, dst, false, false);
 		folio_unlock(src);
 
 		if (is_zone_device_page(page))
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b2371d745e00..3d802eb6754d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1359,6 +1359,9 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 	"thp_split_pud",
 #endif
+	"thp_split_free",
+	"thp_split_unmap",
+	"thp_split_remap_readonly_zero_page",
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
 	"thp_swpout",
diff --git a/tools/testing/selftests/vm/split_huge_page_test.c b/tools/testing/selftests/vm/split_huge_page_test.c
index 76e1c36dd9e5..de227b88d43d 100644
--- a/tools/testing/selftests/vm/split_huge_page_test.c
+++ b/tools/testing/selftests/vm/split_huge_page_test.c
@@ -16,6 +16,9 @@
 #include <sys/mount.h>
 #include <malloc.h>
 #include <stdbool.h>
+#include <sys/syscall.h> /* Definition of SYS_* constants */
+#include <linux/userfaultfd.h>
+#include <sys/ioctl.h>
 #include "vm_util.h"
 
 uint64_t pagesize;
@@ -88,6 +91,113 @@ static void write_debugfs(const char *fmt, ...)
 	}
 }
 
+static char *allocate_zero_filled_hugepage(size_t len)
+{
+	char *result;
+	size_t i;
+
+	result = memalign(pmd_pagesize, len);
+	if (!result) {
+		printf("Failed to allocate memory\n");
+		exit(EXIT_FAILURE);
+	}
+	madvise(result, len, MADV_HUGEPAGE);
+
+	for (i = 0; i < len; i++)
+		result[i] = (char)0;
+
+	return result;
+}
+
+static void verify_rss_anon_split_huge_page_all_zeroes(char *one_page, int nr_hpages, size_t len)
+{
+	uint64_t rss_anon_before, rss_anon_after;
+	size_t i;
+
+	if (!check_huge_anon(one_page, nr_hpages, pmd_pagesize)) {
+		printf("No THP is allocated\n");
+		exit(EXIT_FAILURE);
+	}
+
+	rss_anon_before = rss_anon();
+	if (!rss_anon_before) {
+		printf("No RssAnon is allocated before split\n");
+		exit(EXIT_FAILURE);
+	}
+	/* split all THPs */
+	write_debugfs(PID_FMT, getpid(), (uint64_t)one_page,
+		      (uint64_t)one_page + len);
+
+	for (i = 0; i < len; i++)
+		if (one_page[i] != (char)0) {
+			printf("%zu byte corrupted\n", i);
+			exit(EXIT_FAILURE);
+		}
+
+	if (!check_huge_anon(one_page, 0, pmd_pagesize)) {
+		printf("Still AnonHugePages not split\n");
+		exit(EXIT_FAILURE);
+	}
+
+	rss_anon_after = rss_anon();
+	if (rss_anon_after >= rss_anon_before) {
+		printf("Incorrect RssAnon value. Before: %ld After: %ld\n",
+		       rss_anon_before, rss_anon_after);
+		exit(EXIT_FAILURE);
+	}
+}
+
+void split_pmd_zero_pages(void)
+{
+	char *one_page;
+	int nr_hpages = 4;
+	size_t len = nr_hpages * pmd_pagesize;
+
+	one_page = allocate_zero_filled_hugepage(len);
+	verify_rss_anon_split_huge_page_all_zeroes(one_page, nr_hpages, len);
+	printf("Split zero filled huge pages successful\n");
+	free(one_page);
+}
+
+void split_pmd_zero_pages_uffd(void)
+{
+	char *one_page;
+	int nr_hpages = 4;
+	size_t len = nr_hpages * pmd_pagesize;
+	long uffd; /* userfaultfd file descriptor */
+	struct uffdio_api uffdio_api;
+	struct uffdio_register uffdio_register;
+
+	/* Create and enable userfaultfd object. */
+
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+	if (uffd == -1) {
+		perror("userfaultfd");
+		exit(1);
+	}
+
+	uffdio_api.api = UFFD_API;
+	uffdio_api.features = 0;
+	if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
+		perror("ioctl-UFFDIO_API");
+		exit(1);
+	}
+
+	one_page = allocate_zero_filled_hugepage(len);
+
+	uffdio_register.range.start = (unsigned long)one_page;
+	uffdio_register.range.len = len;
+	uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
+	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) {
+		perror("ioctl-UFFDIO_REGISTER");
+		exit(1);
+	}
+
+	verify_rss_anon_split_huge_page_all_zeroes(one_page, nr_hpages, len);
+	printf("Split zero filled huge pages with uffd successful\n");
+	free(one_page);
+}
+
 void split_pmd_thp(void)
 {
 	char *one_page;
@@ -121,7 +231,6 @@ void split_pmd_thp(void)
 			exit(EXIT_FAILURE);
 		}
 
-
 	if (check_huge_anon(one_page, 0, pmd_pagesize)) {
 		printf("Still AnonHugePages not split\n");
 		exit(EXIT_FAILURE);
@@ -301,6 +410,8 @@ int main(int argc, char **argv)
 	pageshift = ffs(pagesize) - 1;
 	pmd_pagesize = read_pmd_pagesize();
 
+	split_pmd_zero_pages();
+	split_pmd_zero_pages_uffd();
 	split_pmd_thp();
 	split_pte_mapped_thp();
 	split_file_backed_thp();
diff --git a/tools/testing/selftests/vm/vm_util.c b/tools/testing/selftests/vm/vm_util.c
index f11f8adda521..72f3edc64aaf 100644
--- a/tools/testing/selftests/vm/vm_util.c
+++ b/tools/testing/selftests/vm/vm_util.c
@@ -6,6 +6,7 @@
 
 #define PMD_SIZE_FILE_PATH "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
 #define SMAP_FILE_PATH "/proc/self/smaps"
+#define STATUS_FILE_PATH "/proc/self/status"
 #define MAX_LINE_LENGTH 500
 
 uint64_t pagemap_get_entry(int fd, char *start)
@@ -72,6 +73,28 @@ uint64_t read_pmd_pagesize(void)
 	return strtoul(buf, NULL, 10);
 }
 
+uint64_t rss_anon(void)
+{
+	uint64_t rss_anon = 0;
+	int ret;
+	FILE *fp;
+	char buffer[MAX_LINE_LENGTH];
+
+	fp = fopen(STATUS_FILE_PATH, "r");
+	if (!fp)
+		ksft_exit_fail_msg("%s: Failed to open file %s\n", __func__, STATUS_FILE_PATH);
+
+	if (!check_for_pattern(fp, "RssAnon:", buffer, sizeof(buffer)))
+		goto err_out;
+
+	if (sscanf(buffer, "RssAnon:%10ld kB", &rss_anon) != 1)
+		ksft_exit_fail_msg("Reading status error\n");
+
+err_out:
+	fclose(fp);
+	return rss_anon;
+}
+
 bool __check_huge(void *addr, char *pattern, int nr_hpages,
 		  uint64_t hpage_size)
 {
diff --git a/tools/testing/selftests/vm/vm_util.h b/tools/testing/selftests/vm/vm_util.h
index 5c35de454e08..dd1885f66097 100644
--- a/tools/testing/selftests/vm/vm_util.h
+++ b/tools/testing/selftests/vm/vm_util.h
@@ -1,12 +1,15 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #include <stdint.h>
 #include <stdbool.h>
+#include <stddef.h>
+#include <stdio.h>
 
 uint64_t pagemap_get_entry(int fd, char *start);
 bool pagemap_is_softdirty(int fd, char *start);
 void clear_softdirty(void);
 bool check_for_pattern(FILE *fp, const char *pattern, char *buf, size_t len);
 uint64_t read_pmd_pagesize(void);
+uint64_t rss_anon(void);
 bool check_huge_anon(void *addr, int nr_hpages, uint64_t hpage_size);
 bool check_huge_file(void *addr, int nr_hpages, uint64_t hpage_size);
 bool check_huge_shmem(void *addr, int nr_hpages, uint64_t hpage_size);

From patchwork Wed Oct 12 18:56:08 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Alex Zhu (Kernel)" <alexlzhu@fb.com>
X-Patchwork-Id: 13005363
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id CB3B1C43219
	for <linux-mm@archiver.kernel.org>; Wed, 12 Oct 2022 18:56:21 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 2906B6B0073; Wed, 12 Oct 2022 14:56:21 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 264B66B0074; Wed, 12 Oct 2022 14:56:21 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 0B733900002; Wed, 12 Oct 2022 14:56:21 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com
 [216.40.44.11])
	by kanga.kvack.org (Postfix) with ESMTP id E9BC16B0073
	for <linux-mm@kvack.org>; Wed, 12 Oct 2022 14:56:20 -0400 (EDT)
Received: from smtpin17.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay10.hostedemail.com (Postfix) with ESMTP id ADA62C09E9
	for <linux-mm@kvack.org>; Wed, 12 Oct 2022 18:56:20 +0000 (UTC)
X-FDA: 80013202920.17.02C7DA6
Received: from mx0b-00082601.pphosted.com (mx0b-00082601.pphosted.com
 [67.231.153.30])
	by imf09.hostedemail.com (Postfix) with ESMTP id 2FCC014002B
	for <linux-mm@kvack.org>; Wed, 12 Oct 2022 18:56:19 +0000 (UTC)
Received: from pps.filterd (m0109331.ppops.net [127.0.0.1])
	by mx0a-00082601.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id
 29CFHkSp004747
	for <linux-mm@kvack.org>; Wed, 12 Oct 2022 11:56:19 -0700
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com;
 h=from : to : cc : subject
 : date : message-id : in-reply-to : references : mime-version :
 content-transfer-encoding : content-type; s=facebook;
 bh=Z/B9VD22N0jqAGhGDQdHHHIexn30BNrn7Vionx/8mWY=;
 b=Uky5ydKguqqUzx9Ezf2AR8MVGAHvLVUAR9emx3ZstgC+EdIxZ80VTEis6m+rx6J4ZqyD
 4+BN99pVjrDvp4nEVp7XjCdjYxdIUixn175NT+0xIe83cYpZttv0JqaDEoE39NadlVtH
 WUibtPL1pHnDrfVIOSmzEWFY+Z63r2b3Hts=
Received: from maileast.thefacebook.com ([163.114.130.16])
	by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 3k5q82ejny-2
	(version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT)
	for <linux-mm@kvack.org>; Wed, 12 Oct 2022 11:56:19 -0700
Received: from twshared25017.14.frc2.facebook.com (2620:10d:c0a8:1b::d) by
 mail.thefacebook.com (2620:10d:c0a8:82::f) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id
 15.1.2375.31; Wed, 12 Oct 2022 11:56:18 -0700
Received: by devvm6390.atn0.facebook.com (Postfix, from userid 352741)
	id 163154E25130; Wed, 12 Oct 2022 11:56:12 -0700 (PDT)
From: <alexlzhu@fb.com>
To: <linux-mm@kvack.org>, <kernel-team@fb.com>
CC: <willy@infradead.org>, <hannes@cmpxchg.org>, <riel@surriel.com>,
        Alexander
 Zhu <alexlzhu@fb.com>
Subject: [PATCH v2 3/3] mm: THP low utilization shrinker
Date: Wed, 12 Oct 2022 11:56:08 -0700
Message-ID: 
 <c72b1c27d5722b0f5f77eb69beaa9451e447bb4b.1665600372.git.alexlzhu@fb.com>
X-Mailer: git-send-email 2.30.2
In-Reply-To: <cover.1665600372.git.alexlzhu@fb.com>
References: <cover.1665600372.git.alexlzhu@fb.com>
MIME-Version: 1.0
X-FB-Internal: Safe
X-Proofpoint-ORIG-GUID: grYHJoGQ7-l7KxeNSwJvc3hKyofb0uxe
X-Proofpoint-GUID: grYHJoGQ7-l7KxeNSwJvc3hKyofb0uxe
X-Proofpoint-Virus-Version: vendor=baseguard
 engine=ICAP:2.0.205,Aquarius:18.0.895,Hydra:6.0.545,FMLib:17.11.122.1
 definitions=2022-10-12_09,2022-10-12_01,2022-06-22_01
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1665600980; a=rsa-sha256;
	cv=none;
	b=RTnxjlVMeKOVNAqIegDPXWNQEitcsQQtkVe4RzQQchcLED2BmNSXQaaF0jdxwFHh5w6WEN
	9XQuh0pu4B05S/dOwhS4f+yB4M1+yZg1ciGU70stTPk7NMpjrBckYg3IGxPNtdmtuw09PF
	IuA2NS6uaiU0MJUChQK3gqedQOARiBc=
ARC-Authentication-Results: i=1;
	imf09.hostedemail.com;
	dkim=pass header.d=fb.com header.s=facebook header.b=Uky5ydKg;
	spf=pass (imf09.hostedemail.com: domain of
 "prvs=12844fa265=alexlzhu@meta.com" designates 67.231.153.30 as permitted
 sender) smtp.mailfrom="prvs=12844fa265=alexlzhu@meta.com";
	dmarc=pass (policy=reject) header.from=fb.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed;
 d=hostedemail.com;
	s=arc-20220608; t=1665600980;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=Z/B9VD22N0jqAGhGDQdHHHIexn30BNrn7Vionx/8mWY=;
	b=rjgDL7vrMFO1EJde4zxUDXA+0WwpObShdxbG0Q4JaGGjJLWPYesHAfh/F2necxmyI2VZ8b
	O7jqPeBHf6ROl++PhFsIdMeQ3lrjd3Ee0vkaVp+uMObd4je2onROB+ERzxZCRMLvSMeF9p
	zbHH9SDpxV2PILyySTAH2zBX8+8SJ4c=
X-Rspam-User: 
X-Stat-Signature: mryho1ras6h3wobwqho7bx6rjuc4i1ba
X-Rspamd-Server: rspam01
X-Rspamd-Queue-Id: 2FCC014002B
Authentication-Results: imf09.hostedemail.com;
	dkim=pass header.d=fb.com header.s=facebook header.b=Uky5ydKg;
	spf=pass (imf09.hostedemail.com: domain of
 "prvs=12844fa265=alexlzhu@meta.com" designates 67.231.153.30 as permitted
 sender) smtp.mailfrom="prvs=12844fa265=alexlzhu@meta.com";
	dmarc=pass (policy=reject) header.from=fb.com
X-HE-Tag: 1665600979-204640
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

From: Alexander Zhu <alexlzhu@fb.com>

This patch introduces a shrinker that splits underutilized THPs. As
noted earlier in this series, we have observed that almost all of the
memory waste when THPs are always enabled is contained in the lowest
utilization bucket. THPs below the top utilization bucket are added
to a list_lru, and the shrinker splits the anonymous THPs on that
list based on information from kswapd. It requires the changes from
thp_utilization to identify the least utilized THPs, and the
changes to split_huge_page to identify and free zero pages
within THPs.

Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
---
v1 to v2
- Changed lru_lock to be irq safe. Added irq save and restore around
  list_lru adds/deletes.
- Changed low_util_free_page() to trylock the page and, if that fails,
  unlock lru_lock and return LRU_SKIP. This avoids a deadlock between
  reclaim, which calls split_huge_page(), and the THP shrinker.
- Changed low_util_free_page() to unlock lru_lock, call
  split_huge_page(), then retake lru_lock, so that split_huge_page() is
  never called with lru_lock held. Holding it leads to deadlock because
  split_huge_page() calls on_each_cpu_mask().
- Changed list_lru_shrink_walk() to list_lru_shrink_walk_irq().
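
For readers unfamiliar with the locking pattern described in the two
low_util_free_page() items above, here is a minimal user-space sketch.
It is not the kernel code: pthread mutexes stand in for lru_lock and
the page lock, and every sketch_* name is hypothetical. The sketch
keeps the list lock held on the skip path; that is a convention of the
sketch only, chosen so that the caller always gets the list lock back.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for LRU_SKIP and LRU_REMOVED_RETRY. */
enum sketch_status { SKETCH_SKIP, SKETCH_REMOVED_RETRY };

struct sketch_item {
	pthread_mutex_t lock;	/* stands in for the page lock */
	bool on_list;		/* still queued for the shrinker? */
	bool split;		/* has the expensive work been done? */
};

/*
 * Called with list_lock held and returns with list_lock held.
 * The item lock is only tried, never waited for, so a thread that
 * already holds the item lock and wants list_lock cannot deadlock
 * against us.
 */
static enum sketch_status sketch_isolate(struct sketch_item *item,
					 pthread_mutex_t *list_lock)
{
	if (pthread_mutex_trylock(&item->lock) != 0)
		return SKETCH_SKIP;	/* busy elsewhere, try again later */

	item->on_list = false;		/* isolate while list_lock is held */

	/* Drop the list lock around the expensive operation. */
	pthread_mutex_unlock(list_lock);
	item->split = true;		/* placeholder for the split itself */
	pthread_mutex_lock(list_lock);

	pthread_mutex_unlock(&item->lock);
	return SKETCH_REMOVED_RETRY;
}
```

A caller would take list_lock, invoke sketch_isolate() for each queued
item, and restart its list walk whenever SKETCH_REMOVED_RETRY is
returned, since the list may have changed while the lock was dropped.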

RFC to v1
- Remove all THPs that are not in the top utilization bucket. This is
  what we have found to perform best in production testing; we have
  found that there are an almost trivial number of THPs in the middle
  range of buckets that account for most of the memory waste.
- Added a check for THP utilization prior to split_huge_page() in the
  THP shrinker. This accounts for THPs that were underutilized when
  they were added to the list_lru but have since moved to the top
  bucket.
- Multiply the shrink_count and scan_count by HPAGE_PMD_NR. A THP is
  512 pages and should count as 512 objects in reclaim. This way
  reclaim is triggered at a more appropriate frequency than in the RFC.
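
The bucket check and the object-count scaling from the notes above can
be illustrated with a small user-space sketch. The bucket mapping below
is an assumption made for illustration: thp_utilization_bucket() comes
from an earlier patch in this series and is not reproduced here, and
the bucket count of 10 is a guess, not a value taken from this patch.
All sketch_* names are hypothetical.

```c
#include <stdbool.h>

#define SKETCH_HPAGE_PMD_NR	512	/* base pages per THP, per the notes above */
#define SKETCH_BUCKET_NR	10	/* assumed number of utilization buckets */

/* Map a utilized-page count (0..512) to a bucket index, or -1 if invalid. */
static int sketch_bucket(int utilized)
{
	int bucket;

	if (utilized < 0 || utilized > SKETCH_HPAGE_PMD_NR)
		return -1;

	bucket = utilized * SKETCH_BUCKET_NR / SKETCH_HPAGE_PMD_NR;
	/* A fully utilized THP lands in the top bucket. */
	return bucket < SKETCH_BUCKET_NR ? bucket : SKETCH_BUCKET_NR - 1;
}

/* Everything below the top bucket is a candidate for splitting. */
static bool sketch_should_split(int utilized)
{
	int bucket = sketch_bucket(utilized);

	return bucket >= 0 && bucket < SKETCH_BUCKET_NR - 1;
}

/* Each queued THP is reported to reclaim as HPAGE_PMD_NR objects. */
static unsigned long sketch_object_count(unsigned long nr_thps)
{
	return nr_thps * SKETCH_HPAGE_PMD_NR;
}
```

Under these assumptions a THP with 460 utilized pages falls in bucket 8
and would be queued, while one with 461 or more falls in bucket 9 and
would be left alone. Three queued THPs are reported as 1536 objects.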

 include/linux/huge_mm.h  |   7 +++
 include/linux/list_lru.h |  24 +++++++++
 include/linux/mm_types.h |   5 ++
 mm/huge_memory.c         | 113 ++++++++++++++++++++++++++++++++++++++-
 mm/list_lru.c            |  51 ++++++++++++++++++
 mm/page_alloc.c          |   6 +++
 6 files changed, 204 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 13ac7b2f29ae..75e4080256be 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -192,6 +192,8 @@ static inline int split_huge_page(struct page *page)
 }
 void deferred_split_huge_page(struct page *page);
 
+void add_underutilized_thp(struct page *page);
+
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address, bool freeze, struct folio *folio);
 
@@ -305,6 +307,11 @@ static inline struct list_head *page_deferred_list(struct page *page)
 	return &page[2].deferred_list;
 }
 
+static inline struct list_head *page_underutilized_thp_list(struct page *page)
+{
+	return &page[3].underutilized_thp_list;
+}
+
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index b35968ee9fb5..c2cf146ea880 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -89,6 +89,18 @@ void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren
  */
 bool list_lru_add(struct list_lru *lru, struct list_head *item);
 
+/**
+ * list_lru_add_page: add an element to the lru list's tail
+ * @list_lru: the lru pointer
+ * @page: the page containing the item
+ * @item: the item to be added.
+ *
+ * This function works the same as list_lru_add in terms of list
+ * manipulation. Used for non slab objects contained in the page.
+ *
+ * Return value: true if the list was updated, false otherwise
+ */
+bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item);
 /**
  * list_lru_del: delete an element to the lru list
  * @list_lru: the lru pointer
@@ -102,6 +114,18 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item);
  */
 bool list_lru_del(struct list_lru *lru, struct list_head *item);
 
+/**
+ * list_lru_del_page: delete an element from the lru list
+ * @list_lru: the lru pointer
+ * @page: the page containing the item
+ * @item: the item to be deleted.
+ *
+ * This function works the same as list_lru_del in terms of list
+ * manipulation. Used for non slab objects contained in the page.
+ *
+ * Return value: true if the list was updated, false otherwise
+ */
+bool list_lru_del_page(struct list_lru *lru, struct page *page, struct list_head *item);
 /**
  * list_lru_count_one: return the number of objects currently held by @lru
  * @lru: the lru pointer.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..da1d1cf42158 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -152,6 +152,11 @@ struct page {
 			/* For both global and memcg */
 			struct list_head deferred_list;
 		};
+		struct { /* Third tail page of compound page */
+			unsigned long _compound_pad_3; /* compound_head */
+			unsigned long _compound_pad_4;
+			struct list_head underutilized_thp_list;
+		};
 		struct {	/* Page table pages */
 			unsigned long _pt_pad_1;	/* compound_head */
 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a08885228cb2..8edefa7d91c8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -81,6 +81,8 @@ static atomic_t huge_zero_refcount;
 struct page *huge_zero_page __read_mostly;
 unsigned long huge_zero_pfn __read_mostly = ~0UL;
 
+static struct list_lru huge_low_util_page_lru;
+
 static void thp_utilization_workfn(struct work_struct *work);
 static DECLARE_DELAYED_WORK(thp_utilization_work, thp_utilization_workfn);
 
@@ -263,6 +265,57 @@ static struct shrinker huge_zero_page_shrinker = {
 	.seeks = DEFAULT_SEEKS,
 };
 
+static enum lru_status low_util_free_page(struct list_head *item,
+					  struct list_lru_one *lru,
+					  spinlock_t *lock,
+					  void *cb_arg)
+{
+	int bucket, num_utilized_pages;
+	struct page *head = compound_head(list_entry(item,
+									struct page,
+									underutilized_thp_list));
+
+	if (get_page_unless_zero(head)) {
+		if (!trylock_page(head)) {
+			spin_unlock_irq(lock);
+			return LRU_SKIP;
+		}
+		list_lru_isolate(lru, item);
+		num_utilized_pages = thp_number_utilized_pages(head);
+		bucket = thp_utilization_bucket(num_utilized_pages);
+		if (bucket < THP_UTIL_BUCKET_NR - 1) {
+			spin_unlock_irq(lock);
+			split_huge_page(head);
+			spin_lock_irq(lock);
+		}
+		unlock_page(head);
+		put_page(head);
+	}
+
+	return LRU_REMOVED_RETRY;
+}
+
+static unsigned long shrink_huge_low_util_page_count(struct shrinker *shrink,
+						     struct shrink_control *sc)
+{
+	return HPAGE_PMD_NR * list_lru_shrink_count(&huge_low_util_page_lru, sc);
+}
+
+static unsigned long shrink_huge_low_util_page_scan(struct shrinker *shrink,
+						    struct shrink_control *sc)
+{
+	return HPAGE_PMD_NR * list_lru_shrink_walk_irq(&huge_low_util_page_lru,
+							sc, low_util_free_page, NULL);
+}
+
+static struct shrinker huge_low_util_page_shrinker = {
+	.count_objects = shrink_huge_low_util_page_count,
+	.scan_objects = shrink_huge_low_util_page_scan,
+	.seeks = DEFAULT_SEEKS,
+	.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE |
+		SHRINKER_NONSLAB,
+};
+
 #ifdef CONFIG_SYSFS
 static ssize_t enabled_show(struct kobject *kobj,
 			    struct kobj_attribute *attr, char *buf)
@@ -515,6 +568,9 @@ static int __init hugepage_init(void)
 		goto err_slab;
 
 	schedule_delayed_work(&thp_utilization_work, HZ);
+	err = register_shrinker(&huge_low_util_page_shrinker, "thp-low-util");
+	if (err)
+		goto err_low_util_shrinker;
 	err = register_shrinker(&huge_zero_page_shrinker, "thp-zero");
 	if (err)
 		goto err_hzp_shrinker;
@@ -522,6 +578,9 @@ static int __init hugepage_init(void)
 	if (err)
 		goto err_split_shrinker;
 
+	err = list_lru_init_memcg(&huge_low_util_page_lru, &huge_low_util_page_shrinker);
+	if (err)
+		goto err_low_util_list_lru;
 	/*
 	 * By default disable transparent hugepages on smaller systems,
 	 * where the extra memory used could hurt more than TLB overhead
@@ -538,10 +597,14 @@ static int __init hugepage_init(void)
 
 	return 0;
 err_khugepaged:
+	list_lru_destroy(&huge_low_util_page_lru);
+err_low_util_list_lru:
 	unregister_shrinker(&deferred_split_shrinker);
 err_split_shrinker:
 	unregister_shrinker(&huge_zero_page_shrinker);
 err_hzp_shrinker:
+	unregister_shrinker(&huge_low_util_page_shrinker);
+err_low_util_shrinker:
 	khugepaged_destroy();
 err_slab:
 	hugepage_exit_sysfs(hugepage_kobj);
@@ -616,6 +679,7 @@ void prep_transhuge_page(struct page *page)
 	 */
 
 	INIT_LIST_HEAD(page_deferred_list(page));
+	INIT_LIST_HEAD(page_underutilized_thp_list(page));
 	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
 }
 
@@ -2529,8 +2593,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 			 LRU_GEN_MASK | LRU_REFS_MASK));
 
 	/* ->mapping in first tail page is compound_mapcount */
-	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
-			page_tail);
+	VM_BUG_ON_PAGE(tail > 3 && page_tail->mapping != TAIL_MAPPING, page_tail);
 	page_tail->mapping = head->mapping;
 	page_tail->index = head->index + tail;
 	page_tail->private = 0;
@@ -2737,6 +2800,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	struct folio *folio = page_folio(page);
 	struct deferred_split *ds_queue = get_deferred_split_queue(&folio->page);
 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
+	struct list_head *underutilized_thp_list = page_underutilized_thp_list(&folio->page);
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
 	int extra_pins, ret;
@@ -2844,6 +2908,9 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			list_del(page_deferred_list(&folio->page));
 		}
 		spin_unlock(&ds_queue->split_queue_lock);
+		if (!list_empty(underutilized_thp_list))
+			list_lru_del_page(&huge_low_util_page_lru, &folio->page,
+					  underutilized_thp_list);
 		if (mapping) {
 			int nr = folio_nr_pages(folio);
 
@@ -2886,6 +2953,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 void free_transhuge_page(struct page *page)
 {
 	struct deferred_split *ds_queue = get_deferred_split_queue(page);
+	struct list_head *underutilized_thp_list = page_underutilized_thp_list(page);
 	unsigned long flags;
 
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
@@ -2894,6 +2962,12 @@ void free_transhuge_page(struct page *page)
 		list_del(page_deferred_list(page));
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+	if (!list_empty(underutilized_thp_list))
+		list_lru_del_page(&huge_low_util_page_lru, page, underutilized_thp_list);
+
+	if (PageLRU(page))
+		__folio_clear_lru_flags(page_folio(page));
+
 	free_compound_page(page);
 }
 
@@ -2934,6 +3008,38 @@ void deferred_split_huge_page(struct page *page)
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 }
 
+void add_underutilized_thp(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+
+	if (PageSwapCache(page))
+		return;
+
+	/*
+	 * Need to take a reference on the page to prevent the page from being freed
+	 * from under us while we are adding the THP to the shrinker.
+	 */
+	if (!get_page_unless_zero(page))
+		return;
+
+	lock_page(page);
+
+	if (!is_anon_transparent_hugepage(page))
+		goto out;
+
+	if (is_huge_zero_page(page))
+		goto out;
+
+	if (memcg_list_lru_alloc(page_memcg(page), &huge_low_util_page_lru, GFP_KERNEL))
+		goto out;
+
+	list_lru_add_page(&huge_low_util_page_lru, page, page_underutilized_thp_list(page));
+
+out:
+	unlock_page(page);
+	put_page(page);
+}
+
 static unsigned long deferred_split_count(struct shrinker *shrink,
 		struct shrink_control *sc)
 {
@@ -3478,6 +3584,9 @@ static void thp_util_scan(unsigned long pfn_end)
 		if (bucket < 0)
 			continue;
 
+		if (bucket < THP_UTIL_BUCKET_NR - 1)
+			add_underutilized_thp(page);
+
 		thp_scan.buckets[bucket].nr_thps++;
 		thp_scan.buckets[bucket].nr_zero_pages += (HPAGE_PMD_NR - num_utilized_pages);
 	}
diff --git a/mm/list_lru.c b/mm/list_lru.c
index a05e5bef3b40..273b267e2e55 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -140,6 +140,33 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
 }
 EXPORT_SYMBOL_GPL(list_lru_add);
 
+bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item)
+{
+	int nid = page_to_nid(page);
+	struct list_lru_node *nlru = &lru->node[nid];
+	struct list_lru_one *l;
+	struct mem_cgroup *memcg;
+	unsigned long flags;
+
+	spin_lock_irqsave(&nlru->lock, flags);
+	if (list_empty(item)) {
+		memcg = page_memcg(page);
+		memcg_list_lru_alloc(memcg, lru, GFP_KERNEL);
+		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
+		list_add_tail(item, &l->list);
+		/* Set shrinker bit if the first element was added */
+		if (!l->nr_items++)
+			set_shrinker_bit(memcg, nid,
+					 lru_shrinker_id(lru));
+		nlru->nr_items++;
+		spin_unlock_irqrestore(&nlru->lock, flags);
+		return true;
+	}
+	spin_unlock_irqrestore(&nlru->lock, flags);
+	return false;
+}
+EXPORT_SYMBOL_GPL(list_lru_add_page);
+
 bool list_lru_del(struct list_lru *lru, struct list_head *item)
 {
 	int nid = page_to_nid(virt_to_page(item));
@@ -160,6 +187,30 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
 }
 EXPORT_SYMBOL_GPL(list_lru_del);
 
+bool list_lru_del_page(struct list_lru *lru, struct page *page, struct list_head *item)
+{
+	int nid = page_to_nid(page);
+	struct list_lru_node *nlru = &lru->node[nid];
+	struct list_lru_one *l;
+	struct mem_cgroup *memcg;
+	unsigned long flags;
+
+	spin_lock_irqsave(&nlru->lock, flags);
+	if (!list_empty(item)) {
+		memcg = page_memcg(page);
+		memcg_list_lru_alloc(memcg, lru, GFP_KERNEL);
+		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
+		list_del_init(item);
+		l->nr_items--;
+		nlru->nr_items--;
+		spin_unlock_irqrestore(&nlru->lock, flags);
+		return true;
+	}
+	spin_unlock_irqrestore(&nlru->lock, flags);
+	return false;
+}
+EXPORT_SYMBOL_GPL(list_lru_del_page);
+
 void list_lru_isolate(struct list_lru_one *list, struct list_head *item)
 {
 	list_del_init(item);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ac2c9f12a7b2..468eaaade7fe 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1335,6 +1335,12 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
 		 * deferred_list.next -- ignore value.
 		 */
 		break;
+	case 3:
+		/*
+		 * the third tail page: ->mapping is
+		 * underutilized_thp_list.next -- ignore value.
+		 */
+		break;
 	default:
 		if (page->mapping != TAIL_MAPPING) {
 			bad_page(page, "corrupted mapping in tail page");