From patchwork Wed Oct 19 03:42:18 2022
X-Patchwork-Submitter: "Alex Zhu (Kernel)"
X-Patchwork-Id: 13011334
Subject: [PATCH v4 1/3] mm: add thp_utilization metrics to debugfs
Date: Tue, 18 Oct 2022 20:42:18 -0700
Message-ID: <6ee57dc7d2eadaee2816ef19fdbdc54ab356d5fb.1666150565.git.alexlzhu@fb.com>

From: Alexander Zhu

This change introduces a tool that periodically scans through all of
physical memory for anonymous THPs and groups them into buckets based
on utilization. It also adds an interface under
/sys/kernel/debug/thp_utilization.

Sample Output:

Utilized[0-50]: 1331 680884
Utilized[51-101]: 9 3983
Utilized[102-152]: 3 1187
Utilized[153-203]: 0 0
Utilized[204-255]: 2 539
Utilized[256-306]: 5 1135
Utilized[307-357]: 1 192
Utilized[358-408]: 0 0
Utilized[409-459]: 1 57
Utilized[460-512]: 400 13
Last Scan Time: 223.98s
Last Scan Duration: 70.65s

This indicates that there are 1331 THPs that have between 0 and 50
utilized (non-zero) pages, and that in total there are 680884 zero pages
in this utilization bucket. THPs in the [0-50] bucket compose 76% of total
THPs and are responsible for 99% of total zero pages across all THPs. In
other words, the least utilized THPs are responsible for almost all of the
memory waste when THP is always enabled. Similar results have been
observed across production workloads.

The last two lines show the timestamp and duration of the most recent scan
through all of physical memory. Here the last scan completed 223.98
seconds after boot and took 70.65 seconds.

Utilization of a THP is defined as the percentage of non-zero pages in the
THP. A worker thread periodically scans through all of physical memory,
obtains the utilization of every anonymous THP it finds, groups the THPs
into buckets based on utilization, and reports the results through debugfs
under /sys/kernel/debug/thp_utilization.
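The report can be consumed directly from userspace. The following
standalone sketch is not part of the patch; the parsing logic is
illustrative and assumes the line format shown in the sample output above.
It reads /sys/kernel/debug/thp_utilization and reports what share of THPs
and of zero pages falls into the lowest utilization bucket:

/* Illustrative userspace reader for the debugfs report (not part of the patch). */
#include <stdio.h>

int main(void)
{
	FILE *fp = fopen("/sys/kernel/debug/thp_utilization", "r");
	char line[256];
	long nr_thps, nr_zero, total_thps = 0, total_zero = 0;
	long low_thps = 0, low_zero = 0;
	int lo, hi, first = 1;

	if (!fp) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), fp)) {
		if (sscanf(line, "Utilized[%d-%d]: %ld %ld",
			   &lo, &hi, &nr_thps, &nr_zero) != 4)
			continue;	/* skip the "Last Scan" lines */
		if (first) {		/* first line is the [0-50] bucket */
			low_thps = nr_thps;
			low_zero = nr_zero;
			first = 0;
		}
		total_thps += nr_thps;
		total_zero += nr_zero;
	}
	fclose(fp);
	if (total_thps && total_zero)
		printf("lowest bucket: %ld%% of THPs, %ld%% of zero pages\n",
		       100 * low_thps / total_thps,
		       100 * low_zero / total_zero);
	return 0;
}

Run against the sample output above, this prints approximately the
76% / 99% split quoted in the text.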
Signed-off-by: Alexander Zhu --- v3 to v4 -changed thp_utilization_bucket() function to take folios, saves conversion between page and folio -added newlines where they were previously missing in v2-v3 -moved the thp utilization code out into its own file under mm/thp_utilization.c -removed is_anonymous_transparent_hugepage function. Use folio_test_anon and folio_test_trans_huge instead. -changed thp_number_utilized_pages to use memchr_inv v1 to v2 -reversed ordering of is_transparent_hugepage and PageAnon in is_anon_transparent_hugepage, page->mapping is only meaningful for user pages RFC to v1 -Refactored out the code to obtain the thp_utilization_bucket, as that now has to be used in multiple places. Documentation/admin-guide/mm/transhuge.rst | 9 + mm/Makefile | 2 +- mm/thp_utilization.c | 203 +++++++++++++++++++++ 3 files changed, 213 insertions(+), 1 deletion(-) create mode 100644 mm/thp_utilization.c diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 8ee78ec232eb..21d86303c97e 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -304,6 +304,15 @@ To identify what applications are mapping file transparent huge pages, it is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields for each mapping. +The utilization of transparent hugepages can be viewed by reading +``/sys/kernel/debug/thp_utilization``. The utilization of a THP is defined +as the ratio of non zero filled 4kb pages to the total number of pages in a +THP. The buckets are labelled by the range of total utilized 4kb pages with +one line per utilization bucket. Each line contains the total number of +THPs in that bucket and the total number of zero filled 4kb pages summed +over all THPs in that bucket. The last two lines show the timestamp and +duration respectively of the most recent scan over all of physical memory. + Note that reading the smaps file is expensive and reading it frequently will incur overhead. diff --git a/mm/Makefile b/mm/Makefile index 8e105e5b3e29..5f76dc6ce044 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -95,7 +95,7 @@ obj-$(CONFIG_MEMTEST) += memtest.o obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_NUMA) += memory-tiers.o obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o -obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o +obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o thp_utilization.o obj-$(CONFIG_PAGE_COUNTER) += page_counter.o obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o ifdef CONFIG_SWAP diff --git a/mm/thp_utilization.c b/mm/thp_utilization.c new file mode 100644 index 000000000000..7b79f8759d12 --- /dev/null +++ b/mm/thp_utilization.c @@ -0,0 +1,203 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2022 Meta, Inc. + * Authors: Alexander Zhu, Johannes Weiner, Rik van Riel + */ + +#include +#include +#include +/* + * The number of utilization buckets THPs will be grouped in + * under /sys/kernel/debug/thp_utilization. + */ +#define THP_UTIL_BUCKET_NR 10 +/* + * The number of hugepages to scan through on each periodic + * run of the scanner that generates /sys/kernel/debug/thp_utilization. 
+ */ +#define THP_UTIL_SCAN_SIZE 256 + +static void thp_utilization_workfn(struct work_struct *work); +static DECLARE_DELAYED_WORK(thp_utilization_work, thp_utilization_workfn); + +struct thp_scan_info_bucket { + int nr_thps; + int nr_zero_pages; +}; + +struct thp_scan_info { + struct thp_scan_info_bucket buckets[THP_UTIL_BUCKET_NR]; + struct zone *scan_zone; + struct timespec64 last_scan_duration; + struct timespec64 last_scan_time; + unsigned long pfn; +}; + +/* + * thp_scan_debugfs is referred to when /sys/kernel/debug/thp_utilization + * is opened. thp_scan is used to keep track fo the current scan through + * physical memory. + */ +static struct thp_scan_info thp_scan_debugfs; +static struct thp_scan_info thp_scan; + +#ifdef CONFIG_DEBUG_FS +static int thp_utilization_show(struct seq_file *seqf, void *pos) +{ + int i; + int start; + int end; + + for (i = 0; i < THP_UTIL_BUCKET_NR; i++) { + start = i * HPAGE_PMD_NR / THP_UTIL_BUCKET_NR; + end = (i + 1 == THP_UTIL_BUCKET_NR) + ? HPAGE_PMD_NR + : ((i + 1) * HPAGE_PMD_NR / THP_UTIL_BUCKET_NR - 1); + /* The last bucket will need to contain 100 */ + seq_printf(seqf, "Utilized[%d-%d]: %d %d\n", start, end, + thp_scan_debugfs.buckets[i].nr_thps, + thp_scan_debugfs.buckets[i].nr_zero_pages); + } + + seq_printf(seqf, "Last Scan Time: %lu.%02lus\n", + (unsigned long)thp_scan_debugfs.last_scan_time.tv_sec, + (thp_scan_debugfs.last_scan_time.tv_nsec / (NSEC_PER_SEC / 100))); + + seq_printf(seqf, "Last Scan Duration: %lu.%02lus\n", + (unsigned long)thp_scan_debugfs.last_scan_duration.tv_sec, + (thp_scan_debugfs.last_scan_duration.tv_nsec / (NSEC_PER_SEC / 100))); + + return 0; +} +DEFINE_SHOW_ATTRIBUTE(thp_utilization); + +static int __init thp_utilization_debugfs(void) +{ + debugfs_create_file("thp_utilization", 0200, NULL, NULL, + &thp_utilization_fops); + return 0; +} +late_initcall(thp_utilization_debugfs); +#endif + +static int thp_utilization_bucket(int num_utilized_pages) +{ + int bucket; + + if (num_utilized_pages < 0 || num_utilized_pages > HPAGE_PMD_NR) + return -1; + + /* Group THPs into utilization buckets */ + bucket = num_utilized_pages * THP_UTIL_BUCKET_NR / HPAGE_PMD_NR; + return min(bucket, THP_UTIL_BUCKET_NR - 1); +} + +static int thp_number_utilized_pages(struct folio *folio) +{ + int thp_nr_utilized_pages = HPAGE_PMD_NR; + void *kaddr; + int i; + + if (!folio || !folio_test_anon(folio) || !folio_test_transhuge(folio)) + return -1; + + for (i = 0; i < folio_nr_pages(folio); i++) { + kaddr = kmap_local_folio(folio, i); + if (memchr_inv(kaddr, 0, PAGE_SIZE)) + thp_nr_utilized_pages--; + + kunmap_local(kaddr); + } + + return thp_nr_utilized_pages; +} + +static void thp_scan_next_zone(void) +{ + struct timespec64 current_time; + bool update_debugfs; + /* + * THP utilization worker thread has reached the end + * of the memory zone. Proceed to the next zone. + */ + thp_scan.scan_zone = next_zone(thp_scan.scan_zone); + update_debugfs = !thp_scan.scan_zone; + thp_scan.scan_zone = update_debugfs ? 
(first_online_pgdat())->node_zones + : thp_scan.scan_zone; + thp_scan.pfn = (thp_scan.scan_zone->zone_start_pfn + HPAGE_PMD_NR - 1) + & ~(HPAGE_PMD_SIZE - 1); + if (!update_debugfs) + return; + + /* + * If the worker has scanned through all of physical memory then + * update information displayed in /sys/kernel/debug/thp_utilization + */ + ktime_get_ts64(¤t_time); + thp_scan_debugfs.last_scan_duration = timespec64_sub(current_time, + thp_scan_debugfs.last_scan_time); + thp_scan_debugfs.last_scan_time = current_time; + + memcpy(&thp_scan_debugfs.buckets, &thp_scan.buckets, sizeof(thp_scan.buckets)); + memset(&thp_scan.buckets, 0, sizeof(thp_scan.buckets)); +} + +static void thp_util_scan(unsigned long pfn_end) +{ + struct page *page = NULL; + int bucket, current_pfn, num_utilized_pages; + int i; + /* + * Scan through each memory zone in chunks of THP_UTIL_SCAN_SIZE + * PFNs every second looking for anonymous THPs. + */ + for (i = 0; i < THP_UTIL_SCAN_SIZE; i++) { + current_pfn = thp_scan.pfn; + thp_scan.pfn += HPAGE_PMD_NR; + if (current_pfn >= pfn_end) + return; + + page = pfn_to_online_page(current_pfn); + if (!page) + continue; + + num_utilized_pages = thp_number_utilized_pages(page_folio(page)); + bucket = thp_utilization_bucket(num_utilized_pages); + if (bucket < 0) + continue; + + thp_scan.buckets[bucket].nr_thps++; + thp_scan.buckets[bucket].nr_zero_pages += (HPAGE_PMD_NR - num_utilized_pages); + } +} + +static void thp_utilization_workfn(struct work_struct *work) +{ + unsigned long pfn_end; + /* + * Worker function that scans through all of physical memory + * for anonymous THPs. + */ + if (!thp_scan.scan_zone) + thp_scan.scan_zone = (first_online_pgdat())->node_zones; + + pfn_end = zone_end_pfn(thp_scan.scan_zone); + /* If we have reached the end of the zone or end of physical memory + * move on to the next zone. Otherwise, scan the next PFNs in the + * current zone. 
+ */
+	if (!managed_zone(thp_scan.scan_zone) || thp_scan.pfn >= pfn_end)
+		thp_scan_next_zone();
+	else
+		thp_util_scan(pfn_end);
+
+	schedule_delayed_work(&thp_utilization_work, HZ);
+}
+
+static int __init thp_scan_init(void)
+{
+	schedule_delayed_work(&thp_utilization_work, HZ);
+	return 0;
+}
+subsys_initcall(thp_scan_init);

From patchwork Wed Oct 19 03:42:19 2022
X-Patchwork-Submitter: "Alex Zhu (Kernel)"
X-Patchwork-Id: 13011335
Subject: [PATCH v4 2/3] mm: changes to split_huge_page() to free zero filled tail pages
Date: Tue, 18 Oct 2022 20:42:19 -0700

From: Alexander Zhu

Currently, when /sys/kernel/mm/transparent_hugepage/enabled=always is set,
there are a large number of transparent hugepages that are almost entirely
zero filled. This has been mentioned in a number of previous patchsets,
including:

https://lore.kernel.org/all/20210731063938.1391602-1-yuzhao@google.com/
https://lore.kernel.org/all/1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com/

Currently, split_huge_page() has no way to identify zero filled pages
within a THP, so these zero pages get remapped and continue to create
memory waste. In this patch, we identify and free tail pages that are zero
filled in split_huge_page(). In this way, we avoid mapping these pages
back into page table entries and can free up unused memory within THPs.

This is based off the previously mentioned patchset by Yu Zhao. However,
we chose to free anonymous zero tail pages whenever they are encountered,
instead of only on reclaim or migration. We also add self tests that
verify the RssAnon value to make sure zero pages are not remapped, except
in the case of userfaultfd. In the case of userfaultfd we remap to the
shared zero page, similar to what is done by KSM.

Signed-off-by: Alexander Zhu
---
v1 to v2
-Modified split_huge_page self test based off more recent changes.

RFC to v1
-Added support to map to the read-only zero page when splitting a THP
 registered with userfaultfd. Also added a self test to verify that this
 is working.
-Only trigger the unmap_clean/zap in split_huge_page on anonymous THPs.
 We cannot zap zero pages for file THPs.
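For illustration, the per-subpage check that decides whether a tail page
can be dropped instead of remapped is simply "does this 4kB page contain
any non-zero byte?" (memchr_inv() against the kernel mapping in
try_to_unmap_clean()). The userspace sketch below is not part of the
patch; the constants and helper logic are made up. It models that test
over every 4kB subpage of a 2MB THP-sized buffer:

/* Userspace model of the zero-subpage test (illustrative only). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SUBPAGE_SIZE	4096UL
#define HPAGE_SIZE	(512 * SUBPAGE_SIZE)	/* PMD-sized THP */

int main(void)
{
	static const unsigned char zeroes[SUBPAGE_SIZE];
	unsigned char *thp = calloc(1, HPAGE_SIZE);
	unsigned long i, freeable = 0;

	if (!thp)
		return 1;

	/* Touch a couple of subpages; the rest stay zero filled. */
	memset(thp + 3 * SUBPAGE_SIZE, 0xaa, SUBPAGE_SIZE);
	memset(thp + 100 * SUBPAGE_SIZE, 0xbb, 1);

	for (i = 0; i < HPAGE_SIZE / SUBPAGE_SIZE; i++)
		if (!memcmp(thp + i * SUBPAGE_SIZE, zeroes, SUBPAGE_SIZE))
			freeable++;	/* would not be remapped on split */

	printf("%lu of %lu subpages are zero filled and could be freed\n",
	       freeable, HPAGE_SIZE / SUBPAGE_SIZE);
	free(thp);
	return 0;
}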
include/linux/rmap.h | 2 +- include/linux/vm_event_item.h | 3 + mm/huge_memory.c | 45 ++++++- mm/migrate.c | 73 +++++++++-- mm/migrate_device.c | 4 +- mm/vmstat.c | 3 + .../selftests/vm/split_huge_page_test.c | 115 +++++++++++++++++- tools/testing/selftests/vm/vm_util.c | 23 ++++ tools/testing/selftests/vm/vm_util.h | 3 + 9 files changed, 256 insertions(+), 15 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index bd3504d11b15..3f83bbcf1333 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -428,7 +428,7 @@ int folio_mkclean(struct folio *); int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff, struct vm_area_struct *vma); -void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked); +void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked, bool unmap_clean); int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma); diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 3518dba1e02f..3618b10ddec9 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -111,6 +111,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD THP_SPLIT_PUD, #endif + THP_SPLIT_FREE, + THP_SPLIT_UNMAP, + THP_SPLIT_REMAP_READONLY_ZERO_PAGE, THP_ZERO_PAGE_ALLOC, THP_ZERO_PAGE_ALLOC_FAILED, THP_SWPOUT, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 1cc4a5f4791e..f68a353e0adf 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2373,7 +2373,7 @@ static void unmap_folio(struct folio *folio) try_to_unmap(folio, ttu_flags | TTU_IGNORE_MLOCK); } -static void remap_page(struct folio *folio, unsigned long nr) +static void remap_page(struct folio *folio, unsigned long nr, bool unmap_clean) { int i = 0; @@ -2381,7 +2381,7 @@ static void remap_page(struct folio *folio, unsigned long nr) if (!folio_test_anon(folio)) return; for (;;) { - remove_migration_ptes(folio, folio, true); + remove_migration_ptes(folio, folio, true, unmap_clean); i += folio_nr_pages(folio); if (i >= nr) break; @@ -2496,6 +2496,8 @@ static void __split_huge_page(struct page *page, struct list_head *list, struct address_space *swap_cache = NULL; unsigned long offset = 0; unsigned int nr = thp_nr_pages(head); + LIST_HEAD(pages_to_free); + int nr_pages_to_free = 0; int i; /* complete memcg works before add pages to LRU */ @@ -2558,7 +2560,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, } local_irq_enable(); - remap_page(folio, nr); + remap_page(folio, nr, PageAnon(head)); if (PageSwapCache(head)) { swp_entry_t entry = { .val = page_private(head) }; @@ -2572,6 +2574,34 @@ static void __split_huge_page(struct page *page, struct list_head *list, continue; unlock_page(subpage); + /* + * If a tail page has only two references left, one inherited + * from the isolation of its head and the other from + * lru_add_page_tail() which we are about to drop, it means this + * tail page was concurrently zapped. Then we can safely free it + * and save page reclaim or migration the trouble of trying it. 
+ */ + if (list && page_ref_freeze(subpage, 2)) { + VM_BUG_ON_PAGE(PageLRU(subpage), subpage); + VM_BUG_ON_PAGE(PageCompound(subpage), subpage); + VM_BUG_ON_PAGE(page_mapped(subpage), subpage); + + ClearPageActive(subpage); + ClearPageUnevictable(subpage); + list_move(&subpage->lru, &pages_to_free); + nr_pages_to_free++; + continue; + } + + /* + * If a tail page has only one reference left, it will be freed + * by the call to free_page_and_swap_cache below. Since zero + * subpages are no longer remapped, there will only be one + * reference left in cases outside of reclaim or migration. + */ + if (page_ref_count(subpage) == 1) + nr_pages_to_free++; + /* * Subpages may be freed if there wasn't any mapping * like if add_to_swap() is running on a lru page that @@ -2581,6 +2611,13 @@ static void __split_huge_page(struct page *page, struct list_head *list, */ free_page_and_swap_cache(subpage); } + + if (!nr_pages_to_free) + return; + + mem_cgroup_uncharge_list(&pages_to_free); + free_unref_page_list(&pages_to_free); + count_vm_events(THP_SPLIT_FREE, nr_pages_to_free); } /* Racy check whether the huge page can be split */ @@ -2752,7 +2789,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) if (mapping) xas_unlock(&xas); local_irq_enable(); - remap_page(folio, folio_nr_pages(folio)); + remap_page(folio, folio_nr_pages(folio), false); ret = -EBUSY; } diff --git a/mm/migrate.c b/mm/migrate.c index 1379e1912772..bc96a084d925 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -30,6 +30,7 @@ #include #include #include +#include #include #include #include @@ -168,13 +169,62 @@ void putback_movable_pages(struct list_head *l) } } +static bool try_to_unmap_clean(struct page_vma_mapped_walk *pvmw, struct page *page) +{ + void *addr; + bool dirty; + pte_t newpte; + + VM_BUG_ON_PAGE(PageCompound(page), page); + VM_BUG_ON_PAGE(!PageAnon(page), page); + VM_BUG_ON_PAGE(!PageLocked(page), page); + VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page); + + if (PageMlocked(page) || (pvmw->vma->vm_flags & VM_LOCKED)) + return false; + + /* + * The pmd entry mapping the old thp was flushed and the pte mapping + * this subpage has been non present. Therefore, this subpage is + * inaccessible. We don't need to remap it if it contains only zeros. 
+ */ + addr = kmap_local_page(page); + dirty = memchr_inv(addr, 0, PAGE_SIZE); + kunmap_local(addr); + + if (dirty) + return false; + + pte_clear_not_present_full(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, false); + + if (userfaultfd_armed(pvmw->vma)) { + newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)), + pvmw->vma->vm_page_prot)); + ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte); + set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte); + dec_mm_counter(pvmw->vma->vm_mm, MM_ANONPAGES); + count_vm_event(THP_SPLIT_REMAP_READONLY_ZERO_PAGE); + return true; + } + + dec_mm_counter(pvmw->vma->vm_mm, mm_counter(page)); + count_vm_event(THP_SPLIT_UNMAP); + return true; +} + +struct rmap_walk_arg { + struct folio *folio; + bool unmap_clean; +}; + /* * Restore a potential migration pte to a working pte entry */ static bool remove_migration_pte(struct folio *folio, - struct vm_area_struct *vma, unsigned long addr, void *old) + struct vm_area_struct *vma, unsigned long addr, void *arg) { - DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION); + struct rmap_walk_arg *rmap_walk_arg = arg; + DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->folio, vma, addr, PVMW_SYNC | PVMW_MIGRATION); while (page_vma_mapped_walk(&pvmw)) { rmap_t rmap_flags = RMAP_NONE; @@ -197,6 +247,8 @@ static bool remove_migration_pte(struct folio *folio, continue; } #endif + if (rmap_walk_arg->unmap_clean && try_to_unmap_clean(&pvmw, new)) + continue; folio_get(folio); pte = mk_pte(new, READ_ONCE(vma->vm_page_prot)); @@ -272,13 +324,20 @@ static bool remove_migration_pte(struct folio *folio, * Get rid of all migration entries and replace them by * references to the indicated page. */ -void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked) +void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked, bool unmap_clean) { + struct rmap_walk_arg rmap_walk_arg = { + .folio = src, + .unmap_clean = unmap_clean, + }; + struct rmap_walk_control rwc = { .rmap_one = remove_migration_pte, - .arg = src, + .arg = &rmap_walk_arg, }; + VM_BUG_ON_FOLIO(unmap_clean && src != dst, src); + if (locked) rmap_walk_locked(dst, &rwc); else @@ -872,7 +931,7 @@ static int writeout(struct address_space *mapping, struct folio *folio) * At this point we know that the migration attempt cannot * be successful. */ - remove_migration_ptes(folio, folio, false); + remove_migration_ptes(folio, folio, false, false); rc = mapping->a_ops->writepage(&folio->page, &wbc); @@ -1128,7 +1187,7 @@ static int __unmap_and_move(struct folio *src, struct folio *dst, if (page_was_mapped) remove_migration_ptes(src, - rc == MIGRATEPAGE_SUCCESS ? dst : src, false); + rc == MIGRATEPAGE_SUCCESS ? dst : src, false, false); out_unlock_both: folio_unlock(dst); @@ -1338,7 +1397,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, if (page_was_mapped) remove_migration_ptes(src, - rc == MIGRATEPAGE_SUCCESS ? dst : src, false); + rc == MIGRATEPAGE_SUCCESS ? 
dst : src, false, false); unlock_put_anon: folio_unlock(dst); diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 6fa682eef7a0..6508a083d7fd 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -421,7 +421,7 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns, continue; folio = page_folio(page); - remove_migration_ptes(folio, folio, false); + remove_migration_ptes(folio, folio, false, false); src_pfns[i] = 0; folio_unlock(folio); @@ -847,7 +847,7 @@ void migrate_device_finalize(unsigned long *src_pfns, src = page_folio(page); dst = page_folio(newpage); - remove_migration_ptes(src, dst, false); + remove_migration_ptes(src, dst, false, false); folio_unlock(src); if (is_zone_device_page(page)) diff --git a/mm/vmstat.c b/mm/vmstat.c index b2371d745e00..3d802eb6754d 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1359,6 +1359,9 @@ const char * const vmstat_text[] = { #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD "thp_split_pud", #endif + "thp_split_free", + "thp_split_unmap", + "thp_split_remap_readonly_zero_page", "thp_zero_page_alloc", "thp_zero_page_alloc_failed", "thp_swpout", diff --git a/tools/testing/selftests/vm/split_huge_page_test.c b/tools/testing/selftests/vm/split_huge_page_test.c index 76e1c36dd9e5..42f0e79a4508 100644 --- a/tools/testing/selftests/vm/split_huge_page_test.c +++ b/tools/testing/selftests/vm/split_huge_page_test.c @@ -16,6 +16,9 @@ #include #include #include +#include /* Definition of SYS_* constants */ +#include +#include #include "vm_util.h" uint64_t pagesize; @@ -88,6 +91,115 @@ static void write_debugfs(const char *fmt, ...) } } +static char *allocate_zero_filled_hugepage(size_t len) +{ + char *result; + size_t i; + + result = memalign(pmd_pagesize, len); + if (!result) { + printf("Fail to allocate memory\n"); + exit(EXIT_FAILURE); + } + + madvise(result, len, MADV_HUGEPAGE); + + for (i = 0; i < len; i++) + result[i] = (char)0; + + return result; +} + +static void verify_rss_anon_split_huge_page_all_zeroes(char *one_page, int nr_hpages, size_t len) +{ + uint64_t rss_anon_before, rss_anon_after; + size_t i; + + if (!check_huge_anon(one_page, 4, pmd_pagesize)) { + printf("No THP is allocated\n"); + exit(EXIT_FAILURE); + } + + rss_anon_before = rss_anon(); + if (!rss_anon_before) { + printf("No RssAnon is allocated before split\n"); + exit(EXIT_FAILURE); + } + + /* split all THPs */ + write_debugfs(PID_FMT, getpid(), (uint64_t)one_page, + (uint64_t)one_page + len); + + for (i = 0; i < len; i++) + if (one_page[i] != (char)0) { + printf("%ld byte corrupted\n", i); + exit(EXIT_FAILURE); + } + + if (!check_huge_anon(one_page, 0, pmd_pagesize)) { + printf("Still AnonHugePages not split\n"); + exit(EXIT_FAILURE); + } + + rss_anon_after = rss_anon(); + if (rss_anon_after >= rss_anon_before) { + printf("Incorrect RssAnon value. Before: %ld After: %ld\n", + rss_anon_before, rss_anon_after); + exit(EXIT_FAILURE); + } +} + +void split_pmd_zero_pages(void) +{ + char *one_page; + int nr_hpages = 4; + size_t len = nr_hpages * pmd_pagesize; + + one_page = allocate_zero_filled_hugepage(len); + verify_rss_anon_split_huge_page_all_zeroes(one_page, nr_hpages, len); + printf("Split zero filled huge pages successful\n"); + free(one_page); +} + +void split_pmd_zero_pages_uffd(void) +{ + char *one_page; + int nr_hpages = 4; + size_t len = nr_hpages * pmd_pagesize; + long uffd; /* userfaultfd file descriptor */ + struct uffdio_api uffdio_api; + struct uffdio_register uffdio_register; + + /* Create and enable userfaultfd object. 
*/ + + uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK); + if (uffd == -1) { + perror("userfaultfd"); + exit(1); + } + + uffdio_api.api = UFFD_API; + uffdio_api.features = 0; + if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) { + perror("ioctl-UFFDIO_API"); + exit(1); + } + + one_page = allocate_zero_filled_hugepage(len); + + uffdio_register.range.start = (unsigned long)one_page; + uffdio_register.range.len = len; + uffdio_register.mode = UFFDIO_REGISTER_MODE_WP; + if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) { + perror("ioctl-UFFDIO_REGISTER"); + exit(1); + } + + verify_rss_anon_split_huge_page_all_zeroes(one_page, nr_hpages, len); + printf("Split zero filled huge pages with uffd successful\n"); + free(one_page); +} + void split_pmd_thp(void) { char *one_page; @@ -121,7 +233,6 @@ void split_pmd_thp(void) exit(EXIT_FAILURE); } - if (check_huge_anon(one_page, 0, pmd_pagesize)) { printf("Still AnonHugePages not split\n"); exit(EXIT_FAILURE); @@ -301,6 +412,8 @@ int main(int argc, char **argv) pageshift = ffs(pagesize) - 1; pmd_pagesize = read_pmd_pagesize(); + split_pmd_zero_pages(); + split_pmd_zero_pages_uffd(); split_pmd_thp(); split_pte_mapped_thp(); split_file_backed_thp(); diff --git a/tools/testing/selftests/vm/vm_util.c b/tools/testing/selftests/vm/vm_util.c index f11f8adda521..72f3edc64aaf 100644 --- a/tools/testing/selftests/vm/vm_util.c +++ b/tools/testing/selftests/vm/vm_util.c @@ -6,6 +6,7 @@ #define PMD_SIZE_FILE_PATH "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size" #define SMAP_FILE_PATH "/proc/self/smaps" +#define STATUS_FILE_PATH "/proc/self/status" #define MAX_LINE_LENGTH 500 uint64_t pagemap_get_entry(int fd, char *start) @@ -72,6 +73,28 @@ uint64_t read_pmd_pagesize(void) return strtoul(buf, NULL, 10); } +uint64_t rss_anon(void) +{ + uint64_t rss_anon = 0; + int ret; + FILE *fp; + char buffer[MAX_LINE_LENGTH]; + + fp = fopen(STATUS_FILE_PATH, "r"); + if (!fp) + ksft_exit_fail_msg("%s: Failed to open file %s\n", __func__, STATUS_FILE_PATH); + + if (!check_for_pattern(fp, "RssAnon:", buffer, sizeof(buffer))) + goto err_out; + + if (sscanf(buffer, "RssAnon:%10ld kB", &rss_anon) != 1) + ksft_exit_fail_msg("Reading status error\n"); + +err_out: + fclose(fp); + return rss_anon; +} + bool __check_huge(void *addr, char *pattern, int nr_hpages, uint64_t hpage_size) { diff --git a/tools/testing/selftests/vm/vm_util.h b/tools/testing/selftests/vm/vm_util.h index 5c35de454e08..dd1885f66097 100644 --- a/tools/testing/selftests/vm/vm_util.h +++ b/tools/testing/selftests/vm/vm_util.h @@ -1,12 +1,15 @@ /* SPDX-License-Identifier: GPL-2.0 */ #include #include +#include +#include uint64_t pagemap_get_entry(int fd, char *start); bool pagemap_is_softdirty(int fd, char *start); void clear_softdirty(void); bool check_for_pattern(FILE *fp, const char *pattern, char *buf, size_t len); uint64_t read_pmd_pagesize(void); +uint64_t rss_anon(void); bool check_huge_anon(void *addr, int nr_hpages, uint64_t hpage_size); bool check_huge_file(void *addr, int nr_hpages, uint64_t hpage_size); bool check_huge_shmem(void *addr, int nr_hpages, uint64_t hpage_size); From patchwork Wed Oct 19 03:42:20 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Alex Zhu (Kernel)" X-Patchwork-Id: 13011336 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 
Subject: [PATCH v4 3/3] mm: THP low utilization shrinker
Date: Tue, 18 Oct 2022 20:42:20 -0700
"prvs=129168f899=alexlzhu@meta.com" designates 67.231.145.42 as permitted sender) smtp.mailfrom="prvs=129168f899=alexlzhu@meta.com"; dmarc=pass (policy=reject) header.from=fb.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1666150956; a=rsa-sha256; cv=none; b=LK6CRP0zCTs6hYe9IxyjUqBH8NllyUsQzLItR0DnmJsc/ydKqj2AMlNaW9EyfCNIWvHhdF VWP47h7L9wEll9enP8nawKRBbOOl2KiS5zqGzvC4uCjZB76tHaIGrW00/pWbdOouN8cFBo Rh4mzivqAZdhj0mKJ/1s7gWKPm4/EU4= X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: 0A0EDC0026 X-Rspam-User: Authentication-Results: imf10.hostedemail.com; dkim=pass header.d=fb.com header.s=facebook header.b=GVp7KkPJ; spf=pass (imf10.hostedemail.com: domain of "prvs=129168f899=alexlzhu@meta.com" designates 67.231.145.42 as permitted sender) smtp.mailfrom="prvs=129168f899=alexlzhu@meta.com"; dmarc=pass (policy=reject) header.from=fb.com X-Stat-Signature: mf5t8sx7x9jksc3ywfgn45xgae7ebw9i X-HE-Tag: 1666150955-473731 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Alexander Zhu This patch introduces a shrinker that will remove THPs in the lowest utilization bucket. As previously mentioned, we have observed that almost all of the memory waste when THPs are always enabled is contained in the lowest utilization bucket. The shrinker will add these THPs to a list_lru and split anonymous THPs based off information from kswapd. It requires the changes from thp_utilization to identify the least utilized THPs, and the changes to split_huge_page to identify and free zero pages within THPs. Signed-off-by: Alexander Zhu --- v3 to v4 -added some comments regardling trylock -change the relock to be unconditional in low_util_free_page -only expose can_shrink_thp, abstract the thp_utilization and bucket logic to be private to mm/thp_utilization.c v2 to v3 -put_page() after trylock_page in low_util_free_page. put() to be called after get() call -removed spin_unlock_irq in low_util_free_page above LRU_SKIP. There was a double unlock. -moved spin_unlock_irq() to below list_lru_isolate() in low_util_free_page. This is to shorten the critical section. -moved lock_page in add_underutilized_thp such that we only lock when allocating and adding to the list_lru -removed list_lru_alloc in list_lru_add_page and list_lru_delete_page as these are no longer needed. v1 to v2 -Changed lru_lock to be irq safe. Added irq_save and restore around list_lru adds/deletes. -Changed low_util_free_page() to trylock the page, and if it fails, unlock lru_lock and return LRU_SKIP. This is to avoid deadlock between reclaim, which calls split_huge_page() and the THP Shrinker -Changed low_util_free_page() to unlock lru_lock, split_huge_page, then lock lru_lock. This way split_huge_page is not called with the lru_lock held. That leads to deadlock as split_huge_page calls on_each_cpu_mask -Changed list_lru_shrink_walk to list_lru_shrink_walk_irq. RFC to v1 -Remove all THPs that are not in the top utilization bucket. This is what we have found to perform the best in production testing, we have found that there are an almost trivial number of THPs in the middle range of buckets that account for most of the memory waste. -Added check for THP utilization prior to split_huge_page for the THP Shrinker. This is to account for THPs that move to the top bucket, but were underutilized at the time they were added to the list_lru. -Multiply the shrink_count and scan_count by HPAGE_PMD_NR. 
This is because a THP is 512 pages, and should count as 512 objects in reclaim. This way reclaim is triggered at a more appropriate frequency than in the RFC. include/linux/huge_mm.h | 9 ++++ include/linux/list_lru.h | 24 +++++++++ include/linux/mm_types.h | 5 ++ mm/huge_memory.c | 110 ++++++++++++++++++++++++++++++++++++++- mm/list_lru.c | 49 +++++++++++++++++ mm/page_alloc.c | 6 +++ mm/thp_utilization.c | 13 +++++ 7 files changed, 214 insertions(+), 2 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index a1341fdcf666..1745c94eb103 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -178,6 +178,8 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags, unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags); +bool can_shrink_thp(struct folio *folio); + void prep_transhuge_page(struct page *page); void free_transhuge_page(struct page *page); @@ -189,6 +191,8 @@ static inline int split_huge_page(struct page *page) } void deferred_split_huge_page(struct page *page); +void add_underutilized_thp(struct page *page); + void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, unsigned long address, bool freeze, struct folio *folio); @@ -302,6 +306,11 @@ static inline struct list_head *page_deferred_list(struct page *page) return &page[2].deferred_list; } +static inline struct list_head *page_underutilized_thp_list(struct page *page) +{ + return &page[3].underutilized_thp_list; +} + #else /* CONFIG_TRANSPARENT_HUGEPAGE */ #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; }) diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h index b35968ee9fb5..c2cf146ea880 100644 --- a/include/linux/list_lru.h +++ b/include/linux/list_lru.h @@ -89,6 +89,18 @@ void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren */ bool list_lru_add(struct list_lru *lru, struct list_head *item); +/** + * list_lru_add_page: add an element to the lru list's tail + * @list_lru: the lru pointer + * @page: the page containing the item + * @item: the item to be deleted. + * + * This function works the same as list_lru_add in terms of list + * manipulation. Used for non slab objects contained in the page. + * + * Return value: true if the list was updated, false otherwise + */ +bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item); /** * list_lru_del: delete an element to the lru list * @list_lru: the lru pointer @@ -102,6 +114,18 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item); */ bool list_lru_del(struct list_lru *lru, struct list_head *item); +/** + * list_lru_del_page: delete an element to the lru list + * @list_lru: the lru pointer + * @page: the page containing the item + * @item: the item to be deleted. + * + * This function works the same as list_lru_del in terms of list + * manipulation. Used for non slab objects contained in the page. + * + * Return value: true if the list was updated, false otherwise + */ +bool list_lru_del_page(struct list_lru *lru, struct page *page, struct list_head *item); /** * list_lru_count_one: return the number of objects currently held by @lru * @lru: the lru pointer. 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 500e536796ca..da1d1cf42158 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -152,6 +152,11 @@ struct page { /* For both global and memcg */ struct list_head deferred_list; }; + struct { /* Third tail page of compound page */ + unsigned long _compound_pad_3; /* compound_head */ + unsigned long _compound_pad_4; + struct list_head underutilized_thp_list; + }; struct { /* Page table pages */ unsigned long _pt_pad_1; /* compound_head */ pgtable_t pmd_huge_pte; /* protected by page->ptl */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f68a353e0adf..76d39ceceb05 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -71,6 +71,8 @@ static atomic_t huge_zero_refcount; struct page *huge_zero_page __read_mostly; unsigned long huge_zero_pfn __read_mostly = ~0UL; +static struct list_lru huge_low_util_page_lru; + bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags, bool smaps, bool in_pf, bool enforce_sysfs) { @@ -234,6 +236,53 @@ static struct shrinker huge_zero_page_shrinker = { .seeks = DEFAULT_SEEKS, }; +static enum lru_status low_util_free_page(struct list_head *item, + struct list_lru_one *lru, + spinlock_t *lru_lock, + void *cb_arg) +{ + struct folio *folio = lru_to_folio(item); + struct page *head = &folio->page; + + if (get_page_unless_zero(head)) { + /* Inverse lock order from add_underutilized_thp() */ + if (!trylock_page(head)) { + put_page(head); + return LRU_SKIP; + } + list_lru_isolate(lru, item); + spin_unlock_irq(lru_lock); + if (can_shrink_thp(folio)) + split_huge_page(head); + spin_lock_irq(lru_lock); + unlock_page(head); + put_page(head); + } + + return LRU_REMOVED_RETRY; +} + +static unsigned long shrink_huge_low_util_page_count(struct shrinker *shrink, + struct shrink_control *sc) +{ + return HPAGE_PMD_NR * list_lru_shrink_count(&huge_low_util_page_lru, sc); +} + +static unsigned long shrink_huge_low_util_page_scan(struct shrinker *shrink, + struct shrink_control *sc) +{ + return HPAGE_PMD_NR * list_lru_shrink_walk_irq(&huge_low_util_page_lru, + sc, low_util_free_page, NULL); +} + +static struct shrinker huge_low_util_page_shrinker = { + .count_objects = shrink_huge_low_util_page_count, + .scan_objects = shrink_huge_low_util_page_scan, + .seeks = DEFAULT_SEEKS, + .flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE | + SHRINKER_NONSLAB, +}; + #ifdef CONFIG_SYSFS static ssize_t enabled_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) @@ -485,6 +534,9 @@ static int __init hugepage_init(void) if (err) goto err_slab; + err = register_shrinker(&huge_low_util_page_shrinker, "thp-low-util"); + if (err) + goto err_low_util_shrinker; err = register_shrinker(&huge_zero_page_shrinker, "thp-zero"); if (err) goto err_hzp_shrinker; @@ -492,6 +544,9 @@ static int __init hugepage_init(void) if (err) goto err_split_shrinker; + err = list_lru_init_memcg(&huge_low_util_page_lru, &huge_low_util_page_shrinker); + if (err) + goto err_low_util_list_lru; /* * By default disable transparent hugepages on smaller systems, * where the extra memory used could hurt more than TLB overhead @@ -508,10 +563,14 @@ static int __init hugepage_init(void) return 0; err_khugepaged: + list_lru_destroy(&huge_low_util_page_lru); +err_low_util_list_lru: unregister_shrinker(&deferred_split_shrinker); err_split_shrinker: unregister_shrinker(&huge_zero_page_shrinker); err_hzp_shrinker: + unregister_shrinker(&huge_low_util_page_shrinker); +err_low_util_shrinker: khugepaged_destroy(); 
err_slab: hugepage_exit_sysfs(hugepage_kobj); @@ -586,6 +645,7 @@ void prep_transhuge_page(struct page *page) */ INIT_LIST_HEAD(page_deferred_list(page)); + INIT_LIST_HEAD(page_underutilized_thp_list(page)); set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR); } @@ -2451,8 +2511,7 @@ static void __split_huge_page_tail(struct page *head, int tail, LRU_GEN_MASK | LRU_REFS_MASK)); /* ->mapping in first tail page is compound_mapcount */ - VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING, - page_tail); + VM_BUG_ON_PAGE(tail > 3 && page_tail->mapping != TAIL_MAPPING, page_tail); page_tail->mapping = head->mapping; page_tail->index = head->index + tail; page_tail->private = 0; @@ -2660,6 +2719,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) struct folio *folio = page_folio(page); struct deferred_split *ds_queue = get_deferred_split_queue(&folio->page); XA_STATE(xas, &folio->mapping->i_pages, folio->index); + struct list_head *underutilized_thp_list = page_underutilized_thp_list(&folio->page); struct anon_vma *anon_vma = NULL; struct address_space *mapping = NULL; int extra_pins, ret; @@ -2767,6 +2827,10 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) list_del(page_deferred_list(&folio->page)); } spin_unlock(&ds_queue->split_queue_lock); + /* Frozen refs lock out additions, test can be lockless */ + if (!list_empty(underutilized_thp_list)) + list_lru_del_page(&huge_low_util_page_lru, &folio->page, + underutilized_thp_list); if (mapping) { int nr = folio_nr_pages(folio); @@ -2809,6 +2873,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) void free_transhuge_page(struct page *page) { struct deferred_split *ds_queue = get_deferred_split_queue(page); + struct list_head *underutilized_thp_list = page_underutilized_thp_list(page); unsigned long flags; spin_lock_irqsave(&ds_queue->split_queue_lock, flags); @@ -2817,6 +2882,13 @@ void free_transhuge_page(struct page *page) list_del(page_deferred_list(page)); } spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); + /* A dead page cannot be re-added to the THP shrinker, test can be lockless */ + if (!list_empty(underutilized_thp_list)) + list_lru_del_page(&huge_low_util_page_lru, page, underutilized_thp_list); + + if (PageLRU(page)) + __folio_clear_lru_flags(page_folio(page)); + free_compound_page(page); } @@ -2857,6 +2929,40 @@ void deferred_split_huge_page(struct page *page) spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); } +void add_underutilized_thp(struct page *page) +{ + VM_BUG_ON_PAGE(!PageTransHuge(page), page); + VM_BUG_ON_PAGE(!PageAnon(page), page); + + if (PageSwapCache(page)) + return; + + /* + * Need to take a reference on the page to prevent the page from getting free'd from + * under us while we are adding the THP to the shrinker. 
+ */ + if (!get_page_unless_zero(page)) + return; + + if (is_huge_zero_page(page)) + goto out_put; + + /* Stabilize page->memcg to allocate and add to the same list */ + lock_page(page); + +#ifdef CONFIG_MEMCG_KMEM + if (memcg_list_lru_alloc(page_memcg(page), &huge_low_util_page_lru, GFP_KERNEL)) + goto out_unlock; +#endif + + list_lru_add_page(&huge_low_util_page_lru, page, page_underutilized_thp_list(page)); + +out_unlock: + unlock_page(page); +out_put: + put_page(page); +} + static unsigned long deferred_split_count(struct shrinker *shrink, struct shrink_control *sc) { diff --git a/mm/list_lru.c b/mm/list_lru.c index a05e5bef3b40..8cc56a84b554 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -140,6 +140,32 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item) } EXPORT_SYMBOL_GPL(list_lru_add); +bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item) +{ + int nid = page_to_nid(page); + struct list_lru_node *nlru = &lru->node[nid]; + struct list_lru_one *l; + struct mem_cgroup *memcg; + unsigned long flags; + + spin_lock_irqsave(&nlru->lock, flags); + if (list_empty(item)) { + memcg = page_memcg(page); + l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg)); + list_add_tail(item, &l->list); + /* Set shrinker bit if the first element was added */ + if (!l->nr_items++) + set_shrinker_bit(memcg, nid, + lru_shrinker_id(lru)); + nlru->nr_items++; + spin_unlock_irqrestore(&nlru->lock, flags); + return true; + } + spin_unlock_irqrestore(&nlru->lock, flags); + return false; +} +EXPORT_SYMBOL_GPL(list_lru_add_page); + bool list_lru_del(struct list_lru *lru, struct list_head *item) { int nid = page_to_nid(virt_to_page(item)); @@ -160,6 +186,29 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item) } EXPORT_SYMBOL_GPL(list_lru_del); +bool list_lru_del_page(struct list_lru *lru, struct page *page, struct list_head *item) +{ + int nid = page_to_nid(page); + struct list_lru_node *nlru = &lru->node[nid]; + struct list_lru_one *l; + struct mem_cgroup *memcg; + unsigned long flags; + + spin_lock_irqsave(&nlru->lock, flags); + if (!list_empty(item)) { + memcg = page_memcg(page); + l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg)); + list_del_init(item); + l->nr_items--; + nlru->nr_items--; + spin_unlock_irqrestore(&nlru->lock, flags); + return true; + } + spin_unlock_irqrestore(&nlru->lock, flags); + return false; +} +EXPORT_SYMBOL_GPL(list_lru_del_page); + void list_lru_isolate(struct list_lru_one *list, struct list_head *item) { list_del_init(item); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e20ade858e71..31380526a9f4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1335,6 +1335,12 @@ static int free_tail_pages_check(struct page *head_page, struct page *page) * deferred_list.next -- ignore value. */ break; + case 3: + /* + * the third tail page: ->mapping is + * underutilized_thp_list.next -- ignore value. 
+ */ + break; default: if (page->mapping != TAIL_MAPPING) { bad_page(page, "corrupted mapping in tail page"); diff --git a/mm/thp_utilization.c b/mm/thp_utilization.c index 7b79f8759d12..d0efcffde50a 100644 --- a/mm/thp_utilization.c +++ b/mm/thp_utilization.c @@ -113,6 +113,19 @@ static int thp_number_utilized_pages(struct folio *folio) return thp_nr_utilized_pages; } +bool can_shrink_thp(struct folio *folio) +{ + int bucket, num_utilized_pages; + + if (!folio || !folio_test_anon(folio) || !folio_test_transhuge(folio)) + return false; + + num_utilized_pages = thp_number_utilized_pages(folio); + bucket = thp_utilization_bucket(num_utilized_pages); + + return bucket < THP_UTIL_BUCKET_NR - 1; +} + static void thp_scan_next_zone(void) { struct timespec64 current_time;