From patchwork Thu Nov 3 06:01:43 2022
X-Patchwork-Submitter: "Alex Zhu (Kernel)"
X-Patchwork-Id: 13029577
Subject: [PATCH v6 1/5] mm: add thp_utilization metrics to debugfs
Date: Wed, 2 Nov 2022 23:01:43 -0700
X-Mailer: git-send-email 2.30.2
Sender: owner-linux-mm@kvack.org

From: Alexander Zhu

This change introduces a tool that scans through all of physical memory
for anonymous THPs and groups them into buckets based on utilization. It
also includes an interface under /sys/kernel/debug/thp_utilization.

Sample Output:

Utilized[0-50]: 1331 680884
Utilized[51-101]: 9 3983
Utilized[102-152]: 3 1187
Utilized[153-203]: 0 0
Utilized[204-255]: 2 539
Utilized[256-306]: 5 1135
Utilized[307-357]: 1 192
Utilized[358-408]: 0 0
Utilized[409-459]: 1 57
Utilized[460-512]: 400 13
Last Scan Time: 223.98s
Last Scan Duration: 70.65s

This indicates that there are 1331 THPs that have between 0 and 50
utilized (non-zero) pages. In total there are 680884 zero pages in this
utilization bucket. THPs in the [0-50] bucket make up 76% of total THPs,
and are responsible for 99% of total zero pages across all THPs.
In other words, the least utilized THPs are responsible for almost all
of the memory waste when THP is always enabled. Similar results have
been observed across production workloads.

The last two lines indicate the timestamp and duration of the most
recent scan through all of physical memory. Here we see that the last
scan occurred 223.98 seconds after boot time and took 70.65 seconds.

Utilization of a THP is defined as the percentage of non-zero pages in
the THP. The worker thread periodically scans through all of physical
memory for anonymous THPs, groups them into buckets based on
utilization, and reports utilization information through debugfs under
/sys/kernel/debug/thp_utilization.

Signed-off-by: Alexander Zhu
---
 Documentation/admin-guide/mm/transhuge.rst |   9 +
 mm/Makefile                                |   2 +-
 mm/thp_utilization.c                       | 206 +++++++++++++++++++++
 3 files changed, 216 insertions(+), 1 deletion(-)
 create mode 100644 mm/thp_utilization.c

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 8ee78ec232eb..21d86303c97e 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -304,6 +304,15 @@ To identify what applications are mapping file transparent huge pages, it
 is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
 for each mapping.
 
+The utilization of transparent hugepages can be viewed by reading
+``/sys/kernel/debug/thp_utilization``. The utilization of a THP is defined
+as the ratio of non zero filled 4kb pages to the total number of pages in a
+THP. The buckets are labelled by the range of total utilized 4kb pages with
+one line per utilization bucket. Each line contains the total number of
+THPs in that bucket and the total number of zero filled 4kb pages summed
+over all THPs in that bucket. The last two lines show the timestamp and
+duration respectively of the most recent scan over all of physical memory.
+
 Note that reading the smaps file is expensive and reading it frequently
 will incur overhead.

diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e29..5f76dc6ce044 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -95,7 +95,7 @@ obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
-obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o thp_utilization.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
 ifdef CONFIG_SWAP

diff --git a/mm/thp_utilization.c b/mm/thp_utilization.c
new file mode 100644
index 000000000000..cdbb7d5c9f39
--- /dev/null
+++ b/mm/thp_utilization.c
@@ -0,0 +1,206 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2022 Meta, Inc.
+ * Authors: Alexander Zhu, Johannes Weiner, Rik van Riel
+ */
+
+#include <linux/debugfs.h>
+#include <linux/highmem.h>
+#include <linux/mm.h>
+/*
+ * The number of utilization buckets THPs will be grouped in
+ * under /sys/kernel/debug/thp_utilization.
+ */
+#define THP_UTIL_BUCKET_NR 10
+/*
+ * The number of hugepages to scan through on each periodic
+ * run of the scanner that generates /sys/kernel/debug/thp_utilization.
+ */
+#define THP_UTIL_SCAN_SIZE 256
+
+static void thp_utilization_workfn(struct work_struct *work);
+static DECLARE_DELAYED_WORK(thp_utilization_work, thp_utilization_workfn);
+
+struct thp_scan_info_bucket {
+	int nr_thps;
+	int nr_zero_pages;
+};
+
+struct thp_scan_info {
+	struct thp_scan_info_bucket buckets[THP_UTIL_BUCKET_NR];
+	struct zone *scan_zone;
+	struct timespec64 last_scan_duration;
+	struct timespec64 last_scan_time;
+	unsigned long pfn;
+};
+
+/*
+ * thp_scan_debugfs is referred to when /sys/kernel/debug/thp_utilization
+ * is opened. thp_scan is used to keep track of the current scan through
+ * physical memory.
+ */
+static struct thp_scan_info thp_scan_debugfs;
+static struct thp_scan_info thp_scan;
+
+#ifdef CONFIG_DEBUG_FS
+static int thp_utilization_show(struct seq_file *seqf, void *pos)
+{
+	int i;
+	int start;
+	int end;
+
+	for (i = 0; i < THP_UTIL_BUCKET_NR; i++) {
+		start = i * HPAGE_PMD_NR / THP_UTIL_BUCKET_NR;
+		end = (i + 1 == THP_UTIL_BUCKET_NR)
+			   ? HPAGE_PMD_NR
+			   : ((i + 1) * HPAGE_PMD_NR / THP_UTIL_BUCKET_NR - 1);
+		/* The last bucket must include the fully-utilized case */
+		seq_printf(seqf, "Utilized[%d-%d]: %d %d\n", start, end,
+			   thp_scan_debugfs.buckets[i].nr_thps,
+			   thp_scan_debugfs.buckets[i].nr_zero_pages);
+	}
+
+	seq_printf(seqf, "Last Scan Time: %lu.%02lus\n",
+		   (unsigned long)thp_scan_debugfs.last_scan_time.tv_sec,
+		   (thp_scan_debugfs.last_scan_time.tv_nsec / (NSEC_PER_SEC / 100)));
+
+	seq_printf(seqf, "Last Scan Duration: %lu.%02lus\n",
+		   (unsigned long)thp_scan_debugfs.last_scan_duration.tv_sec,
+		   (thp_scan_debugfs.last_scan_duration.tv_nsec / (NSEC_PER_SEC / 100)));
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(thp_utilization);
+
+static int __init thp_utilization_debugfs(void)
+{
+	debugfs_create_file("thp_utilization", 0400, NULL, NULL,
+			    &thp_utilization_fops);
+	return 0;
+}
+late_initcall(thp_utilization_debugfs);
+#endif
+
+static int thp_utilization_bucket(int num_utilized_pages)
+{
+	int bucket;
+
+	if (num_utilized_pages < 0 || num_utilized_pages > HPAGE_PMD_NR)
+		return -1;
+
+	/* Group THPs into utilization buckets */
+	bucket = num_utilized_pages * THP_UTIL_BUCKET_NR / HPAGE_PMD_NR;
+	return min(bucket, THP_UTIL_BUCKET_NR - 1);
+}
+
+static int thp_number_utilized_pages(struct folio *folio)
+{
+	int thp_nr_utilized_pages = HPAGE_PMD_NR;
+	void *kaddr;
+	int i;
+	bool zero_page;
+
+	if (!folio || !folio_test_anon(folio) || !folio_test_transhuge(folio))
+		return -1;
+
+	for (i = 0; i < folio_nr_pages(folio); i++) {
+		kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
+		zero_page = !memchr_inv(kaddr, 0, PAGE_SIZE);
+
+		if (zero_page)
+			thp_nr_utilized_pages--;
+
+		kunmap_local(kaddr);
+	}
+
+	return thp_nr_utilized_pages;
+}
+
+static void thp_scan_next_zone(void)
+{
+	struct timespec64 current_time;
+	bool update_debugfs;
+	/*
+	 * THP utilization worker thread has reached the end
+	 * of the memory zone. Proceed to the next zone.
+	 */
+	thp_scan.scan_zone = next_zone(thp_scan.scan_zone);
+	update_debugfs = !thp_scan.scan_zone;
+	thp_scan.scan_zone = update_debugfs ? (first_online_pgdat())->node_zones
+			: thp_scan.scan_zone;
+	thp_scan.pfn = (thp_scan.scan_zone->zone_start_pfn + HPAGE_PMD_NR - 1)
+			& ~(HPAGE_PMD_NR - 1);
+	if (!update_debugfs)
+		return;
+
+	/*
+	 * If the worker has scanned through all of physical memory then
+	 * update information displayed in /sys/kernel/debug/thp_utilization
+	 */
+	ktime_get_ts64(&current_time);
+	thp_scan_debugfs.last_scan_duration = timespec64_sub(current_time,
+						thp_scan_debugfs.last_scan_time);
+	thp_scan_debugfs.last_scan_time = current_time;
+
+	memcpy(&thp_scan_debugfs.buckets, &thp_scan.buckets, sizeof(thp_scan.buckets));
+	memset(&thp_scan.buckets, 0, sizeof(thp_scan.buckets));
+}
+
+static void thp_util_scan(unsigned long pfn_end)
+{
+	struct page *page = NULL;
+	int bucket, num_utilized_pages;
+	unsigned long current_pfn;
+	int i;
+	/*
+	 * Scan through each memory zone in chunks of THP_UTIL_SCAN_SIZE
+	 * hugepages every second, looking for anonymous THPs.
+	 */
+	for (i = 0; i < THP_UTIL_SCAN_SIZE; i++) {
+		current_pfn = thp_scan.pfn;
+		thp_scan.pfn += HPAGE_PMD_NR;
+		if (current_pfn >= pfn_end)
+			return;
+
+		page = pfn_to_online_page(current_pfn);
+		if (!page)
+			continue;
+
+		num_utilized_pages = thp_number_utilized_pages(page_folio(page));
+		bucket = thp_utilization_bucket(num_utilized_pages);
+		if (bucket < 0)
+			continue;
+
+		thp_scan.buckets[bucket].nr_thps++;
+		thp_scan.buckets[bucket].nr_zero_pages += (HPAGE_PMD_NR - num_utilized_pages);
+	}
+}
+
+static void thp_utilization_workfn(struct work_struct *work)
+{
+	unsigned long pfn_end;
+	/*
+	 * Worker function that scans through all of physical memory
+	 * for anonymous THPs.
+	 */
+	if (!thp_scan.scan_zone)
+		thp_scan.scan_zone = (first_online_pgdat())->node_zones;
+
+	pfn_end = zone_end_pfn(thp_scan.scan_zone);
+	/*
+	 * If we have reached the end of the zone or end of physical memory
+	 * move on to the next zone. Otherwise, scan the next PFNs in the
+	 * current zone.
+	 */
+	if (!managed_zone(thp_scan.scan_zone) || thp_scan.pfn >= pfn_end)
+		thp_scan_next_zone();
+	else
+		thp_util_scan(pfn_end);
+
+	schedule_delayed_work(&thp_utilization_work, HZ);
+}
+
+static int __init thp_scan_init(void)
+{
+	schedule_delayed_work(&thp_utilization_work, HZ);
+	return 0;
+}
+subsys_initcall(thp_scan_init);