From patchwork Wed Nov 27 02:57:20 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yuanchu Xie
X-Patchwork-Id: 13886507
Date: Tue, 26 Nov 2024 18:57:20 -0800
In-Reply-To: <20241127025728.3689245-1-yuanchu@google.com>
References: <20241127025728.3689245-1-yuanchu@google.com>
Message-ID: <20241127025728.3689245-2-yuanchu@google.com>
Subject: [PATCH v4 1/9] mm: aggregate workingset information into histograms
From: Yuanchu Xie
To: Andrew Morton, David Hildenbrand, "Aneesh Kumar K.V", Khalid Aziz,
 Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying, Lance Yang,
 Randy Dunlap, Muhammad Usama Anjum
Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
 Greg Kroah-Hartman, "Rafael J. Wysocki", "Michael S. Tsirkin", Jason Wang,
 Xuan Zhuo, Eugenio Pérez, Michal Hocko, Roman Gushchin, Shakeel Butt,
 Muchun Song, Mike Rapoport, Shuah Khan, Christian Brauner, Daniel Watson,
 Yuanchu Xie, cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-kernel@vger.kernel.org, virtualization@lists.linux.dev,
 linux-mm@kvack.org, linux-kselftest@vger.kernel.org

Hierarchically aggregate all memcgs' MGLRU generations and their
page counts into working set page age histograms.
The histograms break down the system's working set per node and
per type (anon/file).

The sysfs interfaces are as follows:

/sys/devices/system/node/nodeX/workingset_report/page_age
	A per-node page age histogram, showing an aggregate of the
	node's lruvecs. The information is extracted from MGLRU's
	per-generation page counters. Reading this file triggers a
	hierarchical aging of all lruvecs, which scans pages and
	creates a new generation in each lruvec. Each line gives a
	bin's upper idle-age bound in milliseconds, followed by the
	anon and file sizes in bytes; the last bin is unbounded.
	For example:
		1000 anon=0 file=0
		2000 anon=0 file=0
		100000 anon=5533696 file=5566464
		18446744073709551615 anon=0 file=0

/sys/devices/system/node/nodeX/workingset_report/page_age_intervals
	A comma-separated list of times in milliseconds that sets the
	bin boundaries the page age histogram uses for aggregation.

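For illustration only, a userspace consumer could parse one line of the
page_age output roughly as sketched here. This is a minimal sketch, not
part of the patch: struct page_age_line and parse_page_age_line() are
hypothetical names, and the byte units are assumed from page_age_show()
multiplying page counts by PAGE_SIZE.

```c
#include <stdio.h>

/* Hypothetical userspace view of one histogram line; not a kernel type. */
struct page_age_line {
	unsigned long long idle_age_ms;	/* upper idle-age bound of the bin */
	unsigned long long anon_bytes;	/* anonymous memory in this bin */
	unsigned long long file_bytes;	/* file-backed memory in this bin */
};

/*
 * Parse a line of the form "<ms> anon=<bytes> file=<bytes>".
 * Returns 0 on success and -1 if the line does not match.
 */
static int parse_page_age_line(const char *line, struct page_age_line *out)
{
	int n = sscanf(line, "%llu anon=%llu file=%llu",
		       &out->idle_age_ms, &out->anon_bytes, &out->file_bytes);

	return n == 3 ? 0 : -1;
}
```

A reader would call this once per line of the file and treat the line
whose bound equals ULLONG_MAX as the final, unbounded bin.
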
Signed-off-by: Yuanchu Xie
---
 drivers/base/node.c               |   6 +
 include/linux/mmzone.h            |   9 +
 include/linux/workingset_report.h |  79 ++++++
 mm/Kconfig                        |   9 +
 mm/Makefile                       |   1 +
 mm/internal.h                     |   5 +
 mm/memcontrol.c                   |   2 +
 mm/mm_init.c                      |   2 +
 mm/mmzone.c                       |   2 +
 mm/vmscan.c                       |  10 +-
 mm/workingset_report.c            | 451 ++++++++++++++++++++++++++++++
 11 files changed, 572 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/workingset_report.h
 create mode 100644 mm/workingset_report.c

diff --git a/drivers/base/node.c b/drivers/base/node.c
index eb72580288e6..ba5b8720dbfa 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -20,6 +20,8 @@
 #include
 #include
 #include
+#include
+#include
 
 static const struct bus_type node_subsys = {
 	.name = "node",
@@ -626,6 +628,7 @@ static int register_node(struct node *node, int num)
 	} else {
 		hugetlb_register_node(node);
 		compaction_register_node(node);
+		wsr_init_sysfs(node);
 	}
 
 	return error;
@@ -642,6 +645,9 @@ void unregister_node(struct node *node)
 {
 	hugetlb_unregister_node(node);
 	compaction_unregister_node(node);
+	wsr_remove_sysfs(node);
+	wsr_destroy_lruvec(mem_cgroup_lruvec(NULL, NODE_DATA(node->dev.id)));
+	wsr_destroy_pgdat(NODE_DATA(node->dev.id));
 	node_remove_accesses(node);
 	node_remove_caches(node);
 	device_unregister(&node->dev);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 80bc5640bb60..ee728c0c5a3b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include
 
 /* Free memory management - zoned buddy allocator. */
 #ifndef CONFIG_ARCH_FORCE_MAX_ORDER
@@ -630,6 +631,9 @@ struct lruvec {
 	struct lru_gen_mm_state mm_state;
 #endif
 #endif /* CONFIG_LRU_GEN */
+#ifdef CONFIG_WORKINGSET_REPORT
+	struct wsr_state wsr;
+#endif /* CONFIG_WORKINGSET_REPORT */
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
 #endif
@@ -1424,6 +1428,11 @@ typedef struct pglist_data {
 	struct lru_gen_memcg memcg_lru;
 #endif
 
+#ifdef CONFIG_WORKINGSET_REPORT
+	struct mutex wsr_update_mutex;
+	struct wsr_report_bins __rcu *wsr_page_age_bins;
+#endif
+
 	CACHELINE_PADDING(_pad2_);
 
 	/* Per-node vmstats */
diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h
new file mode 100644
index 000000000000..d7c2ee14ec87
--- /dev/null
+++ b/include/linux/workingset_report.h
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_WORKINGSET_REPORT_H
+#define _LINUX_WORKINGSET_REPORT_H
+
+#include
+#include
+
+struct mem_cgroup;
+struct pglist_data;
+struct node;
+struct lruvec;
+
+#ifdef CONFIG_WORKINGSET_REPORT
+
+#define WORKINGSET_REPORT_MIN_NR_BINS 2
+#define WORKINGSET_REPORT_MAX_NR_BINS 32
+
+#define WORKINGSET_INTERVAL_MAX ((unsigned long)-1)
+#define ANON_AND_FILE 2
+
+struct wsr_report_bin {
+	unsigned long idle_age;
+	unsigned long nr_pages[ANON_AND_FILE];
+};
+
+struct wsr_report_bins {
+	/* excludes the WORKINGSET_INTERVAL_MAX bin */
+	unsigned long nr_bins;
+	/* last bin contains WORKINGSET_INTERVAL_MAX */
+	unsigned long idle_age[WORKINGSET_REPORT_MAX_NR_BINS];
+	struct rcu_head rcu;
+};
+
+struct wsr_page_age_histo {
+	unsigned long timestamp;
+	struct wsr_report_bin bins[WORKINGSET_REPORT_MAX_NR_BINS];
+};
+
+struct wsr_state {
+	/* breakdown of workingset by page age */
+	struct mutex page_age_lock;
+	struct wsr_page_age_histo *page_age;
+};
+
+void wsr_init_lruvec(struct lruvec *lruvec);
+void wsr_destroy_lruvec(struct lruvec *lruvec);
+void wsr_init_pgdat(struct pglist_data *pgdat);
+void wsr_destroy_pgdat(struct pglist_data *pgdat);
+void wsr_init_sysfs(struct node *node);
+void wsr_remove_sysfs(struct node *node);
+
+/*
+ * Returns true if the wsr is configured to be refreshed.
+ * The next refresh time is stored in refresh_time.
+ */
+bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
+			struct pglist_data *pgdat);
+#else
+static inline void wsr_init_lruvec(struct lruvec *lruvec)
+{
+}
+static inline void wsr_destroy_lruvec(struct lruvec *lruvec)
+{
+}
+static inline void wsr_init_pgdat(struct pglist_data *pgdat)
+{
+}
+static inline void wsr_destroy_pgdat(struct pglist_data *pgdat)
+{
+}
+static inline void wsr_init_sysfs(struct node *node)
+{
+}
+static inline void wsr_remove_sysfs(struct node *node)
+{
+}
+#endif /* CONFIG_WORKINGSET_REPORT */
+
+#endif /* _LINUX_WORKINGSET_REPORT_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 84000b016808..be949786796d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1301,6 +1301,15 @@ config ARCH_HAS_USER_SHADOW_STACK
 	  The architecture has hardware support for userspace shadow call
 	  stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
 
+config WORKINGSET_REPORT
+	bool "Working set reporting"
+	depends on LRU_GEN && SYSFS
+	help
+	  Report system and per-memcg working set to userspace.
+
+	  This option exports stats and events giving the user more insight
+	  into its memory working set.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index d5639b036166..f5ef0768253a 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -98,6 +98,7 @@ obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
+obj-$(CONFIG_WORKINGSET_REPORT) += workingset_report.o
 ifdef CONFIG_SWAP
 obj-$(CONFIG_MEMCG) += swap_cgroup.o
 endif
diff --git a/mm/internal.h b/mm/internal.h
index 64c2eb0b160e..bbd3c1501bac 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -470,9 +470,14 @@ extern unsigned long highest_memmap_pfn;
 /*
  * in mm/vmscan.c:
  */
+struct scan_control;
+bool isolate_lru_page(struct page *page);
 bool folio_isolate_lru(struct folio *folio);
 void folio_putback_lru(struct folio *folio);
 extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason);
+bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq, bool can_swap,
+			bool force_scan);
+void set_task_reclaim_state(struct task_struct *task, struct reclaim_state *rs);
 
 /*
  * in mm/rmap.c:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 53db98d2c4a1..096856b35fbc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@
 #include
 #include
 #include
+#include
 #include "internal.h"
 #include
 #include
@@ -3453,6 +3454,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 	if (!pn)
 		return;
 
+	wsr_destroy_lruvec(&pn->lruvec);
 	free_percpu(pn->lruvec_stats_percpu);
 	kfree(pn->lruvec_stats);
 	kfree(pn);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 4ba5607aaf19..b4f7c904ce33 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -30,6 +30,7 @@
 #include
 #include
 #include
+#include
 #include "internal.h"
 #include "slab.h"
 #include "shuffle.h"
@@ -1378,6 +1379,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 
 	pgdat_page_ext_init(pgdat);
 	lruvec_init(&pgdat->__lruvec);
+	wsr_init_pgdat(pgdat);
 }
 
 static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx, int nid,
diff --git a/mm/mmzone.c b/mm/mmzone.c
index f9baa8882fbf..0352a2018067 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -90,6 +90,8 @@ void lruvec_init(struct lruvec *lruvec)
 	 */
 	list_del(&lruvec->lists[LRU_UNEVICTABLE]);
 
+	wsr_init_lruvec(lruvec);
+
 	lru_gen_init_lruvec(lruvec);
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 28ba2b06fc7d..89da4d8dfb5f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -57,6 +57,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -271,8 +272,7 @@ static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
 }
 #endif
 
-static void set_task_reclaim_state(struct task_struct *task,
-				   struct reclaim_state *rs)
+void set_task_reclaim_state(struct task_struct *task, struct reclaim_state *rs)
 {
 	/* Check for an overwrite */
 	WARN_ON_ONCE(rs && task->reclaim_state);
@@ -3861,8 +3861,8 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq,
 	return success;
 }
 
-static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
-			       bool can_swap, bool force_scan)
+bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq, bool can_swap,
+			bool force_scan)
 {
 	bool success;
 	struct lru_gen_mm_walk *walk;
@@ -5640,6 +5640,8 @@ static int __init init_lru_gen(void)
 	if (sysfs_create_group(mm_kobj, &lru_gen_attr_group))
 		pr_err("lru_gen: failed to create sysfs group\n");
 
+	wsr_init_sysfs(NULL);
+
 	debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_rw_fops);
 	debugfs_create_file("lru_gen_full", 0444, NULL, NULL, &lru_gen_ro_fops);
 
diff --git a/mm/workingset_report.c b/mm/workingset_report.c
new file mode 100644
index 000000000000..a4dcf62fcd96
--- /dev/null
+++ b/mm/workingset_report.c
@@ -0,0 +1,451 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "internal.h"
+
+void wsr_init_pgdat(struct pglist_data *pgdat)
+{
+	mutex_init(&pgdat->wsr_update_mutex);
+	RCU_INIT_POINTER(pgdat->wsr_page_age_bins, NULL);
+}
+
+void wsr_destroy_pgdat(struct pglist_data *pgdat)
+{
+	struct wsr_report_bins __rcu *bins;
+
+	mutex_lock(&pgdat->wsr_update_mutex);
+	bins = rcu_replace_pointer(pgdat->wsr_page_age_bins, NULL,
+				   lockdep_is_held(&pgdat->wsr_update_mutex));
+	kfree_rcu(bins, rcu);
+	mutex_unlock(&pgdat->wsr_update_mutex);
+	mutex_destroy(&pgdat->wsr_update_mutex);
+}
+
+void wsr_init_lruvec(struct lruvec *lruvec)
+{
+	struct wsr_state *wsr = &lruvec->wsr;
+
+	memset(wsr, 0, sizeof(*wsr));
+	mutex_init(&wsr->page_age_lock);
+}
+
+void wsr_destroy_lruvec(struct lruvec *lruvec)
+{
+	struct wsr_state *wsr = &lruvec->wsr;
+
+	mutex_destroy(&wsr->page_age_lock);
+	kfree(wsr->page_age);
+	memset(wsr, 0, sizeof(*wsr));
+}
+
+static int workingset_report_intervals_parse(char *src,
+					     struct wsr_report_bins *bins)
+{
+	int err = 0, i = 0;
+	char *cur, *next = strim(src);
+
+	if (*next == '\0')
+		return 0;
+
+	while ((cur = strsep(&next, ","))) {
+		unsigned int interval;
+
+		err = kstrtouint(cur, 0, &interval);
+		if (err)
+			goto out;
+
+		bins->idle_age[i] = msecs_to_jiffies(interval);
+		if (i > 0 && bins->idle_age[i] <= bins->idle_age[i - 1]) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		if (++i == WORKINGSET_REPORT_MAX_NR_BINS) {
+			err = -ERANGE;
+			goto out;
+		}
+	}
+
+	if (i && i < WORKINGSET_REPORT_MIN_NR_BINS - 1) {
+		err = -ERANGE;
+		goto out;
+	}
+
+	bins->nr_bins = i;
+	bins->idle_age[i] = WORKINGSET_INTERVAL_MAX;
+out:
+	return err ?: i;
+}
+
+static unsigned long get_gen_start_time(const struct lru_gen_folio *lrugen,
+					unsigned long seq,
+					unsigned long max_seq,
+					unsigned long curr_timestamp)
+{
+	int younger_gen;
+
+	if (seq == max_seq)
+		return curr_timestamp;
+	younger_gen = lru_gen_from_seq(seq + 1);
+	return READ_ONCE(lrugen->timestamps[younger_gen]);
+}
+
+static void collect_page_age_type(const struct lru_gen_folio *lrugen,
+				  struct wsr_report_bin *bin,
+				  unsigned long max_seq, unsigned long min_seq,
+				  unsigned long curr_timestamp, int type)
+{
+	unsigned long seq;
+
+	for (seq = max_seq; seq + 1 > min_seq; seq--) {
+		int gen, zone;
+		unsigned long gen_end, gen_start, size = 0;
+
+		gen = lru_gen_from_seq(seq);
+
+		for (zone = 0; zone < MAX_NR_ZONES; zone++)
+			size += max(
+				READ_ONCE(lrugen->nr_pages[gen][type][zone]),
+				0L);
+
+		gen_start = get_gen_start_time(lrugen, seq, max_seq,
+					       curr_timestamp);
+		gen_end = READ_ONCE(lrugen->timestamps[gen]);
+
+		while (bin->idle_age != WORKINGSET_INTERVAL_MAX &&
+		       time_before(gen_end + bin->idle_age, curr_timestamp)) {
+			unsigned long gen_in_bin = (long)gen_start -
+						   (long)curr_timestamp +
+						   (long)bin->idle_age;
+			unsigned long gen_len = (long)gen_start - (long)gen_end;
+
+			if (!gen_len)
+				break;
+			if (gen_in_bin) {
+				unsigned long split_bin =
+					size / gen_len * gen_in_bin;
+
+				bin->nr_pages[type] += split_bin;
+				size -= split_bin;
+			}
+			gen_start = curr_timestamp - bin->idle_age;
+			bin++;
+		}
+		bin->nr_pages[type] += size;
+	}
+}
+
+/*
+ * proportionally aggregate Multi-gen LRU bins into a working set report
+ * MGLRU generations:
+ * current time
+ * |         max_seq timestamp
+ * |         |     max_seq - 1 timestamp
+ * |         |     |               unbounded
+ * |         |     |               |
+ * --------------------------------
+ * | max_seq | ... | ... | min_seq
+ * --------------------------------
+ *
+ * Bins:
+ *
+ * current time
+ * |       current - idle_age[0]
+ * |       |     current - idle_age[1]
+ * |       |     |               unbounded
+ * |       |     |               |
+ * ------------------------------
+ * | bin 0 | ... | ... | bin n-1
+ * ------------------------------
+ *
+ * Assume the heuristic that pages are in the MGLRU generation
+ * through uniform accesses, so we can aggregate them
+ * proportionally into bins.
+ */
+static void collect_page_age(struct wsr_page_age_histo *page_age,
+			     const struct lruvec *lruvec)
+{
+	int type;
+	const struct lru_gen_folio *lrugen = &lruvec->lrugen;
+	unsigned long curr_timestamp = jiffies;
+	unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
+	unsigned long min_seq[ANON_AND_FILE] = {
+		READ_ONCE(lruvec->lrugen.min_seq[LRU_GEN_ANON]),
+		READ_ONCE(lruvec->lrugen.min_seq[LRU_GEN_FILE]),
+	};
+	struct wsr_report_bin *bin = &page_age->bins[0];
+
+	for (type = 0; type < ANON_AND_FILE; type++)
+		collect_page_age_type(lrugen, bin, max_seq, min_seq[type],
+				      curr_timestamp, type);
+}
+
+/* First step: hierarchically scan child memcgs. */
+static void refresh_scan(struct wsr_state *wsr, struct mem_cgroup *root,
+			 struct pglist_data *pgdat)
+{
+	struct mem_cgroup *memcg;
+	unsigned int flags;
+	struct reclaim_state rs = { 0 };
+
+	set_task_reclaim_state(current, &rs);
+	flags = memalloc_noreclaim_save();
+
+	memcg = mem_cgroup_iter(root, NULL, NULL);
+	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+		unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
+
+		/*
+		 * setting can_swap=true and force_scan=true ensures
+		 * proper workingset stats when the system cannot swap.
+		 */
+		try_to_inc_max_seq(lruvec, max_seq, true, true);
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
+
+	memalloc_noreclaim_restore(flags);
+	set_task_reclaim_state(current, NULL);
+}
+
+/* Second step: aggregate child memcgs into the page age histogram. */
+static void refresh_aggregate(struct wsr_page_age_histo *page_age,
+			      struct mem_cgroup *root,
+			      struct pglist_data *pgdat)
+{
+	struct mem_cgroup *memcg;
+	struct wsr_report_bin *bin;
+
+	for (bin = page_age->bins;
+	     bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++) {
+		bin->nr_pages[0] = 0;
+		bin->nr_pages[1] = 0;
+	}
+	/* the last used bin has idle_age == WORKINGSET_INTERVAL_MAX. */
+	bin->nr_pages[0] = 0;
+	bin->nr_pages[1] = 0;
+
+	memcg = mem_cgroup_iter(root, NULL, NULL);
+	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+		collect_page_age(page_age, lruvec);
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
+	WRITE_ONCE(page_age->timestamp, jiffies);
+}
+
+static void copy_node_bins(struct pglist_data *pgdat,
+			   struct wsr_page_age_histo *page_age)
+{
+	struct wsr_report_bins *node_page_age_bins;
+	int i = 0;
+
+	rcu_read_lock();
+	node_page_age_bins = rcu_dereference(pgdat->wsr_page_age_bins);
+	if (!node_page_age_bins)
+		goto nocopy;
+	for (i = 0; i < node_page_age_bins->nr_bins; ++i)
+		page_age->bins[i].idle_age = node_page_age_bins->idle_age[i];
+
+nocopy:
+	page_age->bins[i].idle_age = WORKINGSET_INTERVAL_MAX;
+	rcu_read_unlock();
+}
+
+bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
+			struct pglist_data *pgdat)
+{
+	struct wsr_page_age_histo *page_age;
+
+	if (!READ_ONCE(wsr->page_age))
+		return false;
+
+	refresh_scan(wsr, root, pgdat);
+	mutex_lock(&wsr->page_age_lock);
+	page_age = READ_ONCE(wsr->page_age);
+	if (page_age) {
+		copy_node_bins(pgdat, page_age);
+		refresh_aggregate(page_age, root, pgdat);
+	}
+	mutex_unlock(&wsr->page_age_lock);
+	return !!page_age;
+}
+EXPORT_SYMBOL_GPL(wsr_refresh_report);
+
+static struct pglist_data *kobj_to_pgdat(struct kobject *kobj)
+{
+	int nid = IS_ENABLED(CONFIG_NUMA) ? kobj_to_dev(kobj)->id :
+					    first_memory_node;
+
+	return NODE_DATA(nid);
+}
+
+static struct wsr_state *kobj_to_wsr(struct kobject *kobj)
+{
+	return &mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj))->wsr;
+}
+
+static ssize_t page_age_intervals_show(struct kobject *kobj,
+				       struct kobj_attribute *attr, char *buf)
+{
+	struct wsr_report_bins *bins;
+	int len = 0;
+	struct pglist_data *pgdat = kobj_to_pgdat(kobj);
+
+	rcu_read_lock();
+	bins = rcu_dereference(pgdat->wsr_page_age_bins);
+	if (bins) {
+		int i;
+		int nr_bins = bins->nr_bins;
+
+		for (i = 0; i < bins->nr_bins; ++i) {
+			len += sysfs_emit_at(
+				buf, len, "%u",
+				jiffies_to_msecs(bins->idle_age[i]));
+			if (i + 1 < nr_bins)
+				len += sysfs_emit_at(buf, len, ",");
+		}
+	}
+	len += sysfs_emit_at(buf, len, "\n");
+	rcu_read_unlock();
+
+	return len;
+}
+
+static ssize_t page_age_intervals_store(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					const char *src, size_t len)
+{
+	struct wsr_report_bins *bins = NULL, __rcu *old;
+	char *buf = NULL;
+	int err = 0;
+	struct pglist_data *pgdat = kobj_to_pgdat(kobj);
+
+	buf = kstrdup(src, GFP_KERNEL);
+	if (!buf) {
+		err = -ENOMEM;
+		goto failed;
+	}
+
+	bins =
+		kzalloc(sizeof(struct wsr_report_bins), GFP_KERNEL);
+
+	if (!bins) {
+		err = -ENOMEM;
+		goto failed;
+	}
+
+	err = workingset_report_intervals_parse(buf, bins);
+	if (err < 0)
+		goto failed;
+
+	if (err == 0) {
+		kfree(bins);
+		bins = NULL;
+	}
+
+	mutex_lock(&pgdat->wsr_update_mutex);
+	old = rcu_replace_pointer(pgdat->wsr_page_age_bins, bins,
+				  lockdep_is_held(&pgdat->wsr_update_mutex));
+	mutex_unlock(&pgdat->wsr_update_mutex);
+	kfree_rcu(old, rcu);
+	kfree(buf);
+	return len;
+failed:
+	kfree(bins);
+	kfree(buf);
+
+	return err;
+}
+
+static struct kobj_attribute page_age_intervals_attr =
+	__ATTR_RW(page_age_intervals);
+
+static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr,
+			     char *buf)
+{
+	struct wsr_report_bin *bin;
+	int ret = 0;
+	struct wsr_state *wsr = kobj_to_wsr(kobj);
+
+
+	mutex_lock(&wsr->page_age_lock);
+	if (!wsr->page_age)
+		wsr->page_age =
+			kzalloc(sizeof(struct wsr_page_age_histo), GFP_KERNEL);
+	mutex_unlock(&wsr->page_age_lock);
+
+	wsr_refresh_report(wsr, NULL, kobj_to_pgdat(kobj));
+
+	mutex_lock(&wsr->page_age_lock);
+	if (!wsr->page_age)
+		goto unlock;
+	for (bin = wsr->page_age->bins;
+	     bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++)
+		ret += sysfs_emit_at(buf, ret, "%u anon=%lu file=%lu\n",
+				     jiffies_to_msecs(bin->idle_age),
+				     bin->nr_pages[0] * PAGE_SIZE,
+				     bin->nr_pages[1] * PAGE_SIZE);
+
+	ret += sysfs_emit_at(buf, ret, "%lu anon=%lu file=%lu\n",
+			     WORKINGSET_INTERVAL_MAX,
+			     bin->nr_pages[0] * PAGE_SIZE,
+			     bin->nr_pages[1] * PAGE_SIZE);
+
+unlock:
+	mutex_unlock(&wsr->page_age_lock);
+	return ret;
+}
+
+static struct kobj_attribute page_age_attr = __ATTR_RO(page_age);
+
+static struct attribute *workingset_report_attrs[] = {
+	&page_age_intervals_attr.attr, &page_age_attr.attr, NULL
+};
+
+static const struct attribute_group workingset_report_attr_group = {
+	.name = "workingset_report",
+	.attrs = workingset_report_attrs,
+};
+
+void wsr_init_sysfs(struct node *node)
+{
+	struct kobject *kobj = node ? &node->dev.kobj : mm_kobj;
+	struct wsr_state *wsr;
+
+	if (IS_ENABLED(CONFIG_NUMA) && !node)
+		return;
+
+	wsr = kobj_to_wsr(kobj);
+
+	if (sysfs_create_group(kobj, &workingset_report_attr_group))
+		pr_warn("Workingset report failed to create sysfs files\n");
+}
+EXPORT_SYMBOL_GPL(wsr_init_sysfs);
+
+void wsr_remove_sysfs(struct node *node)
+{
+	struct kobject *kobj = &node->dev.kobj;
+	struct wsr_state *wsr;
+
+	if (IS_ENABLED(CONFIG_NUMA) && !node)
+		return;
+
+	wsr = kobj_to_wsr(kobj);
+	sysfs_remove_group(kobj, &workingset_report_attr_group);
+}
+EXPORT_SYMBOL_GPL(wsr_remove_sysfs);
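
As a reading aid, the proportional split that collect_page_age_type()
performs for a single generation can be sketched in plain userspace C.
This is only an illustration under simplifying assumptions, not kernel
code: struct sketch_bin and split_generation() are hypothetical names,
timestamps are plain tick counts with no jiffies wraparound, and
gen_end <= gen_start <= now is assumed to hold.

```c
#include <stddef.h>

#define SKETCH_INTERVAL_MAX ((unsigned long)-1)

/* Hypothetical stand-in for struct wsr_report_bin, one page type only. */
struct sketch_bin {
	unsigned long idle_age;	/* upper idle-age bound; MAX marks the last bin */
	unsigned long nr_pages;
};

/*
 * Spread `size` pages of one generation across bins, assuming the pages
 * are uniformly distributed over [gen_end, gen_start]. gen_start is the
 * younger edge of the generation and gen_end the older edge. Returns
 * the bin that received the remainder, so that a caller can continue
 * with the next older generation from there.
 */
static struct sketch_bin *split_generation(struct sketch_bin *bin,
					   unsigned long size,
					   unsigned long gen_start,
					   unsigned long gen_end,
					   unsigned long now)
{
	/* Advance while the generation extends past this bin's bound. */
	while (bin->idle_age != SKETCH_INTERVAL_MAX &&
	       gen_end + bin->idle_age < now) {
		/* Part of the generation that overlaps this bin. */
		unsigned long gen_in_bin = gen_start + bin->idle_age - now;
		unsigned long gen_len = gen_start - gen_end;

		if (!gen_len)
			break;
		if (gen_in_bin) {
			/* Divide first, as the patch does; this truncates. */
			unsigned long part = size / gen_len * gen_in_bin;

			bin->nr_pages += part;
			size -= part;
		}
		/* The rest of the generation starts at this bin's edge. */
		gen_start = now - bin->idle_age;
		bin++;
	}
	bin->nr_pages += size;
	return bin;
}
```

For instance, with now = 1000, a generation spanning ticks 0 to 1000
that holds 1000 pages, and bin bounds of 100 and 500 ticks, the sketch
puts 100 pages in the first bin, 400 in the second, and the remaining
500 in the unbounded bin.
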