Date: Tue, 26 Nov 2024 18:57:19 -0800
Message-ID: <20241127025728.3689245-1-yuanchu@google.com>
Subject: [PATCH v4 0/9] mm: workingset reporting
From: Yuanchu Xie
To: Andrew Morton, David Hildenbrand, "Aneesh Kumar K.V", Khalid Aziz,
    Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying,
    Lance Yang, Randy Dunlap, Muhammad Usama Anjum
Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
    Greg Kroah-Hartman, "Rafael J. Wysocki", "Michael S. Tsirkin",
    Jason Wang, Xuan Zhuo, Eugenio Pérez, Michal Hocko, Roman Gushchin,
    Shakeel Butt, Muchun Song, Mike Rapoport, Shuah Khan,
    Christian Brauner, Daniel Watson, Yuanchu Xie,
    cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org, virtualization@lists.linux.dev,
    linux-mm@kvack.org, linux-kselftest@vger.kernel.org

This patch series provides workingset reporting of user pages in
lruvecs, whose coldness can be tracked by accessed bits and fd
references. However, the concept of workingset applies generically to
all types of memory, which could be kernel slab caches, discardable
userspace caches (databases), or CXL.mem. Therefore, data sources might
come from slab shrinkers, device drivers, or userspace.
Another interesting idea might be hugepage workingset, so that we can
measure the proportion of hugepages backing cold memory. However, with
architectures like arm, there may be too many hugepage sizes, leading to
a combinatorial explosion when exporting stats to userspace.
Nonetheless, the kernel should provide a set of workingset interfaces
that is generic enough to accommodate the various use cases, and
extensible to potential future use cases.

Use cases
==========
Job scheduling
On overcommitted hosts, workingset information improves efficiency and
reliability by allowing the job scheduler to have better stats on the
exact memory requirements of each job. This can manifest in efficiency
by landing more jobs on the same host or NUMA node. On the other hand,
the job scheduler can also ensure each node has a sufficient amount of
memory and does not enter direct reclaim or the kernel OOM path. With
workingset information and job priority, the userspace OOM killing or
proactive reclaim policy can kick in before the system is under memory
pressure. If the job shape is very different from the machine shape,
knowing the workingset per-node can also help inform page allocation
policies.

Proactive reclaim
Workingset information allows a container manager to proactively
reclaim memory while not impacting a job's performance. While PSI may
provide a reactive measure of when a proactive reclaim has reclaimed too
much, workingset reporting allows the policy to be more accurate and
flexible.

Ballooning (similar to proactive reclaim)
The last patch of the series extends the virtio-balloon device to report
the guest workingset.
Balloon policies benefit from workingset to more precisely determine the
size of the memory balloon. On end-user devices where memory is scarce
and overcommitted, the balloon sizing in multiple VMs running on the
same device can be orchestrated with workingset reports from each one.
On the server side, workingset reporting allows the balloon controller
to inflate the balloon without causing too much file cache to be
reclaimed in the guest.

Promotion/Demotion
If different mechanisms are used for promotion and demotion, workingset
information can help connect the two and avoid pages being migrated back
and forth.
For example, consider a promotion hot page threshold defined as a
reaccess distance of N seconds (promote pages accessed more often than
every N seconds).
The threshold N should be set so that ~80% (e.g.) of pages on the fast
memory node pass the threshold. This calculation can be done with
workingset reports.
To be directly useful for promotion policies, the workingset report
interfaces need to be extended to report hotness and gather hotness
information from the devices[1].

[1]
https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1

Sysfs and Cgroup Interfaces
==========
The interfaces are detailed in the patches that introduce them. The main
idea here is that we break down the workingset per-node per-memcg into
time intervals (ms), e.g.

1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892

Implementation
==========
The reporting of user pages is based on MGLRU, and therefore requires
CONFIG_LRU_GEN=y. We would benefit from more MGLRU generations for a
more fine-grained workingset report, but we can already gather a lot of
data with just four generations. The workingset reporting mechanism is
gated behind CONFIG_WORKINGSET_REPORT, and the aging thread is behind
CONFIG_WORKINGSET_REPORT_AGING.

Benchmarks
==========
Ghait Ouled Amar Ben Cheikh implemented a simple policy and ran the
Linux compile and redis benchmarks from openbenchmarking.org. The policy
and runner are together referred to as WMO (Workload Memory
Optimization).
The results were based on v3 of the series, but v4 doesn't change the
core of the working set reporting and just adds the ballooning
counterpart.

The timed Linux kernel compilation benchmark shows improvements in peak
memory usage with a policy of "swap out all bytes colder than 10 seconds
every 40 seconds". A swapfile is configured on SSD.
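
A policy like the one quoted above needs to know how much memory is
older than a given age. The following is a minimal userspace sketch of
that calculation, using the example report from the "Sysfs and Cgroup
Interfaces" section as input. It is not the WMO implementation: the
helper names parse_report() and colder_than() are hypothetical, and the
sketch assumes that the leading number on each line is the upper bound
(in ms) of a page-age interval whose lower bound is the previous line's
value. The interface documentation in the patches is authoritative on
both the format and the units of the anon/file amounts.

```python
import re

# Line format taken from the example report in this cover letter:
#   "<interval upper bound in ms> anon=<amount> file=<amount>"
LINE_RE = re.compile(r"^(\d+)\s+anon=(\d+)\s+file=(\d+)$")


def parse_report(text):
    """Return a list of (upper_bound_ms, anon, file) tuples in report order."""
    bins = []
    for line in text.strip().splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            raise ValueError(f"unrecognized report line: {line!r}")
        bins.append(tuple(int(group) for group in match.groups()))
    return bins


def colder_than(bins, threshold_ms):
    """Sum anon and file amounts in bins lying entirely at or beyond threshold_ms.

    Assumption: each bin covers the ages between the previous bin's upper
    bound and its own. A bin that straddles the threshold is skipped, so
    the result is a conservative (lower) estimate of cold memory.
    """
    anon = file = 0
    lower = 0
    for upper, bin_anon, bin_file in bins:
        if lower >= threshold_ms:
            anon += bin_anon
            file += bin_file
        lower = upper
    return anon, file


EXAMPLE = """\
1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892
"""

bins = parse_report(EXAMPLE)
cold_anon, cold_file = colder_than(bins, 10_000)
print(cold_anon, cold_file)
```

Skipping the bin that straddles the threshold keeps the estimate
conservative. A real policy would more likely choose its threshold to
coincide with a configured interval boundary, so that no bin straddles
it.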
--------------------------------------------
peak memory usage (with WMO): 4982.61328 MiB
peak memory usage (control): 9569.1367 MiB
peak memory reduction: 47.9%
--------------------------------------------
Benchmark                                           | Experimental     | Control        | Experimental_Std_Dev | Control_Std_Dev
Timed Linux Kernel Compilation - allmodconfig (sec) | 708.486 (95.91%) | 679.499 (100%) | 0.6%                 | 0.1%
--------------------------------------------
Seconds, fewer is better

The redis benchmark employs the same policy:

--------------------------------------------
peak memory usage (with WMO): 375.9023 MiB
peak memory usage (control): 509.765 MiB
peak memory reduction: 26%
--------------------------------------------
Benchmark                | Experimental     | Control          | Experimental_Std_Dev | Control_Std_Dev
Redis - LPOP (Reqs/sec)  | 2023130 (98.22%) | 2059849 (100%)   | 1.2%                 | 2%
Redis - SADD (Reqs/sec)  | 2539662 (98.63%) | 2574811 (100%)   | 2.3%                 | 1.4%
Redis - LPUSH (Reqs/sec) | 2024880 (100%)   | 2000884 (98.81%) | 1.1%                 | 0.8%
Redis - GET (Reqs/sec)   | 2835764 (100%)   | 2763722 (97.46%) | 2.7%                 | 1.6%
Redis - SET (Reqs/sec)   | 2340723 (100%)   | 2327372 (99.43%) | 2.4%                 | 1.8%
--------------------------------------------
Reqs/sec, more is better

The detailed report and benchmarking results are in Ghait's repo:
https://github.com/miloudi98/WMO

Changelog
==========
Changes from PATCH v3 -> v4:
- Added documentation for cgroup-v2
  (Waiman Long)
- Fixed types in documentation
  (Randy Dunlap)
- Added implementation for the ballooning use case
- Added detailed description of benchmark results
  (Andrew Morton)

Changes from PATCH v2 -> v3:
- Fixed typos in commit messages and documentation
  (Lance Yang, Randy Dunlap)
- Split out the force_scan patch to be reviewed separately
- Added benchmarks from Ghait Ouled Amar Ben Cheikh
- Fixed reported compile error without CONFIG_MEMCG

Changes from PATCH v1 -> v2:
- Updated selftest to use ksft_test_result_code instead of switch-case
  (Muhammad Usama Anjum)
- Included more use cases
  in the cover letter (Huang, Ying)
- Added documentation for sysfs and memcg interfaces
- Added an aging-specific struct lru_gen_mm_walk in struct pglist_data
  to avoid allocating for each lruvec.

[v1] https://lore.kernel.org/linux-mm/20240504073011.4000534-1-yuanchu@google.com/
[v2] https://lore.kernel.org/linux-mm/20240604020549.1017540-1-yuanchu@google.com/
[v3] https://lore.kernel.org/linux-mm/20240813165619.748102-1-yuanchu@google.com/

Yuanchu Xie (9):
  mm: aggregate workingset information into histograms
  mm: use refresh interval to rate-limit workingset report aggregation
  mm: report workingset during memory pressure driven scanning
  mm: extend workingset reporting to memcgs
  mm: add kernel aging thread for workingset reporting
  selftest: test system-wide workingset reporting
  Docs/admin-guide/mm/workingset_report: document sysfs and memcg
    interfaces
  Docs/admin-guide/cgroup-v2: document workingset reporting
  virtio-balloon: add workingset reporting

 Documentation/admin-guide/cgroup-v2.rst       |  35 +
 Documentation/admin-guide/mm/index.rst        |   1 +
 .../admin-guide/mm/workingset_report.rst      | 105 +++
 drivers/base/node.c                           |   6 +
 drivers/virtio/virtio_balloon.c               | 390 ++++++++++-
 include/linux/balloon_compaction.h            |   1 +
 include/linux/memcontrol.h                    |  21 +
 include/linux/mmzone.h                        |  13 +
 include/linux/workingset_report.h             | 167 +++++
 include/uapi/linux/virtio_balloon.h           |  30 +
 mm/Kconfig                                    |  15 +
 mm/Makefile                                   |   2 +
 mm/internal.h                                 |  19 +
 mm/memcontrol.c                               | 162 ++++-
 mm/mm_init.c                                  |   2 +
 mm/mmzone.c                                   |   2 +
 mm/vmscan.c                                   |  56 +-
 mm/workingset_report.c                        | 653 ++++++++++++++++++
 mm/workingset_report_aging.c                  | 127 ++++
 tools/testing/selftests/mm/.gitignore         |   1 +
 tools/testing/selftests/mm/Makefile           |   3 +
 tools/testing/selftests/mm/run_vmtests.sh     |   5 +
 .../testing/selftests/mm/workingset_report.c  | 306 ++++++++
 .../testing/selftests/mm/workingset_report.h  |  39 ++
 .../selftests/mm/workingset_report_test.c     | 330 +++++++++
 25 files changed, 2482 insertions(+), 9 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/workingset_report.rst
 create mode 100644 include/linux/workingset_report.h
 create mode 100644 mm/workingset_report.c
 create mode 100644 mm/workingset_report_aging.c
 create mode 100644 tools/testing/selftests/mm/workingset_report.c
 create mode 100644 tools/testing/selftests/mm/workingset_report.h
 create mode 100644 tools/testing/selftests/mm/workingset_report_test.c