From patchwork Tue Aug 13 16:56:11 2024
X-Patchwork-Submitter: Yuanchu Xie
X-Patchwork-Id: 13762343
Date: Tue, 13 Aug 2024 09:56:11 -0700
Message-ID: <20240813165619.748102-1-yuanchu@google.com>
Subject: [PATCH v3 0/7] mm: workingset reporting
From: Yuanchu Xie
To: David Hildenbrand, "Aneesh Kumar K.V", Khalid Aziz, Henry Huang,
 Yu Zhao, Dan Williams, Gregory Price, Huang Ying, Andrew Morton,
 Lance Yang, Randy Dunlap, Muhammad Usama Anjum
Cc: Kalesh Singh, Wei Xu, David Rientjes, Greg Kroah-Hartman,
 "Rafael J. Wysocki", Johannes Weiner, Michal Hocko, Roman Gushchin,
 Muchun Song, Shuah Khan, Yosry Ahmed, Matthew Wilcox,
 Sudarshan Rajagopalan, Kairui Song, "Michael S. Tsirkin", Vasily Averin,
 Nhat Pham, Miaohe Lin, Qi Zheng, Abel Wu, "Vishal Moola (Oracle)",
 Kefeng Wang, Yuanchu Xie, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, cgroups@vger.kernel.org,
 linux-kselftest@vger.kernel.org

Changes from PATCH v2 -> v3:
- Fixed typos in commit messages and documentation
  (Lance Yang, Randy Dunlap)
- Split out the force_scan patch to be reviewed separately
- Added benchmarks from Ghait Ouled Amar Ben Cheikh
- Fixed reported compile error without CONFIG_MEMCG

Changes from PATCH v1 -> v2:
- Updated selftest to use ksft_test_result_code instead of switch-case
  (Muhammad Usama Anjum)
- Included more use cases in the cover letter
  (Huang, Ying)
- Added documentation for sysfs and memcg interfaces
- Added an aging-specific struct lru_gen_mm_walk in struct pglist_data
  to avoid allocating for each lruvec.

Changes from RFC v3 -> PATCH v1:
- Updated selftest to use ksft_print_msg instead of fprintf(stderr, ...)
  (Muhammad Usama Anjum)
- Included more detail in patch skipping pmd_young with force_scan
  (Huang, Ying)
- Deferred reaccess histogram as a followup
- Removed per-memcg page age interval configs for simplicity

Changes from RFC v2 -> RFC v3:
- Update to v6.8
- Added an aging kernel thread (gated behind config)
- Added basic selftests for sysfs interface files
- Track swapped out pages for reaccesses
- Refactoring and cleanup
- Dropped the virtio-balloon extension to make things manageable

Changes from RFC v1 -> RFC v2:
- Refactored the patches into smaller pieces
- Renamed interfaces and functions from wss to wsr (Working Set Reporting)
- Fixed build errors when CONFIG_WSR is not set
- Changed working_set_num_bins to u8 for virtio-balloon
- Added support for per-NUMA node reporting for virtio-balloon

[rfc v1]
https://lore.kernel.org/linux-mm/20230509185419.1088297-1-yuanchu@google.com/
[rfc v2]
https://lore.kernel.org/linux-mm/20230621180454.973862-1-yuanchu@google.com/
[rfc v3]
https://lore.kernel.org/linux-mm/20240327213108.2384666-1-yuanchu@google.com/

This patch series provides workingset reporting of user pages in
lruvecs, whose coldness can be tracked by accessed bits and fd
references. However, the concept of workingset applies generically to
all types of memory, which could be kernel slab caches, discardable
userspace caches (databases), or CXL.mem. Therefore, data sources might
come from slab shrinkers, device drivers, or userspace.
IMO, the kernel should provide a set of workingset interfaces that are
generic enough to accommodate the various use cases and extensible to
potential future use cases. The currently proposed interfaces are not
sufficient in that regard, but I would like to start somewhere, solicit
feedback, and iterate.

Use cases
==========
Job scheduling
On overcommitted hosts, workingset information allows the job scheduler
to right-size each job and land more jobs on the same host or NUMA node.
In the case of a job with an increasing workingset, policy decisions can
be made to migrate other jobs off the host/NUMA node, or oom-kill the
misbehaving job. If the job shape is very different from the machine
shape, knowing the workingset per-node can also help inform page
allocation policies.

Proactive reclaim
Workingset information allows a container manager to proactively
reclaim memory while not impacting a job's performance. While PSI may
provide a reactive measure of when a proactive reclaim has reclaimed too
much, workingset reporting allows the policy to be more accurate and
flexible.

Ballooning (similar to proactive reclaim)
While this patch series does not extend the virtio-balloon device,
balloon policies benefit from workingset information to more precisely
determine the size of the memory balloon. On desktops/laptops/mobile
devices where memory is scarce and overcommitted, the balloon sizing in
multiple VMs running on the same device can be orchestrated with
workingset reports from each one.

Promotion/Demotion
If different mechanisms are used for promotion and demotion, workingset
information can help connect the two and avoid pages being migrated back
and forth.
For example, given a promotion hot page threshold defined as a reaccess
distance of N seconds (promote pages accessed more often than every N
seconds), the threshold N should be set so that ~80% (e.g.) of pages on
the fast memory node pass the threshold. This calculation can be done
with workingset reports.
To be directly useful for promotion policies, the workingset report
interfaces need to be extended to report hotness and gather hotness
information from the devices[1].

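As a rough illustration of the threshold calculation described under
Promotion/Demotion (not code from this series), the Python sketch below
picks N from a histogram of page counts per interval. It assumes the
histogram for the fast memory node is already available as
(upper bound in ms, page count) pairs, and that idle age can stand in for
reaccess distance, which the current interfaces do not report. The name
pick_threshold_ms and the sample numbers are hypothetical.

```python
from typing import List, Optional, Tuple

# One histogram bin: (upper bound of the interval in ms, pages in that bin).
# This in-memory shape is an assumption made for the sketch.
Histogram = List[Tuple[int, int]]


def pick_threshold_ms(bins: Histogram,
                      target_fraction: float = 0.8) -> Optional[int]:
    """Return the smallest bin upper bound that covers target_fraction of pages.

    Bins must be ordered by increasing upper bound. Returns None when the
    histogram holds no pages at all.
    """
    total = sum(pages for _, pages in bins)
    if total == 0:
        return None
    running = 0
    for upper_ms, pages in bins:
        running += pages
        if running >= target_fraction * total:
            return upper_ms
    # Only reachable through floating-point rounding; fall back to last bin.
    return bins[-1][0]


# Hypothetical fast-node histogram: 1000 pages spread over four bins.
example_bins: Histogram = [
    (1000, 500),
    (5000, 250),
    (30000, 150),
    (2**63 - 1, 100),
]
threshold_ms = pick_threshold_ms(example_bins)
```

The result can only be as precise as the bin boundaries, so a policy
wanting a finer threshold would need more or narrower intervals.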
[1]
https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1

Sysfs and Cgroup Interfaces
==========
The interfaces are detailed in the patches that introduce them. The main
idea here is that we break down the workingset per-node per-memcg into
time intervals (ms), e.g.

1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892

I realize this does not generalize well to hotness information, but I
lack the intuition for an abstraction that presents hotness in a useful
way. Based on a recent proposal for move_phys_pages[2], it seems like
userspace tiering software would like to move specific physical pages,
instead of informing the kernel "move x number of hot pages to y
device". Please advise.

[2]
https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge.com/

Implementation
==========
Currently, the reporting of user pages is based on MGLRU, and therefore
requires CONFIG_LRU_GEN=y. We would benefit from more MGLRU generations
for a more fine-grained workingset report. I will make the generation
count configurable in the next version. The workingset reporting
mechanism is gated behind CONFIG_WORKINGSET_REPORT, and the aging thread
is behind CONFIG_WORKINGSET_REPORT_AGING.

Benchmarks
==========
Ghait Ouled Amar Ben Cheikh has implemented a simple "reclaim everything
colder than 10 seconds every 40 seconds" policy and ran Linux compile
and redis from the phoronix test suite.
The results are in his repo:
https://github.com/miloudi98/WMO

Yuanchu Xie (7):
  mm: aggregate working set information into histograms
  mm: use refresh interval to rate-limit workingset report aggregation
  mm: report workingset during memory pressure driven scanning
  mm: extend working set reporting to memcgs
  mm: add kernel aging thread for workingset reporting
  selftest: test system-wide workingset reporting
  Docs/admin-guide/mm/workingset_report: document sysfs and memcg
    interfaces

 Documentation/admin-guide/mm/index.rst       |    1 +
 .../admin-guide/mm/workingset_report.rst     |  105 ++++
 drivers/base/node.c                          |    6 +
 include/linux/memcontrol.h                   |   21 +
 include/linux/mmzone.h                       |    9 +
 include/linux/workingset_report.h            |   97 +++
 mm/Kconfig                                   |   15 +
 mm/Makefile                                  |    2 +
 mm/internal.h                                |   18 +
 mm/memcontrol.c                              |  184 +++++-
 mm/mm_init.c                                 |    2 +
 mm/mmzone.c                                  |    2 +
 mm/vmscan.c                                  |   56 +-
 mm/workingset_report.c                       |  561 ++++++++++++++++++
 mm/workingset_report_aging.c                 |  127 ++++
 tools/testing/selftests/mm/.gitignore        |    1 +
 tools/testing/selftests/mm/Makefile          |    3 +
 tools/testing/selftests/mm/run_vmtests.sh    |    5 +
 .../testing/selftests/mm/workingset_report.c |  306 ++++++++++
 .../testing/selftests/mm/workingset_report.h |   39 ++
 .../selftests/mm/workingset_report_test.c    |  330 +++++++++++
 21 files changed, 1885 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/workingset_report.rst
 create mode 100644 include/linux/workingset_report.h
 create mode 100644 mm/workingset_report.c
 create mode 100644 mm/workingset_report_aging.c
 create mode 100644 tools/testing/selftests/mm/workingset_report.c
 create mode 100644 tools/testing/selftests/mm/workingset_report.h
 create mode 100644 tools/testing/selftests/mm/workingset_report_test.c
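
For readers who want to experiment with a policy like the "reclaim
everything colder than N seconds" one mentioned under Benchmarks, the
Python sketch below parses a report in the format shown under "Sysfs and
Cgroup Interfaces" and totals the bins that lie entirely beyond a
coldness threshold. It is not taken from this series or from the
benchmark repo. It assumes the first field of each line is the upper
bound of an idle-age interval in milliseconds and that each bin's lower
bound is the previous bin's upper bound. The units of the anon/file
values are not stated here, so they are treated as opaque counts. The
names Bin, parse_report and colder_than are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Bin(NamedTuple):
    upper_ms: int  # assumed: upper bound of the idle-age interval, in ms
    anon: int      # anon amount in this interval (units not assumed)
    file: int      # file amount in this interval (units not assumed)


def parse_report(text: str) -> List[Bin]:
    """Parse lines of the form '<upper_ms> anon=<n> file=<n>'."""
    bins = []
    for line in text.strip().splitlines():
        fields = line.split()
        if not fields:
            continue
        counts = dict(field.split("=", 1) for field in fields[1:])
        bins.append(Bin(int(fields[0]), int(counts["anon"]), int(counts["file"])))
    return bins


def colder_than(bins: List[Bin], threshold_ms: int) -> Tuple[int, int]:
    """Sum (anon, file) over bins whose whole interval is >= threshold_ms.

    A bin that straddles the threshold is left out, which errs on the side
    of reclaiming less.
    """
    anon = file = 0
    lower_ms = 0
    for b in bins:
        if lower_ms >= threshold_ms:
            anon += b.anon
            file += b.file
        lower_ms = b.upper_ms
    return anon, file


# The sample report from the cover letter.
SAMPLE = """\
1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892
"""

cold_anon, cold_file = colder_than(parse_report(SAMPLE), 20000)
```

With the sample intervals, a 10 second threshold falls inside the
1000-20000 ms bin, so that bin is skipped and the result matches the
20 second case. A policy that needs a 10 second cut would need an
interval boundary at 10000 ms.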