From patchwork Tue May 9 18:54:17 2023
X-Patchwork-Submitter: Yuanchu Xie
X-Patchwork-Id: 13236087
Date: Wed, 10 May 2023 02:54:17 +0800
Message-ID: <20230509185419.1088297-1-yuanchu@google.com>
Subject: [RFC PATCH 0/2] mm: Working Set Reporting
From: Yuanchu Xie
To: David Hildenbrand, "Sudarshan Rajagopalan (QUIC)", kai.huang@intel.com,
 hch@lst.de, jon@nutanix.com
Cc: SeongJae Park, Shakeel Butt, Aneesh Kumar K V, Greg Kroah-Hartman,
 "Rafael J. Wysocki", "Michael S. Tsirkin", Jason Wang, Andrew Morton,
 Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Yu Zhao,
 "Matthew Wilcox (Oracle)", Yosry Ahmed, Vasily Averin, talumbau,
 Yuanchu Xie, linux-kernel@vger.kernel.org,
 virtualization@lists.linux-foundation.org, linux-mm@kvack.org,
 cgroups@vger.kernel.org
Background
==========
For both clients and servers, workloads can be containerized with
virtual machines, Kubernetes containers, or memcgs. The workloads
differ between servers and clients.

Server jobs have more predictable memory footprints, and are concerned
about stability and performance. One technique is proactive reclaim,
which reclaims memory ahead of memory pressure, and makes apparent the
amount of actually free memory on a machine.

Client applications are more bursty and unpredictable since they react
to user interactions. The system needs to respond quickly to interesting
events, and be aware of energy usage.

An overcommitted machine can scale the containers' footprint through
memory.max/high, virtio-balloon, etc.

The balloon device is a typical mechanism for sharing memory between a
guest VM and host. It is particularly useful in multi-VM scenarios where
memory is overcommitted and dynamic changes to VM memory size are
required as workloads change on the system.
The balloon device now has a number of features to assist in judiciously
sharing memory resources amongst the guests and host (e.g. free page
hinting, stats, free page reporting). A host controller program tasked
with optimizing memory resources in a multi-VM environment must use
these tools to answer two concrete questions:
    1. When is the right time to modify the balloon?
    2. How much should the balloon be changed by?

An early project to develop such an "auto-balloon" capability was done
in 2013 [1]. More recently, additional VIRTIO devices have been created
(virtio-mem, virtio-pmem) that offer more tools for a number of use
cases, each with advantages and disadvantages (see [2] for a recent
overview by Red Hat of this space).

A previous proposal to extend MGLRU with working set interfaces [3]
focuses on the server use cases but does not work for clients.

Proposal
==========
This series proposes a unified Working Set reporting structure that
works for both servers and clients. It involves per-node histograms on
the host, per-memcg histograms, and a virtio-balloon driver extension.

There are two ways of working with Working Set reporting: event-driven
and querying. The host controller can receive notifications from
reclaim, which produces a report, or the controller can query for the
histogram directly.

Patch 1 introduces the Working Set reporting mechanism and the host
interfaces. See the Details section for
Patch 2 extends the virtio-balloon driver with Working Set reporting.

The initial RFC builds on MGLRU and is intended to be a Proof of Concept
for discussion and refinements. T.J. and I aim to support the
active/inactive LRU and working set estimation from userspace. We are
working on demo scripts and getting some numbers as well.
The RFC is a bit hacky and should be built with these configs:
CONFIG_LRU_GEN=y
CONFIG_LRU_GEN_ENABLED=y
CONFIG_VIRTIO_BALLOON=y
CONFIG_WSS=y

Host
==========
On the host side, a few sysfs files are added to monitor the working set
of the host. On a CONFIG_NUMA system, they live under
"/sys/devices/system/node/nodeX/wss/", otherwise they are under
"/sys/kernel/mm/wss/". They are mostly read/write tunables except for
the histogram. The files work as follows:

report_ms:
        Read-write, specifies report threshold in milliseconds, min
        value 0, max value LONG_MAX. 0 disables working set reporting.
        A rate-limiting factor that prevents frequent aging from
        generating reports too fast. For example, with a report
        threshold of 500ms, if aging happens 3 times within 500ms, the
        first one generates a wss report and the rest are ignored.
        Example:
        $ echo 1000 > report_ms

refresh_ms:
        Read-write, specifies refresh threshold in milliseconds, min
        value 0, max value LONG_MAX. 0 ensures that every histogram
        read produces a new report.
        A rate-limiting factor that prevents working set histogram
        reads from triggering aging too frequently. For example, with a
        refresh threshold of 10,000ms, if a wss report was generated
        within the past 10,000ms, reading wss/histogram does not
        perform aging; otherwise, aging occurs and a new wss report is
        generated and read. Generating a report can block for the
        period of time that it takes to complete aging.
        Example:
        $ echo 10000 > refresh_ms

intervals_ms:
        Read-write, specifies bin intervals in milliseconds, min value
        1, max value LONG_MAX.
        Example:
        $ echo 1000,2000,3000,4000 > intervals_ms

histogram:
        Read-only, prints wss report for this node in the format of:
        <interval in ms> anon=<bytes> file=<bytes>
        <...>
        Reading it may trigger aging if the refresh threshold has passed.
        On poll, it waits until kswapd performs aging on this node, and
        notifies subject to the rate-limiting threshold set by
        report_ms.
        A per-node histogram that captures the number of bytes of user
        memory in each working set bin. It reports the anon and file
        pages separately for each bin. It does not track other types of
        memory, e.g. hugetlb or kernel memory.
        Example, note that the last bin is a catch-all bin that comes
        after all the intervals_ms bins:
        $ cat histogram
        1000 anon=618 file=10
        2000 anon=0 file=0
        3000 anon=72 file=0
        4000 anon=83 file=0
        9223372036854775807 anon=1004 file=182

A per-memcg interface is also included, to enable the use cases where
one may use memcgs to manage applications on the host, along with VMs.
The files are:
memory.wss.report_ms
memory.wss.refresh_ms
memory.wss.intervals_ms
memory.wss.histogram

They support per-node configurations by requiring the node to be
specified (one node at a time), e.g.
$ echo N0=1000 > memory.wss.report_ms
$ echo N1=3000 > memory.wss.report_ms
$ echo N0=1000,2000,3000,4000 > memory.wss.intervals_ms
$ cat memory.wss.intervals_ms
N0=1000,2000,4000,9223372036854775807 N1=9223372036854775807
$ cat memory.wss.histogram
N0
1000 anon=6330 file=0
2000 anon=72 file=0
4000 anon=0 file=0
9223372036854775807 anon=0 file=0
N1
9223372036854775807 anon=0 file=0

A reaccess histogram is also implemented for memcgs. The files are:
memory.reaccess.intervals_ms
memory.reaccess.histogram

The interface formats are identical to those of the memory.wss.* files.
Writing to memory.reaccess.intervals_ms clears the histogram for the
corresponding node. The reaccess histogram is a per-node histogram of
page counters. When a page is discovered to be reaccessed during
scanning, the counter for the bin the page was previously in is
incremented. For server use cases, the workload memory access pattern
is fairly predictable. A proactive reclaimer can use the reaccess
information to determine the right bin to reclaim.
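To make the last point concrete, here is a minimal Python sketch of how
a userspace reclaimer might read this layout and choose a bin. It is not
part of the series: parse_memcg_histogram() and pick_reclaim_bin() are
hypothetical helpers, the max_reaccess cutoff is an assumed policy, and
the parser assumes whitespace-separated tokens laid out as in the
memory.wss.histogram example above.

```python
import re
from typing import Dict, List, Optional, Tuple

# (bin upper bound in ms, anon counter, file counter)
Bin = Tuple[int, int, int]


def _field(token: str, name: str) -> int:
    """Return the integer value of a 'name=value' token."""
    key, _, value = token.partition("=")
    if key != name:
        raise ValueError(f"expected {name}=..., got {token!r}")
    return int(value)


def parse_memcg_histogram(text: str) -> Dict[int, List[Bin]]:
    """Parse the N<node>-prefixed layout into {node id: [bins]}.

    Assumption: tokens are whitespace-separated, a token such as 'N0'
    starts a node, and each bin is '<upper_ms> anon=<n> file=<n>'.
    """
    nodes: Dict[int, List[Bin]] = {}
    current: Optional[List[Bin]] = None
    tokens = text.split()
    i = 0
    while i < len(tokens):
        header = re.fullmatch(r"N(\d+)", tokens[i])
        if header:
            current = nodes.setdefault(int(header.group(1)), [])
            i += 1
            continue
        if current is None:
            raise ValueError("bin found before any node header")
        if i + 2 >= len(tokens):
            raise ValueError("truncated bin entry")
        current.append((int(tokens[i]),
                        _field(tokens[i + 1], "anon"),
                        _field(tokens[i + 2], "file")))
        i += 3
    return nodes


def pick_reclaim_bin(bins: List[Bin], max_reaccess: int) -> Optional[int]:
    """Hypothetical policy: walk from the oldest bin towards the youngest
    and keep extending while each bin saw at most max_reaccess reaccesses.
    Returns the upper bound (ms) of the youngest acceptable bin, or None
    if even the oldest bin is reaccessed too often.
    """
    chosen = None
    for upper_ms, anon, file in sorted(bins, reverse=True):
        if anon + file > max_reaccess:
            break
        chosen = upper_ms
    return chosen
```

For a node whose bins 1000, 2000, 4000 and the catch-all hold 6330, 72,
0 and 0 reaccesses, a cutoff of 100 selects the 2000 bin and a cutoff of
10 selects the 4000 bin.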
Example, where 72 instances of reaccess were discovered for pages idle
for 1000ms-2000ms during scanning:
$ cat memory.reaccess.histogram
N0
1000 anon=6330 file=0
2000 anon=72 file=0
4000 anon=0 file=0
9223372036854775807 anon=0 file=0
N1
9223372036854775807 anon=0 file=0

virtio-balloon
==========
The Working Set reporting mechanism presented in the first patch in this
series provides a mechanism to assist a controller in making such
balloon adjustments. There are two components in this patch:
- The virtio-balloon driver has a new feature (VIRTIO_F_WS_REPORTING)
  to standardize the configuration and communication of Working Set
  reports to the device.
- A stand-in interface for connecting MM activities (here, only
  background reclaim) to a client (here, just the balloon driver) so
  that the driver can be notified at appropriate times when a new
  Working Set report is available (and would be useful to share).

By providing a "hook" into reclaim activities, we can provide a
mechanism for timely updates (i.e. when the guest is under memory
pressure). By providing a uniform reporting structure in both the host
and all guests, a global picture of memory utilization can be
reconstructed in the controller, thus helping to answer the question of
how much to adjust the balloon.

The reporting mechanism can be combined with a domain-specific balloon
policy in an overcommitted multi-VM scenario, providing balloon
adjustments to drive the separate reclaim activities in a coordinated
fashion.

TODO:
- Specify a proper interface for clients to register for Working Set
  reports, using the shrinker interface as a guide.
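As one illustration of the "how much" question, a controller could sum
the bytes sitting in bins older than some idle cutoff and treat a
fraction of that as a candidate balloon increase. The Python sketch
below assumes the per-node histogram layout from the Host section, with
each bin labelled by its upper bound. parse_node_histogram(),
cold_bytes(), suggest_inflate_bytes(), the idle cutoff, and the
percentage are all hypothetical and are not something this series
defines.

```python
from typing import List, Tuple

# (bin upper bound in ms, anon bytes, file bytes)
Bin = Tuple[int, int, int]


def parse_node_histogram(text: str) -> List[Bin]:
    """Parse lines of the form '<upper_ms> anon=<bytes> file=<bytes>'."""
    bins: List[Bin] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        upper, anon, file = line.split()
        bins.append((int(upper),
                     int(anon.partition("=")[2]),
                     int(file.partition("=")[2])))
    return bins


def cold_bytes(bins: List[Bin], min_idle_ms: int) -> int:
    """Sum anon + file bytes in bins whose lower edge is >= min_idle_ms.

    Assumption: bins are in ascending order and each bin covers the
    range from the previous bin's upper bound (0 for the first bin) up
    to its own upper bound.
    """
    total = 0
    lower_ms = 0
    for upper_ms, anon, file in bins:
        if lower_ms >= min_idle_ms:
            total += anon + file
        lower_ms = upper_ms
    return total


def suggest_inflate_bytes(bins: List[Bin], min_idle_ms: int,
                          percent: int) -> int:
    """Hypothetical policy: reclaim a fixed percentage of the cold bytes."""
    return cold_bytes(bins, min_idle_ms) * percent // 100
```

With the per-node example from the Host section, a 3000ms cutoff gives
83 + 1186 = 1269 cold bytes, and a 50 percent policy would suggest
inflating by 634 bytes. A real policy would also need to round to page
granularity and combine reports from the host and every guest.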
References:
[1] https://www.linux-kvm.org/page/Projects/auto-ballooning
[2] https://kvmforum2020.sched.com/event/eE4U/virtio-balloonpmemmem-managing-guest-memory-david-hildenbrand-michael-s-tsirkin-red-hat
[3] https://lore.kernel.org/linux-mm/20221214225123.2770216-1-yuanchu@google.com/

talumbau (2):
  mm: multigen-LRU: working set reporting
  virtio-balloon: Add Working Set reporting

 drivers/base/node.c                 |   2 +
 drivers/virtio/virtio_balloon.c     | 243 +++++++++++-
 include/linux/balloon_compaction.h  |   6 +
 include/linux/memcontrol.h          |   6 +
 include/linux/mmzone.h              |  14 +-
 include/linux/wss.h                 |  57 +++
 include/uapi/linux/virtio_balloon.h |  21 +
 mm/Kconfig                          |   7 +
 mm/Makefile                         |   1 +
 mm/memcontrol.c                     | 349 ++++++++++++++++-
 mm/mmzone.c                         |   2 +
 mm/vmscan.c                         | 581 +++++++++++++++++++++++++++-
 mm/wss.c                            |  56 +++
 13 files changed, 1341 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/wss.h
 create mode 100644 mm/wss.c