From patchwork Tue Apr 15 02:45:04 2025
From: Muchun Song <songmuchun@bytedance.com>
To: hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev,
    shakeel.butt@linux.dev, muchun.song@linux.dev, akpm@linux-foundation.org,
    david@fromorbit.com, zhengqi.arch@bytedance.com, yosry.ahmed@linux.dev,
    nphamcs@gmail.com, chengming.zhou@linux.dev
Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org,
    hamzamahfooz@linux.microsoft.com, apais@linux.microsoft.com,
    Muchun Song <songmuchun@bytedance.com>
Subject: [PATCH RFC 00/28] Eliminate Dying Memory Cgroup
Date: Tue, 15 Apr 2025 10:45:04 +0800
Message-Id: <20250415024532.26632-1-songmuchun@bytedance.com>
This patchset is based on v6.15-rc2. It functions correctly only when
CONFIG_LRU_GEN (Multi-Gen LRU) is disabled. Several issues were encountered
during rebasing onto the latest code. For more details and assistance, refer
to the "Challenges" section. This is the reason for adding the RFC tag.

## Introduction

This patchset is intended to transfer the LRU pages to the object cgroup
without holding a reference to the original memory cgroup in order to
address the issue of the dying memory cgroup. A consensus has already been
reached regarding this approach recently [1].

## Background

The issue of a dying memory cgroup refers to a situation where a memory
cgroup is no longer in use by users, yet the memory (the metadata
associated with the memory cgroup) remains allocated to it. This situation
may potentially result in memory leaks or inefficiencies in memory
reclamation and has persisted as an issue for several years. Any memory
allocation that outlives the lifespan (from the users' perspective) of a
memory cgroup can lead to a dying memory cgroup.

We have made considerable efforts to tackle this problem by introducing the
object cgroup infrastructure [2]. Presently, numerous types of objects
(slab objects, non-slab kernel allocations, per-CPU objects) are charged to
the object cgroup without holding a reference to the original memory
cgroup. The remaining allocations, the LRU pages (anonymous pages and file
pages), are charged at allocation time and continue to hold a reference to
the original memory cgroup until they are reclaimed.

File pages are more complex than anonymous pages, as they can be shared
among different memory cgroups and may persist beyond the lifespan of the
memory cgroup. The long-term pinning of file pages to memory cgroups is a
widespread issue that causes recurring problems in practical scenarios [3].
File pages remain unreclaimed for extended periods.
Additionally, they are accessed by successive instances (second, third,
fourth, etc.) of the same job, which is restarted into a new cgroup each
time. As a result, unreclaimable dying memory cgroups accumulate, leading
to memory wastage and significantly reducing the efficiency of page
reclamation.

## Fundamentals

A folio will no longer pin its corresponding memory cgroup. It is therefore
necessary to ensure that the memory cgroup, or the lruvec associated with
it, is not released while a user holds a pointer returned by folio_memcg()
or folio_lruvec().

Users who are not concerned about the stability of the binding between the
folio and its corresponding memory cgroup are required to hold the RCU read
lock or acquire a reference to the memory cgroup associated with the folio
to prevent its release.

However, some users of folio_lruvec() (i.e., of the lruvec lock) desire a
stable binding between the folio and its corresponding memory cgroup. An
approach is needed to ensure the stability of the binding while the lruvec
lock is held, and to detect the situation of holding the incorrect lruvec
lock when a race with memory cgroup reparenting occurs. The following four
steps are taken to achieve these goals.

1. The first step is to identify all users of both functions
   (folio_memcg() and folio_lruvec()) who are not concerned about binding
   stability and implement appropriate measures (such as holding the RCU
   read lock or temporarily obtaining a reference to the memory cgroup for
   a brief period) to prevent the memory cgroup from being released.

2. Secondly, the following refactoring of folio_lruvec_lock() demonstrates
   how to ensure the binding stability from the perspective of users of
   folio_lruvec():
```c
struct lruvec *folio_lruvec_lock(struct folio *folio)
{
	struct lruvec *lruvec;

	rcu_read_lock();
retry:
	lruvec = folio_lruvec(folio);
	spin_lock(&lruvec->lru_lock);
	if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
		spin_unlock(&lruvec->lru_lock);
		goto retry;
	}

	return lruvec;
}
```

   From the perspective of memory cgroup removal, the entire reparenting
   process (altering the binding relationship between a folio and its
   memory cgroup, and moving the LRU lists to the parent memory cgroup)
   should be carried out under both the lruvec lock of the memory cgroup
   being removed and the lruvec lock of its parent.

3. Thirdly, another lock that requires the same approach is the THP
   split-queue lock.

4. Finally, transfer the LRU pages to the object cgroup without holding a
   reference to the original memory cgroup.

## Challenges

In a non-MGLRU scenario, each lruvec of every memory cgroup comprises four
LRU lists (i.e., two active lists for anonymous and file folios, and two
inactive lists for anonymous and file folios). Due to the symmetry of the
LRU lists, it is feasible to transfer the LRU lists from a memory cgroup to
its parent memory cgroup during the reparenting process.

In an MGLRU scenario, each lruvec of every memory cgroup comprises at least
two (MIN_NR_GENS) and at most four (MAX_NR_GENS) generations.

1. The first question is how to move the LRU lists from a memory cgroup to
   its parent during the reparenting process, since the number of LRU lists
   (aka generations) may differ between a child memory cgroup and its
   parent.

2. The second question is how to make the reparenting process more
   efficient. Each folio charged to a memory cgroup stores its generation
   counter in its ->flags, and the generation counter may differ between a
   child memory cgroup and its parent because their ->min_seq and ->max_seq
   values are not identical. Should those generation counters be updated
   correspondingly?
I am uncertain how to handle them appropriately, as I am not an expert in
MGLRU, and I would appreciate any suggestions. Moreover, if you are willing
to provide patches directly, I would be glad to incorporate them into this
patchset.

## Compositions

Patches 1-8 involve code refactoring and cleanup with the aim of
facilitating the transfer of LRU folios to the object cgroup
infrastructure.

Patches 9-10 allocate the object cgroup for the non-kmem case, enabling LRU
folios to be charged to it and aligning the behavior of
object-cgroup-related APIs with that of the memory cgroup.

Patches 11-19 prevent the memory cgroup returned by folio_memcg() from
being released.

Patches 20-23 prevent the lruvec returned by folio_lruvec() from being
released.

Patches 24-25 implement the core mechanism that guarantees binding
stability between a folio and its corresponding memory cgroup while the
lruvec lock or the THP split-queue lock is held.

Patches 26-27 transfer the LRU pages to the object cgroup without holding a
reference to the original memory cgroup, in order to address the issue of
the dying memory cgroup.

Patch 28 adds VM_WARN_ON_ONCE_FOLIO to the LRU maintenance helpers to
ensure correct folio operations in the future.

## Effect

Finally, it can be observed that the quantity of dying memory cgroups does
not increase significantly when the following test script is executed to
reproduce the issue.
```bash
#!/bin/bash

# Create a temporary file 'temp' filled with zero bytes
dd if=/dev/zero of=temp bs=4096 count=1

# Display memory-cgroup info from /proc/cgroups
cat /proc/cgroups | grep memory

for i in {0..2000}
do
        mkdir /sys/fs/cgroup/memory/test$i
        echo $$ > /sys/fs/cgroup/memory/test$i/cgroup.procs

        # Append 'temp' file content to 'log'
        cat temp >> log

        echo $$ > /sys/fs/cgroup/memory/cgroup.procs

        # Potentially create a dying memory cgroup
        rmdir /sys/fs/cgroup/memory/test$i
done

# Display memory-cgroup info after test
cat /proc/cgroups | grep memory

rm -f temp log
```

## References

[1] https://lore.kernel.org/linux-mm/Z6OkXXYDorPrBvEQ@hm-sls2/
[2] https://lwn.net/Articles/895431/
[3] https://github.com/systemd/systemd/pull/36827

Muchun Song (28):
  mm: memcontrol: remove dead code of checking parent memory cgroup
  mm: memcontrol: use folio_memcg_charged() to avoid potential rcu lock
    holding
  mm: workingset: use folio_lruvec() in workingset_refault()
  mm: rename unlock_page_lruvec_irq and its variants
  mm: thp: replace folio_memcg() with folio_memcg_charged()
  mm: thp: introduce folio_split_queue_lock and its variants
  mm: thp: use folio_batch to handle THP splitting in
    deferred_split_scan()
  mm: vmscan: refactor move_folios_to_lru()
  mm: memcontrol: allocate object cgroup for non-kmem case
  mm: memcontrol: return root object cgroup for root memory cgroup
  mm: memcontrol: prevent memory cgroup release in
    get_mem_cgroup_from_folio()
  buffer: prevent memory cgroup release in folio_alloc_buffers()
  writeback: prevent memory cgroup release in writeback module
  mm: memcontrol: prevent memory cgroup release in
    count_memcg_folio_events()
  mm: page_io: prevent memory cgroup release in page_io module
  mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
  mm: mglru: prevent memory cgroup release in mglru
  mm: memcontrol: prevent memory cgroup release in
    mem_cgroup_swap_full()
  mm: workingset: prevent memory cgroup release in lru_gen_eviction()
  mm: workingset: prevent lruvec release in workingset_refault()
  mm: zswap: prevent lruvec release in zswap_folio_swapin()
  mm: swap: prevent lruvec release in swap module
  mm: workingset: prevent lruvec release in workingset_activation()
  mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
  mm: thp: prepare for reparenting LRU pages for split queue lock
  mm: memcontrol: introduce memcg_reparent_ops
  mm: memcontrol: eliminate the problem of dying memory cgroup for LRU
    folios
  mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers

 fs/buffer.c                      |   4 +-
 fs/fs-writeback.c                |  22 +-
 include/linux/memcontrol.h       | 190 ++++++------
 include/linux/mm_inline.h        |   6 +
 include/trace/events/writeback.h |   3 +
 mm/compaction.c                  |  43 ++-
 mm/huge_memory.c                 | 218 +++++++++-----
 mm/memcontrol-v1.c               |  15 +-
 mm/memcontrol.c                  | 476 +++++++++++++++++++------------
 mm/migrate.c                     |   2 +
 mm/mlock.c                       |   2 +-
 mm/page_io.c                     |   8 +-
 mm/percpu.c                      |   2 +-
 mm/shrinker.c                    |   6 +-
 mm/swap.c                        |  22 +-
 mm/vmscan.c                      |  73 ++--
 mm/workingset.c                  |  26 +-
 mm/zswap.c                       |   2 +
 18 files changed, 696 insertions(+), 424 deletions(-)