From patchwork Fri Jan 3 01:50:11 2025
X-Patchwork-Submitter: JP Kobryn
X-Patchwork-Id: 13925082
From: JP Kobryn
To: shakeel.butt@linux.dev, tj@kernel.org, mhocko@kernel.org,
    hannes@cmpxchg.org, yosryahmed@google.com, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org
Subject: [RFC PATCH 0/9 v2] cgroup: separate per-subsystem rstat trees
Date: Thu, 2 Jan 2025 17:50:11 -0800
Message-ID: <20250103015020.78547-1-inwardvessel@gmail.com>

The current rstat model is set up to keep track of cgroup stats on a per-cpu
basis. When a stat (of any subsystem) is updated, the updater notes this change
using the cgroup_rstat_updated() API call. This change is propagated to the
cpu-specific rstat tree by appending the updated cgroup to the tree (unless
it's already on the tree). So for each cpu, an rstat tree will consist of the
cgroups that reported one or more updated stats. Later on, when a flush is
requested via cgroup_rstat_flush(), each per-cpu rstat tree is traversed
starting at the requested cgroup and the subsystem-specific flush callbacks
(via css_rstat_flush) are invoked along the way. During the flush, the section
of the tree starting at the requested cgroup through its descendants is
removed.

Using the cgroup struct to represent nodes of change means that the changes
represented by a given tree are heterogeneous - the tree can consist of nodes
that have changes from different subsystems; e.g. changes in stats from the
memory subsystem and the io subsystem can coexist in the same tree. The
implication is that when a flush is requested, usually in the context of a
single subsystem, all other subsystems need to be flushed along with it.
This seems to have become a drawback due to how expensive the flushing of the
memory-specific stats has become [0][1]. Another implication is that when
updates are performed, subsystems may contend with each other over the locks
involved.

I've been experimenting with an idea that allows for isolating the updating and
flushing of cgroup stats on a per-subsystem basis. The idea is that instead of
having a single per-cpu rstat tree for managing stats across all subsystems, we
could split up the per-cpu trees into separate trees for each subsystem. So
each cpu would have one tree per subsystem. It would allow subsystems to update
and flush their stats without any contention or extra overhead from other
subsystems. The core change is moving ownership of the rstat entities from the
cgroup struct onto the cgroup_subsys_state struct.

To complement the ownership change, the locking scheme was adjusted. The global
cgroup_rstat_lock for synchronizing updates and flushes was replaced with
subsystem-specific locks (in the cgroup_subsys struct). An additional global
lock was added to allow the base stats pseudo-subsystem to be synchronized in a
similar way. The per-cpu locks called cgroup_rstat_cpu_lock have changed to a
per-cpu array of locks which is indexed by subsystem id. Following suit, there
is also a per-cpu array of locks dedicated to the base subsystem. The dedicated
locks for the base stats were added because the base stats have a NULL
subsystem, so they did not fit the subsystem id index approach.

I reached a point where this started to feel stable in my local testing, so I
wanted to share and get feedback on this approach.

[0] https://lore.kernel.org/all/CAOm-9arwY3VLUx5189JAR9J7B=Miad9nQjjet_VNdT3i+J+5FA@mail.gmail.com/
[1] https://github.blog/engineering/debugging-network-stalls-on-kubernetes/

Changelog
v2: Updated cover letter and some patch text. No code changes.
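
For readers less familiar with rstat, the per-subsystem split described above
can be illustrated with a small user-space sketch. This is not the code in the
series and makes several simplifying assumptions: every name in it
(css_sketch, sketch_updated, sketch_flush, NR_CPUS, NR_SUBSYS, SS_MEMORY,
SS_IO) is hypothetical, the per-cpu trees are flattened into singly linked
lists with no cgroup hierarchy, and locking is left out entirely. It only
shows the indexing idea: one "updated" list per cpu per subsystem, so a flush
of one subsystem never visits nodes queued by another.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS   2	/* hypothetical: kept small for illustration */
#define NR_SUBSYS 2	/* hypothetical: only two subsystems modeled */

enum { SS_MEMORY, SS_IO };	/* hypothetical subsystem ids */

/* Stand-in for a css: one instance per (cgroup, subsystem) pair. */
struct css_sketch {
	int ss_id;				/* owning subsystem */
	long pending[NR_CPUS];			/* per-cpu deltas not yet flushed */
	long total;				/* aggregated, flushed value */
	bool queued[NR_CPUS];			/* already on this cpu's list? */
	struct css_sketch *next[NR_CPUS];	/* link in the per-cpu list */
};

/*
 * One list head per cpu AND per subsystem. The shared model described
 * in the cover letter would correspond to a single head per cpu.
 */
static struct css_sketch *updated[NR_CPUS][NR_SUBSYS];

/* Record a stat change on @cpu and queue the css once on that cpu's list. */
void sketch_updated(struct css_sketch *css, int cpu, long delta)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	assert(css->ss_id >= 0 && css->ss_id < NR_SUBSYS);

	css->pending[cpu] += delta;
	if (!css->queued[cpu]) {
		css->queued[cpu] = true;
		css->next[cpu] = updated[cpu][css->ss_id];
		updated[cpu][css->ss_id] = css;
	}
}

/*
 * Flush a single subsystem: walk only that subsystem's per-cpu lists,
 * fold pending deltas into the total and unlink each node.
 * Returns the number of list nodes visited.
 */
int sketch_flush(int ss_id)
{
	int visited = 0;

	assert(ss_id >= 0 && ss_id < NR_SUBSYS);

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		struct css_sketch *css = updated[cpu][ss_id];

		updated[cpu][ss_id] = NULL;
		while (css) {
			struct css_sketch *next = css->next[cpu];

			css->total += css->pending[cpu];
			css->pending[cpu] = 0;
			css->queued[cpu] = false;
			css->next[cpu] = NULL;
			visited++;
			css = next;
		}
	}
	return visited;
}
```

In this sketch, calling sketch_updated() for a memory css and an io css and
then sketch_flush(SS_MEMORY) leaves the io css queued with its pending delta
intact. A model with a single list per cpu would have to visit both.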

JP Kobryn (8):
  change cgroup to css in rstat updated and flush api
  change cgroup to css in rstat internal flush and lock funcs
  change cgroup to css in rstat init and exit api
  split rstat from cgroup into separate css
  separate locking between base css and others
  isolate base stat flush
  remove unneeded rcu list
  remove bpf rstat flush from css generic flush

 block/blk-cgroup.c              |   4 +-
 include/linux/cgroup-defs.h     |  35 ++---
 include/linux/cgroup.h          |   8 +-
 kernel/cgroup/cgroup-internal.h |   4 +-
 kernel/cgroup/cgroup.c          |  79 ++++++-----
 kernel/cgroup/rstat.c           | 225 +++++++++++++++++++-------------
 mm/memcontrol.c                 |   4 +-
 7 files changed, 203 insertions(+), 156 deletions(-)