From patchwork Thu Feb 27 21:55:39 2025
From: inwardvessel <inwardvessel@gmail.com>
To: tj@kernel.org, shakeel.butt@linux.dev, yosryahmed@google.com,
 mhocko@kernel.org, hannes@cmpxchg.org, akpm@linux-foundation.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, kernel-team@meta.com
Subject: [PATCH 0/4 v2] cgroup: separate rstat trees
Date: Thu, 27 Feb 2025 13:55:39 -0800
Message-ID: <20250227215543.49928-1-inwardvessel@gmail.com>
X-Mailer: git-send-email 2.48.1
MIME-Version: 1.0
From: JP Kobryn

The current design of rstat takes the approach that if one subsystem is
to be flushed, all other subsystems with pending updates should also be
flushed. Over time, the stat-keeping of some subsystems has grown in
size to the extent that it noticeably slows down the others. This has
been most observable when the memory controller is enabled.
One big area where the issue comes up is system telemetry, where
programs periodically sample cpu stats. Programs like these would
benefit if the overhead of flushing memory stats (and others) could be
eliminated. It would save cpu cycles for existing cpu-based telemetry
programs and improve scalability in terms of sampling frequency and
number of hosts.

This series changes the approach from "flush all subsystems" to "flush
only the requested subsystem". The core design change is moving from a
single unified rstat tree of cgroups to separate trees made up of
cgroup_subsys_state's. There is one (per-cpu) tree for the base stats
(cgroup::self) and one for each enabled subsystem (if it implements
css_rstat_flush()).

In order to do this, the rstat list pointers were moved off of the
cgroup and onto the css. In the transition, these list pointer types
were changed to cgroup_subsys_state. This allows rstat trees to be made
up of css nodes, where a given tree only contains css nodes associated
with a specific subsystem.

The rstat APIs were changed to accept a reference to a
cgroup_subsys_state instead of a cgroup. This allows callers to be
specific about which stats are being updated/flushed.

Since separate trees are now in use, the locking scheme was adjusted.
The global locks were split up in such a way that there are separate
locks for the base stats (cgroup::self) and each subsystem (memory, io,
etc). This allows different subsystems (including base stats) to use
rstat in parallel with no lock contention.

Breaking up the unified tree into separate trees eliminates the
overhead and scalability issue explained in the first section, but
comes at the expense of using additional memory. In an effort to
minimize this overhead, a conditional allocation is performed. The
cgroup_rstat_cpu struct originally contained the rstat list pointers
and the base stat entities.
This struct was renamed to cgroup_rstat_base_cpu and is only allocated
when the associated css is cgroup::self. A new compact struct was added
that contains only the rstat list pointers; when the css is associated
with an actual subsystem, this compact struct is allocated instead.
With this conditional allocation, the change in per-cpu memory overhead
is shown below.

before:
sizeof(struct cgroup_rstat_cpu) =~ 176 bytes /* can vary based on config */

	nr_cgroups * sizeof(struct cgroup_rstat_cpu)
	nr_cgroups * 176 bytes

after:
sizeof(struct cgroup_rstat_cpu) == 16 bytes
sizeof(struct cgroup_rstat_base_cpu) =~ 176 bytes

	nr_cgroups * (
		sizeof(struct cgroup_rstat_base_cpu) +
		sizeof(struct cgroup_rstat_cpu) * nr_rstat_controllers
	)
	nr_cgroups * (176 + 16 * nr_rstat_controllers)

... where nr_rstat_controllers is the number of enabled cgroup
controllers that implement css_rstat_flush(). On a host where both
memory and io are enabled:

	nr_cgroups * (176 + 16 * 2)
	nr_cgroups * 208 bytes

With regard to validation, there is a measurable benefit when reading
stats with this series. A test program was made to loop 1M times,
reading all four of the files cgroup.stat, cpu.stat, io.stat, and
memory.stat of a given parent cgroup on each iteration. This test
program was used in the experiments that follow.

The first experiment consisted of a parent cgroup with
memory.swap.max=0 and memory.max=1G. On a 52-cpu machine, 26 child
cgroups were created, and within each child cgroup a process was
spawned to frequently update the memory cgroup stats by creating and
then reading a file of size 1T (encouraging reclaim). The test program
was run alongside these 26 tasks in parallel. The results showed a
benefit in both the elapsed time and the perf data of the test program.
time before:
real	0m44.612s
user	0m0.567s
sys	0m43.887s

perf before:
27.02% mem_cgroup_css_rstat_flush
 6.35% __blkcg_rstat_flush
 0.06% cgroup_base_stat_cputime_show

time after:
real	0m27.125s
user	0m0.544s
sys	0m26.491s

perf after:
6.03% mem_cgroup_css_rstat_flush
0.37% blkcg_print_stat
0.11% cgroup_base_stat_cputime_show

Another experiment was set up on the same host using a parent cgroup
with two child cgroups. The same swap and memory limits were used as in
the previous experiment. In the two child cgroups, kernel builds were
done in parallel, each using "-j 20". The perf comparison of the test
program was very similar to the values in the previous experiment. The
time comparison is shown below.

before:
real	1m2.077s
user	0m0.784s
sys	1m0.895s

after:
real	0m32.216s
user	0m0.709s
sys	0m31.256s

Note that the above two experiments were also done with a modified test
program that only reads the cpu.stat file and none of the other three
files previously mentioned. The results were similar to what was seen
in the v1 email for this series. See the changelog for a link to v1 if
needed.

For the final experiment, perf events were recorded during a kernel
build with the same host and cgroup setup. The builds took place in the
child nodes. The control and experimental sides both showed similar
cycles spent on cgroup_rstat_updated(), which appeared insignificant
compared to the other events recorded with the workload.

changelog
v2:
- drop the patch creating a new cgroup_rstat struct and related code
- drop the bpf-specific patches; instead just use cgroup::self in bpf
  progs
- drop the cpu lock patches; instead select the cpu lock in the
  updated_list func
- relocate the cgroup_rstat_init() call to inside css_create()
- relocate the cgroup_rstat_exit() cleanup from apply_control_enable()
  to css_free_rwork_fn()

v1: https://lore.kernel.org/all/20250218031448.46951-1-inwardvessel@gmail.com/

JP Kobryn (4):
  cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
  cgroup: rstat lock indirection
  cgroup: separate rstat locks for subsystems
  cgroup: separate rstat list pointers from base stats

 block/blk-cgroup.c                            |   4 +-
 include/linux/cgroup-defs.h                   |  67 ++--
 include/linux/cgroup.h                        |   8 +-
 kernel/cgroup/cgroup-internal.h               |   4 +-
 kernel/cgroup/cgroup.c                        |  53 +--
 kernel/cgroup/rstat.c                         | 318 +++++++++++-------
 mm/memcontrol.c                               |   4 +-
 .../selftests/bpf/progs/btf_type_tag_percpu.c |   5 +-
 .../bpf/progs/cgroup_hierarchical_stats.c     |   8 +-
 9 files changed, 276 insertions(+), 195 deletions(-)