
[RFC,0/9,v2] cgroup: separate per-subsystem rstat trees

Message ID 20250103015020.78547-1-inwardvessel@gmail.com (mailing list archive)


JP Kobryn Jan. 3, 2025, 1:50 a.m. UTC
The current rstat model is set up to keep track of cgroup stats on a per-cpu
basis. When a stat (of any subsystem) is updated, the updater notes this change
using the cgroup_rstat_updated() API call. This change is propagated to the
cpu-specific rstat tree, by appending the updated cgroup to the tree (unless
it's already on the tree). So for each cpu, an rstat tree will consist of the
cgroups that reported one or more updated stats. Later on when a flush is
requested via cgroup_rstat_flush(), each per-cpu rstat tree is traversed
starting at the requested cgroup and the subsystem-specific flush callbacks
(via css_rstat_flush) are invoked along the way. During the flush, the section
of the tree starting at the requested cgroup through its descendants is
removed.
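The update/flush cycle above can be illustrated with a small userspace toy
model. This is not the kernel code: the real per-cpu structures follow the
cgroup hierarchy and use lockless list primitives, while this sketch uses a
flat singly linked list per cpu and invented field names, purely to show the
"append on update, drain on flush" shape of the API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 2

/* Toy cgroup: linkage for each per-cpu updated list, a pending delta
 * per cpu, and a flushed total (stand-in for real subsystem stats). */
struct cgroup {
	struct cgroup *rstat_next[NR_CPUS];
	bool on_list[NR_CPUS];
	long pending[NR_CPUS];
	long total;
};

static struct cgroup *updated_head[NR_CPUS];

/* Models cgroup_rstat_updated(): note a delta on @cpu and append the
 * cgroup to that cpu's updated list unless it is already on it. */
static void rstat_updated(struct cgroup *cgrp, int cpu, long delta)
{
	cgrp->pending[cpu] += delta;
	if (!cgrp->on_list[cpu]) {
		cgrp->rstat_next[cpu] = updated_head[cpu];
		updated_head[cpu] = cgrp;
		cgrp->on_list[cpu] = true;
	}
}

/* Models cgroup_rstat_flush(): walk every per-cpu list, fold pending
 * deltas into the totals, and remove the visited nodes. */
static void rstat_flush(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		while (updated_head[cpu]) {
			struct cgroup *cgrp = updated_head[cpu];

			updated_head[cpu] = cgrp->rstat_next[cpu];
			cgrp->on_list[cpu] = false;
			cgrp->total += cgrp->pending[cpu];
			cgrp->pending[cpu] = 0;
		}
	}
}
```

Note how the lists are keyed only by cpu: nodes from every subsystem land on
the same per-cpu list, which is exactly the property the series changes.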

Using the cgroup struct to represent nodes of change means that the changes
represented by a given tree are heterogeneous - the tree can consist of nodes
that have changes from different subsystems; i.e. changes in stats from the
memory subsystem and the io subsystem can coexist in the same tree. The
implication is that when a flush is requested, usually in the context of a
single subsystem, all other subsystems need to be flushed along with it. This
has become a drawback now that flushing the memory-specific stats is so
expensive [0][1]. Another implication is that when updates are performed,
subsystems may contend with each other over the locks involved.

I've been experimenting with an idea that isolates the updating and flushing
of cgroup stats on a per-subsystem basis. Instead of having one per-cpu rstat
tree for managing stats across all subsystems, each cpu gets a separate tree
for each subsystem. This allows subsystems to update and flush their stats
without any contention or extra overhead from other subsystems. The core
change is moving ownership of the rstat entities from the cgroup struct onto
the cgroup_subsys_state struct.
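Continuing the toy model, the ownership move amounts to keying the updated
lists by (subsystem, cpu) instead of by cpu alone. Again these are
illustrative names, not the patch's actual fields: `struct css` stands in for
cgroup_subsys_state, and `ssid` for the subsystem id.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS   2
#define NR_SUBSYS 3	/* e.g. 0 = memory, 1 = io, 2 = cpu */

/* Stand-in for cgroup_subsys_state: the rstat linkage now lives here,
 * so a css only ever appears on its own subsystem's trees. */
struct css {
	int ssid;
	struct css *rstat_next[NR_CPUS];
	bool on_list[NR_CPUS];
};

/* One updated-list head per (subsystem, cpu) pair instead of per cpu. */
static struct css *updated_head[NR_SUBSYS][NR_CPUS];

static void css_rstat_updated(struct css *css, int cpu)
{
	if (!css->on_list[cpu]) {
		css->rstat_next[cpu] = updated_head[css->ssid][cpu];
		updated_head[css->ssid][cpu] = css;
		css->on_list[cpu] = true;
	}
}

/* Flushing one subsystem walks only that subsystem's per-cpu lists;
 * other subsystems' pending updates are never touched. */
static void css_rstat_flush(int ssid)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		while (updated_head[ssid][cpu]) {
			struct css *c = updated_head[ssid][cpu];

			updated_head[ssid][cpu] = c->rstat_next[cpu];
			c->on_list[cpu] = false;
			/* the subsystem's flush callback would run here */
		}
	}
}
```

Flushing memory in this model drains only `updated_head[0][*]`, leaving io's
and cpu's pending nodes in place, which is the isolation the series is after.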

To complement the ownership change, the locking scheme was adjusted. The
global cgroup_rstat_lock for synchronizing updates and flushes was replaced
with subsystem-specific locks (in the cgroup_subsys struct). An additional
global lock was added so the base stats pseudo-subsystem can be synchronized
in a similar way. The per-cpu cgroup_rstat_cpu_lock was changed to a per-cpu
array of locks indexed by subsystem id. Following suit, there is also a
per-cpu array of locks dedicated to the base stats. These dedicated locks
were added because the base stats have a NULL subsystem, so they do not fit
the subsystem-id indexing approach.
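The lock layout described above can be sketched as follows. This is a
userspace approximation with invented names: the kernel uses spinlock/raw
spinlock variants rather than pthread mutexes, and real percpu variables
rather than plain arrays, but the indexing scheme is the point here.

```c
#include <assert.h>
#include <pthread.h>

#define NR_CPUS   2
#define NR_SUBSYS 3

/* One global lock per subsystem replaces the single cgroup_rstat_lock,
 * plus a separate global lock for the base-stat pseudo-subsystem. */
static pthread_mutex_t subsys_rstat_lock[NR_SUBSYS];
static pthread_mutex_t base_rstat_lock = PTHREAD_MUTEX_INITIALIZER;

/* The per-cpu cgroup_rstat_cpu_lock becomes a per-cpu array indexed by
 * subsystem id, with a dedicated per-cpu lock for the base stats since
 * they have no subsystem id to index with. */
static pthread_mutex_t subsys_rstat_cpu_lock[NR_CPUS][NR_SUBSYS];
static pthread_mutex_t base_rstat_cpu_lock[NR_CPUS];

static void rstat_locks_init(void)
{
	for (int ss = 0; ss < NR_SUBSYS; ss++)
		pthread_mutex_init(&subsys_rstat_lock[ss], NULL);
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		pthread_mutex_init(&base_rstat_cpu_lock[cpu], NULL);
		for (int ss = 0; ss < NR_SUBSYS; ss++)
			pthread_mutex_init(&subsys_rstat_cpu_lock[cpu][ss],
					   NULL);
	}
}

/* ssid < 0 models the base stats (NULL subsystem in the kernel). */
static pthread_mutex_t *rstat_cpu_lock(int cpu, int ssid)
{
	return ssid < 0 ? &base_rstat_cpu_lock[cpu]
			: &subsys_rstat_cpu_lock[cpu][ssid];
}
```

Two subsystems updating on the same cpu now take different locks, so an
expensive memory flush no longer serializes against, say, io updates.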

I reached a point where this started to feel stable in my local testing, so I
wanted to share and get feedback on this approach.

[0] https://lore.kernel.org/all/CAOm-9arwY3VLUx5189JAR9J7B=Miad9nQjjet_VNdT3i+J+5FA@mail.gmail.com/
[1] https://github.blog/engineering/debugging-network-stalls-on-kubernetes/

Changelog
v2: updated cover letter and some patch text. no code changes.

JP Kobryn (8):
  change cgroup to css in rstat updated and flush api
  change cgroup to css in rstat internal flush and lock funcs
  change cgroup to css in rstat init and exit api
  split rstat from cgroup into separate css
  separate locking between base css and others
  isolate base stat flush
  remove unneeded rcu list
  remove bpf rstat flush from css generic flush

 block/blk-cgroup.c              |   4 +-
 include/linux/cgroup-defs.h     |  35 ++---
 include/linux/cgroup.h          |   8 +-
 kernel/cgroup/cgroup-internal.h |   4 +-
 kernel/cgroup/cgroup.c          |  79 ++++++-----
 kernel/cgroup/rstat.c           | 225 +++++++++++++++++++-------------
 mm/memcontrol.c                 |   4 +-
 7 files changed, 203 insertions(+), 156 deletions(-)

Comments

Tejun Heo Jan. 3, 2025, 10:08 p.m. UTC | #1
Hello,

On Thu, Jan 02, 2025 at 05:50:11PM -0800, JP Kobryn wrote:
...
> I reached a point where this started to feel stable in my local testing, so I
> wanted to share and get feedback on this approach.

The rationale for using one tree to track all subsystems was that if one
subsys has been active (e.g. memory), it's likely that other subsyses have
been active too (e.g. cpu) and thus we might as well flush the whole thing
together. The approach can be useful for reducing the amount of work done
when e.g. there are a lot of cgroups which are only active periodically, but
it has drawbacks when one subsystem's stats are read a lot more actively than
others', as you pointed out.

Intuitions go only so far and it's difficult to judge whether splitting the
trees would be a good idea without data. Can you please provide some
numbers along with rationales for the test setups?

Thanks.
Yosry Ahmed Jan. 6, 2025, 11:13 p.m. UTC | #2
On Thu, Jan 2, 2025 at 5:50 PM JP Kobryn <inwardvessel@gmail.com> wrote:
>
...
> I reached a point where this started to feel stable in my local testing, so I
> wanted to share and get feedback on this approach.

I remember discussing this with Shakeel and Michal Koutný in LPC two
years ago. I suggested it multiple times over the last few years, most
recently in: https://lore.kernel.org/lkml/CAJD7tkbpFu8z1HaUgkaE6bup_fsD39QLPmgNyOnaTrm+hZ_9hA@mail.gmail.com/.

I think it conceptually makes sense, and I took a stab at it when I
was working on fixing the hard lockups due to atomic flushing, but the
system I was working on was using cgroup v1, so different subsystems
had different hierarchies (and hence different trees) anyway, so it
wouldn't have helped.

This is especially true for the MM subsystem, which apparently flushes
most often and has the most expensive flushes, so other subsystems are
probably being unnecessarily taxed.
