Message ID | 171328990014.3930751.10674097155895405137.stgit@firesoul (mailing list archive)
---|---
State | Not Applicable
Series | cgroup/rstat: global cgroup_rstat_lock changes

Context | Check | Description
---|---|---
netdev/tree_selection | success | Not a local patch
On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
> This patch aims to reduce userspace-triggered pressure on the global
> cgroup_rstat_lock by introducing a mechanism to limit how often reading
> stat files causes cgroup rstat flushing.
>
> In the memory cgroup subsystem, memcg_vmstats_needs_flush() combined with
> mem_cgroup_flush_stats_ratelimited() already limits pressure on the
> global lock (cgroup_rstat_lock). As a result, reading memory-related stat
> files (such as memory.stat, memory.numa_stat, zswap.current) is already
> a less userspace-triggerable issue.
>
> However, other userspace users of cgroup_rstat_flush(), such as when
> reading io.stat (blk-cgroup.c) and cpu.stat, lack a similar system to
> limit pressure on the global lock. Furthermore, userspace can easily
> trigger this issue by reading those stat files.
>
> Typically, normal userspace stats tools (e.g., cadvisor, nomad, systemd)
> spawn threads that read io.stat, cpu.stat, and memory.stat (even from the
> same cgroup) without realizing that on the kernel side, they share the
> same global lock. This limitation also helps prevent malicious userspace
> applications from harming the kernel by reading these stat files in a
> tight loop.
>
> To address this, the patch introduces cgroup_rstat_flush_ratelimited(),
> similar to memcg's mem_cgroup_flush_stats_ratelimited().
>
> Flushing occurs per cgroup (even though the lock remains global); a
> variable named rstat_flush_last_time is introduced to track when a given
> cgroup was last flushed. This variable, which contains the jiffies of the
> flush, shares properties and a cache line with rstat_flush_next and is
> updated simultaneously.
>
> For cpu.stat, we need to acquire the lock (via cgroup_rstat_flush_hold)
> because other data is read under the lock, but we skip the expensive
> flushing if it occurred recently.
>
> Regarding io.stat, there is an opportunity outside the lock to skip the
> flush, but inside the lock, we must recheck to handle races.
>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>

As I mentioned in another thread, I really don't like time-based
rate-limiting [1]. Would it be possible to generalize the
magnitude-based rate-limiting instead? Have something like
memcg_vmstats_needs_flush() in the core rstat code?

Also, why do we keep the memcg time rate-limiting with this patch? Is
it because we use a much larger window there (2s)? Having two layers
of time-based rate-limiting is not ideal imo.

[1] https://lore.kernel.org/lkml/CAJD7tkYnSRwJTpXxSnGgo-i3-OdD7cdT-e3_S_yf7dSknPoRKw@mail.gmail.com/
On 18/04/2024 04.21, Yosry Ahmed wrote:
> On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>>
>> This patch aims to reduce userspace-triggered pressure on the global
>> cgroup_rstat_lock by introducing a mechanism to limit how often reading
>> stat files causes cgroup rstat flushing.
>>
[...]
>
> As I mentioned in another thread, I really don't like time-based
> rate-limiting [1]. Would it be possible to generalize the
> magnitude-based rate-limiting instead? Have something like
> memcg_vmstats_needs_flush() in the core rstat code?
>

I've taken a closer look at memcg_vmstats_needs_flush(), and I'm
concerned about the overhead of maintaining the stats that it uses as a
filter:

 static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
 {
	return atomic64_read(&vmstats->stats_updates) >
		MEMCG_CHARGE_BATCH * num_online_cpus();
 }

I looked at `vmstats->stats_updates` to see how often it is getting
updated. It is updated in memcg_rstat_updated(), which gets inlined into
a number of functions (__mod_memcg_state, __mod_memcg_lruvec_state,
__count_memcg_events), each of which also calls cgroup_rstat_updated().
Counting invocations per second (via funccount):

 10:28:09
 FUNC                        COUNT
 __mod_memcg_state          377553
 __count_memcg_events       393078
 __mod_memcg_lruvec_state  1229673
 cgroup_rstat_updated      2632389

I'm surprised to see how many times per second these are getting
invoked: 2,000,304 calls/sec originate from memcg_rstat_updated().
(On a 128 CPU core machine with 39% idle CPU-load.)
Maintaining these stats seems excessive...

Then how often does the filter lower pressure on the lock?

 MEMCG_CHARGE_BATCH(64) * 128 CPU = 8192
 2000304/(64*128) = 244 times per sec (every ~4ms)
 (assuming memcg_rstat_updated val=1)

> Also, why do we keep the memcg time rate-limiting with this patch? Is
> it because we use a much larger window there (2s)? Having two layers
> of time-based rate-limiting is not ideal imo.
>

I'm also not a fan of having two layers of time-based rate-limiting, but
they do operate at different time scales *and* are not active at the
same time with the current patch. If you look at the details, I excluded
memcg from using this, as I commented "memcg have own ratelimit layer"
(in do_flush_stats).

I would prefer removing the memcg time rate-limiting and using this more
granular 50ms (20 flushes/sec) limit for memcg also, and I was planning
to do that in a followup patchset. The 50ms limit is per cgroup in the
system, so it "scales"/increases with the number of cgroups, but that is
still better than an unbounded number of lock acquisitions per second.

--Jesper

> [1] https://lore.kernel.org/lkml/CAJD7tkYnSRwJTpXxSnGgo-i3-OdD7cdT-e3_S_yf7dSknPoRKw@mail.gmail.com/

Command used:
 sudo ./bcc/tools/funccount -Ti 1 -d 10 \
   '__mod_memcg_state|__mod_memcg_lruvec_state|__count_memcg_events|cgroup_rstat_updated'
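[Editor's note: the filter arithmetic quoted above can be sanity-checked directly. A quick sketch follows; the constants come from the thread (MEMCG_CHARGE_BATCH=64, 128 online CPUs, ~2.0M updates/sec), and the assumption that every update contributes val=1 is Jesper's, not verified here.]

```python
# Back-of-envelope check of the numbers in the discussion above.
# Assumptions (from the thread, not measured here): MEMCG_CHARGE_BATCH=64,
# 128 online CPUs, ~2.0M memcg stat updates/sec, each counting as val=1.

MEMCG_CHARGE_BATCH = 64
num_online_cpus = 128
updates_per_sec = 2_000_304            # observed memcg_rstat_updated() calls/sec

# memcg_vmstats_needs_flush() allows a flush once pending updates exceed this:
threshold = MEMCG_CHARGE_BATCH * num_online_cpus
flushes_per_sec = updates_per_sec / threshold

print(threshold)                # 8192
print(round(flushes_per_sec))   # 244, i.e. a flush opportunity every ~4 ms
```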
On Thu, Apr 18, 2024 at 01:00:30PM +0200, Jesper Dangaard Brouer wrote:
[...]
> Counting invocations per sec (via funccount):
>
>  10:28:09
>  FUNC                        COUNT
>  __mod_memcg_state          377553
>  __count_memcg_events       393078
>  __mod_memcg_lruvec_state  1229673
>  cgroup_rstat_updated      2632389
>

Is it possible for you to also measure the frequency of the unique
callstacks calling these functions? In addition, the frequency of each
stat item update would be awesome.

> I'm surprised to see how many time per sec this is getting invoked.
> Originating from memcg_rstat_updated() = 2,000,304 times per sec.
> (On a 128 CPU core machine with 39% idle CPU-load.)
> Maintaining these stats seems excessive...
>
> Then how often does the filter lower pressure on lock:
>
>  MEMCG_CHARGE_BATCH(64) * 128 CPU = 8192
>  2000304/(64*128) = 244 times per sec (every ~4ms)
>  (assuming memcg_rstat_updated val=1)
>

It seems like we have opportunities to improve the stat update side, and
we definitely need to improve the stat flush side. One issue from the
memcg side is that the kernel has to do a lot of work, so we should be
reducing that.
On Thu, Apr 18, 2024 at 4:00 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
[...]
> I'm surprised to see how many time per sec this is getting invoked.
> Originating from memcg_rstat_updated() = 2,000,304 times per sec.
> (On a 128 CPU core machine with 39% idle CPU-load.)
> Maintaining these stats seems excessive...

Well, the number of calls to memcg_rstat_updated() is not affected by
maintaining stats_updates, and this only adds a few percpu updates in
the common path. I did not see any regressions (after all optimizations)
in any benchmarks with this, including will-it-scale and netperf.

> Then how often does the filter lower pressure on lock:
>
>  MEMCG_CHARGE_BATCH(64) * 128 CPU = 8192
>  2000304/(64*128) = 244 times per sec (every ~4ms)
>  (assuming memcg_rstat_updated val=1)

This does not tell the whole story though, because:

1. The threshold (8192 in this case) is per-memcg. I am assuming
2,000,304 is the number of calls per second for the entire system. In
this case, the filtering should be more effective.

2. This assumes that updates and flushes are uniform; I am not sure
this applies in practice.

3. In my experiments, this thresholding drastically improved userspace
read latency under heavy contention (100s or 1000s of readers),
especially the tail latencies.

Generally, I think magnitude-based thresholding is better than
time-based, especially in larger systems where a lot can change in a
short amount of time. I did not observe any regressions from this
scheme, and I observed very noticeable improvements in flushing latency.

Taking a step back, I think this series is trying to address two issues
in one go: interrupt handling latency and lock contention. While both
are related, because reducing flushing reduces irq disablement, I think
it would be better if we can fix that issue separately with a more
fundamental solution (e.g. using a mutex or dropping the lock at each
CPU boundary).

After that, we can more clearly evaluate the lock contention problem
with data purely about flushing latency, without taking into
consideration the irq handling problem.

Does this make sense to you?

> > Also, why do we keep the memcg time rate-limiting with this patch? Is
> > it because we use a much larger window there (2s)? Having two layers
> > of time-based rate-limiting is not ideal imo.
>
> I'm also not-a-fan of having two layer of time-based rate-limiting, but
> they do operate a different time scales *and* are not active at the same
> time with current patch, if you noticed the details, then I excluded
> memcg from using this as I commented "memcg have own ratelimit layer"
> (in do_flush_stats).

Right, I meant generally having two schemes doing very similar things,
even if they are not active at the same time.

I think this is an artifact of different subsystems sharing the same
rstat tree for no specific reason. I think almost all flushing calls
really need the stats from one subsystem after all.

If we have separate trees, lock contention gets slightly better as
different subsystems do not compete. We can also have different
subsystems "customize" their trees, e.g. by setting different
time-based or magnitude-based rate-limiting thresholds.

I know this is a bigger lift, just thinking out loud :)

> I would prefer removing the memcg time rate-limiting and use this more
> granular 50ms (20 timer/sec) for memcg also. And I was planning to do
> that in a followup patchset. [...]
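[Editor's note: the two gating policies debated above can be made concrete with a toy userspace model. This is illustrative only, not kernel code; the constants mirror FLUSH_TIME (50ms) from the patch and MEMCG_CHARGE_BATCH * num_online_cpus() from the thread.]

```python
# Toy model contrasting the two flush-gating policies discussed above:
# - time-based: flush at most once per PERIOD, regardless of how much changed
# - magnitude-based: flush only when enough updates are pending, however
#   recently the last flush happened

PERIOD = 50          # ms, like FLUSH_TIME = msecs_to_jiffies(50) in the patch
THRESHOLD = 8192     # like MEMCG_CHARGE_BATCH * num_online_cpus()

def time_based(read_times_ms):
    """Return the reads that actually flush under time-based gating."""
    last, flushed = -PERIOD, []
    for t in read_times_ms:
        if t - last >= PERIOD:
            flushed.append(t)
            last = t
    return flushed

def magnitude_based(reads):
    """reads = [(time_ms, pending_updates_since_last_flush)]."""
    return [t for t, pending in reads if pending > THRESHOLD]

# 1000 readers hammering a stat file in the same millisecond: time-based
# gating collapses them to a single flush.
stampede = [0] * 1000
print(len(time_based(stampede)))       # 1

# A mostly idle cgroup read every 50ms: time-based gating still flushes on
# every read, while magnitude-based gating skips all of them.
quiet = [(t, 10) for t in range(0, 500, 50)]
print(len(magnitude_based(quiet)))     # 0
```

This captures Yosry's point: a time window bounds flush *frequency* but still flushes when nothing meaningful changed, whereas a magnitude threshold tracks how much actually needs flushing.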
Hello, Yosry.

On Thu, Apr 18, 2024 at 02:00:28PM -0700, Yosry Ahmed wrote:
...
> I think this is an artifact of different subsystems sharing the same
> rstat tree for no specific reason. I think almost all flushing calls
> really need the stats from one subsystem after all.
>
> If we have separate trees, lock contention gets slightly better as
> different subsystems do not compete. We can also have different
> subsystems "customize" their trees, e.g. by setting different
> time-based or magnitude-based rate-limiting thresholds.
>
> I know this is a bigger lift, just thinking out loud :)

I have no objection to separating out rstat trees so that it has
per-controller tracking. However, the high frequency sources of updates
are cpu and memory, which tend to fire together, and the only really
high frequency consumer seems to be memory, so I'm not too sure how much
benefit separating the trees out would bring.

Thanks.
On Thu, Apr 18, 2024 at 2:15 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Yosry.
>
[...]
> I have no objection to separating out rstat trees so that it has
> per-controller tracking. However, the high frequency sources of updates
> are cpu and memory, which tend to fire together, and the only really
> high frequency consumer seems to be memory, so I'm not too sure how much
> benefit separating the trees out would bring.

Well, we could split the global lock into multiple ones, which isn't
really a solution, but it would help other controllers not to be
affected by the high frequency of flushing from the memory controller
(which has its own thresholding). For updates, cpu and memory would use
separate percpu locks as well, which may help slightly.

Outside of this, I think it helps us add controller-specific
optimizations. For example, I tried to generalize the thresholding code
in the memory controller and put it in the rstat code, but I couldn't
really have a single value representing the "pending stats" from all
controllers. It's impossible to compare memory stats (mostly in pages or
bytes) to cpu time stats, for instance.

Similarly, with this proposal from Jesper (which I am not saying I am
agreeing with :P), instead of having time-based ratelimiting in both the
rstat code and the memcg code to support different thresholds, we could
have the memory controller set a different threshold for itself.

So perhaps the lock breakdowns are not enough motivation, but if we
start generalizing optimizations in some controllers, we may want to
split the tree for flexibility.
Hello,

On Thu, Apr 18, 2024 at 02:22:58PM -0700, Yosry Ahmed wrote:
> Outside of this, I think it helps us add controller-specific
> optimizations. For example, I tried to generalize the thresholding
> code in the memory controller and put it in the rstat code, but I
> couldn't really have a single value representing the "pending stats"
> from all controllers. It's impossible to compare memory stats (mostly
> in pages or bytes) to cpu time stats for instance.
>
[...]
> So perhaps the lock breakdowns are not enough motivation, but if we
> start generalizing optimizations in some controllers, we may want to
> split the tree for flexibility.

I see. Yeah, that makes more sense to me.

Thanks.
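[Editor's note: Yosry's "customize their trees" idea can be pictured roughly as a per-controller gating policy. The sketch below is hypothetical; none of these names exist in the kernel, and the policy values are placeholders drawn from the thread.]

```python
# Hypothetical sketch of per-controller rstat flush policies: memory keeps
# its magnitude-based gate, while io and cpu use a time-based gate.
# All names and values here are illustrative, not kernel API.

from dataclasses import dataclass
from typing import Optional

@dataclass
class RstatTreePolicy:
    name: str
    period_ms: Optional[int] = None    # time-based gate; None = disabled
    magnitude: Optional[int] = None    # pending-updates gate; None = disabled

policies = {
    "memory": RstatTreePolicy("memory", magnitude=8192),   # like memcg today
    "io":     RstatTreePolicy("io", period_ms=50),         # like FLUSH_TIME
    "cpu":    RstatTreePolicy("cpu", period_ms=50),
}

def should_flush(p: RstatTreePolicy, ms_since_flush: int, pending: int) -> bool:
    """A flush proceeds if any enabled gate for this controller allows it."""
    if p.magnitude is not None and pending > p.magnitude:
        return True
    if p.period_ms is not None and ms_since_flush >= p.period_ms:
        return True
    return False

print(should_flush(policies["memory"], 1, 10_000))   # True: large backlog
print(should_flush(policies["io"], 10, 10_000))      # False: flushed 10ms ago
```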
On 18/04/2024 23.00, Yosry Ahmed wrote:
> On Thu, Apr 18, 2024 at 4:00 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>> On 18/04/2024 04.21, Yosry Ahmed wrote:
>>> On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>>>> This patch aims to reduce userspace-triggered pressure on the global
>>>> cgroup_rstat_lock by introducing a mechanism to limit how often reading
>>>> stat files causes cgroup rstat flushing.
>>>>
[...]

> Taking a step back, I think this series is trying to address two
> issues in one go: interrupt handling latency and lock contention.

Yes, patches 2 and 3 are essentially independent and address two
different aspects.

> While both are related because reducing flushing reduces irq
> disablement, I think it would be better if we can fix that issue
> separately with a more fundamental solution (e.g. using a mutex or
> dropping the lock at each CPU boundary).
>
> After that, we can more clearly evaluate the lock contention problem
> with data purely about flushing latency, without taking into
> consideration the irq handling problem.
>
> Does this make sense to you?

Yes, that makes sense.

So, you are suggesting we start with the mutex change? (patch 2)
(Which still needs some adjustments/tuning.)

This makes sense to me, as there are likely many solutions to reducing
the pressure on the lock.

With the tracepoint patch in place, I/we can measure the pressure on the
lock, and I plan to do this across our CF fleet. Then we can slowly work
on improving lock contention and evaluate this on our fleets.

--Jesper

p.s. Setting expectations:
 - Going on vacation today, so will resume work after 29th April.
On Fri, Apr 19, 2024 at 3:17 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
[...]
>
> So, you are suggesting we start with the mutex change? (patch 2)
> (which still needs some adjustments/tuning)

Yes. Let's focus on (what I assume to be) the more important problem,
IRQ serving latency. Once this is fixed, let's consider the tradeoffs
for improving lock contention separately.

Thanks!

> This make sense to me, as there are likely many solutions to reducing
> the pressure on the lock.
>
> With the tracepoint patch in-place, I/we can measure the pressure on the
> lock, and I plan to do this across our CF fleet. Then we can slowly
> work on improving lock contention and evaluate this on our fleets.
>
> --Jesper
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index bdbb557feb5a..4156fedbb910 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1162,7 +1162,7 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
 	if (!seq_css(sf)->parent)
 		blkcg_fill_root_iostats();
 	else
-		cgroup_rstat_flush(blkcg->css.cgroup);
+		cgroup_rstat_flush_ratelimited(blkcg->css.cgroup);
 
 	rcu_read_lock();
 	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ea48c861cd36..366dc644e075 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -509,6 +509,7 @@ struct cgroup {
 	 * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
 	 */
 	struct cgroup *rstat_flush_next;
+	unsigned long rstat_flush_last_time;
 
 	/* cgroup basic resource statistics */
 	struct cgroup_base_stat last_bstat;
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 2150ca60394b..8622b222453e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -689,7 +689,8 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
  */
 void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
 void cgroup_rstat_flush(struct cgroup *cgrp);
-void cgroup_rstat_flush_hold(struct cgroup *cgrp);
+int cgroup_rstat_flush_hold(struct cgroup *cgrp);
+int cgroup_rstat_flush_ratelimited(struct cgroup *cgrp);
 void cgroup_rstat_flush_release(struct cgroup *cgrp);
 
 /*
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index a90d68a7c27f..8c71af67b197 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -193,6 +193,7 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
 	/* Push @root to the list first before pushing the children */
 	head = root;
 	root->rstat_flush_next = NULL;
+	root->rstat_flush_last_time = jiffies;
 	child = rstatc->updated_children;
 	rstatc->updated_children = root;
 	if (child != root)
@@ -261,12 +262,15 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 
 	lockdep_assert_held(&cgroup_rstat_lock);
 
+	cgrp->rstat_flush_last_time = jiffies;
+
 	for_each_possible_cpu(cpu) {
 		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
 
 		for (; pos; pos = pos->rstat_flush_next) {
 			struct cgroup_subsys_state *css;
 
+			pos->rstat_flush_last_time = jiffies;
 			cgroup_base_stat_flush(pos, cpu);
 			bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
 
@@ -309,6 +313,49 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
 	__cgroup_rstat_unlock(cgrp, -1);
 }
 
+#define FLUSH_TIME	msecs_to_jiffies(50)
+
+/**
+ * cgroup_rstat_flush_ratelimited - limit pressure on global lock (cgroup_rstat_lock)
+ * @cgrp: target cgroup
+ *
+ * Trade accuracy for less pressure on global lock. Only use this when caller
+ * don't need 100% up-to-date stats.
+ *
+ * Userspace stats tools spawn threads reading io.stat, cpu.stat and memory.stat
+ * from same cgroup, but kernel side they share same global lock. This also
+ * limit malicious userspace from hurting kernel by reading these stat files in
+ * a tight loop.
+ *
+ * This function exit early if flush recently happened.
+ *
+ * Returns 1 if flush happened
+ */
+int cgroup_rstat_flush_ratelimited(struct cgroup *cgrp)
+{
+	int res = 0;
+
+	might_sleep();
+
+	/* Outside lock: check if this flush and lock can be skipped */
+	if (time_before(jiffies,
+			READ_ONCE(cgrp->rstat_flush_last_time) + FLUSH_TIME))
+		return 0;
+
+	__cgroup_rstat_lock(cgrp, -1);
+
+	/* Recheck inside lock, do flush after enough time have passed */
+	if (time_after(jiffies,
+		       cgrp->rstat_flush_last_time + FLUSH_TIME)) {
+		cgroup_rstat_flush_locked(cgrp);
+		res = 1;
+	}
+
+	__cgroup_rstat_unlock(cgrp, -1);
+
+	return res;
+}
+
 /**
  * cgroup_rstat_flush_hold - flush stats in @cgrp's subtree and hold
  * @cgrp: target cgroup
@@ -317,13 +364,22 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
  * paired with cgroup_rstat_flush_release().
  *
  * This function may block.
+ *
+ * Returns 1 if flush happened
  */
-void cgroup_rstat_flush_hold(struct cgroup *cgrp)
+int cgroup_rstat_flush_hold(struct cgroup *cgrp)
 	__acquires(&cgroup_rstat_lock)
 {
 	might_sleep();
 	__cgroup_rstat_lock(cgrp, -1);
-	cgroup_rstat_flush_locked(cgrp);
+
+	/* Only do expensive flush after enough time have passed */
+	if (time_after(jiffies,
+		       cgrp->rstat_flush_last_time + FLUSH_TIME)) {
+		cgroup_rstat_flush_locked(cgrp);
+		return 1;
+	}
+	return 0;
 }
 
 /**
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fabce2b50c69..771261612ec1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -742,6 +742,7 @@ static void do_flush_stats(struct mem_cgroup *memcg)
 	if (mem_cgroup_is_root(memcg))
 		WRITE_ONCE(flush_last_time, jiffies_64);
 
+	/* memcg have own ratelimit layer */
 	cgroup_rstat_flush(memcg->css.cgroup);
 }
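[Editor's note: the double-checked pattern in cgroup_rstat_flush_ratelimited() above, a cheap check outside the lock followed by a recheck under it, can be illustrated with a small userspace sketch. Python's monotonic clock and threading.Lock stand in for the kernel's jiffies/time_after and cgroup_rstat_lock; this is purely illustrative, not kernel code.]

```python
# Userspace sketch of the flush-ratelimit pattern from the patch: skip both
# the lock and the flush if a flush happened recently, and recheck under the
# lock because another reader may have flushed while we waited for it.

import threading
import time

FLUSH_PERIOD = 0.050   # seconds, like FLUSH_TIME = msecs_to_jiffies(50)

class RatelimitedFlusher:
    def __init__(self):
        self.lock = threading.Lock()       # stands in for cgroup_rstat_lock
        self.last_flush = 0.0              # like cgrp->rstat_flush_last_time
        self.flush_count = 0

    def flush_ratelimited(self):
        now = time.monotonic()
        # Outside lock: cheap check, skip flush AND lock if flushed recently.
        if now - self.last_flush < FLUSH_PERIOD:
            return 0
        with self.lock:
            # Recheck inside lock: someone may have flushed meanwhile.
            if time.monotonic() - self.last_flush < FLUSH_PERIOD:
                return 0
            self.last_flush = time.monotonic()
            self.flush_count += 1          # the expensive flush goes here
            return 1

f = RatelimitedFlusher()
results = [f.flush_ratelimited() for _ in range(1000)]  # tight reader loop
print(sum(results))   # 1 -> only one flush despite 1000 back-to-back reads
```

The recheck under the lock is what handles the race the commit message mentions for io.stat: many readers can pass the outside check before any of them updates last_flush, but only the first one through the lock actually flushes.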
This patch aims to reduce userspace-triggered pressure on the global
cgroup_rstat_lock by introducing a mechanism to limit how often reading
stat files causes cgroup rstat flushing.

In the memory cgroup subsystem, memcg_vmstats_needs_flush() combined with
mem_cgroup_flush_stats_ratelimited() already limits pressure on the
global lock (cgroup_rstat_lock). As a result, reading memory-related stat
files (such as memory.stat, memory.numa_stat, zswap.current) is already
a less userspace-triggerable issue.

However, other userspace users of cgroup_rstat_flush(), such as when
reading io.stat (blk-cgroup.c) and cpu.stat, lack a similar system to
limit pressure on the global lock. Furthermore, userspace can easily
trigger this issue by reading those stat files.

Typically, normal userspace stats tools (e.g., cadvisor, nomad, systemd)
spawn threads that read io.stat, cpu.stat, and memory.stat (even from the
same cgroup) without realizing that on the kernel side, they share the
same global lock. This limitation also helps prevent malicious userspace
applications from harming the kernel by reading these stat files in a
tight loop.

To address this, the patch introduces cgroup_rstat_flush_ratelimited(),
similar to memcg's mem_cgroup_flush_stats_ratelimited().

Flushing occurs per cgroup (even though the lock remains global); a
variable named rstat_flush_last_time is introduced to track when a given
cgroup was last flushed. This variable, which contains the jiffies of the
flush, shares properties and a cache line with rstat_flush_next and is
updated simultaneously.

For cpu.stat, we need to acquire the lock (via cgroup_rstat_flush_hold)
because other data is read under the lock, but we skip the expensive
flushing if it occurred recently.

Regarding io.stat, there is an opportunity outside the lock to skip the
flush, but inside the lock, we must recheck to handle races.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 block/blk-cgroup.c          |  2 +
 include/linux/cgroup-defs.h |  1 +
 include/linux/cgroup.h      |  3 +-
 kernel/cgroup/rstat.c       | 60 ++++++++++++++++++++++++++++++++++++++++++-
 mm/memcontrol.c             |  1 +
 5 files changed, 63 insertions(+), 4 deletions(-)