Message ID | 20250227215543.49928-1-inwardvessel@gmail.com (mailing list archive) |
---|---|
Series | cgroup: separate rstat trees |
On Thu, Feb 27, 2025 at 01:55:39PM -0800, inwardvessel wrote:
> From: JP Kobryn <inwardvessel@gmail.com>
>
> The current design of rstat takes the approach that if one subsystem is
> to be flushed, all other subsystems with pending updates should also be
> flushed. It seems that over time, the stat-keeping of some subsystems
> has grown in size to the extent that they are noticeably slowing down
> others. This has been most observable in situations where the memory
> controller is enabled. One big area where the issue comes up is system
> telemetry, where programs periodically sample cpu stats. It would be a
> benefit for programs like this if the overhead of having to flush memory
> stats (and others) could be eliminated. It would save cpu cycles for
> existing cpu-based telemetry programs and improve scalability in terms
> of sampling frequency and volume of hosts.
>
> This series changes the approach of "flush all subsystems" to "flush
> only the requested subsystem". The core design change is moving from a
> single unified rstat tree of cgroups to having separate trees made up of
> cgroup_subsys_state's. There will be one (per-cpu) tree for the base
> stats (cgroup::self) and one for each enabled subsystem (if it
> implements css_rstat_flush()). In order to do this, the rstat list
> pointers were moved off of the cgroup and onto the css. In the
> transition, these list pointer types were changed to
> cgroup_subsys_state. This allows for rstat trees to now be made up of
> css nodes, where a given tree will only contain css nodes associated
> with a specific subsystem. The rstat APIs were changed to accept a
> reference to a cgroup_subsys_state instead of a cgroup. This allows for
> callers to be specific about which stats are being updated/flushed.
> Since separate trees will be in use, the locking scheme was adjusted.
> The global locks were split up in such a way that there are separate
> locks for the base stats (cgroup::self) and each subsystem (memory, io,
> etc). This allows different subsystems (including base stats) to use
> rstat in parallel with no contention.
>
> Breaking up the unified tree into separate trees eliminates the overhead
> and scalability issue explained in the first section, but comes at the
> expense of using additional memory. In an effort to minimize this
> overhead, a conditional allocation is performed. The cgroup_rstat_cpu
> originally contained the rstat list pointers and the base stat entities.
> This struct was renamed to cgroup_rstat_base_cpu and is only allocated
> when the associated css is cgroup::self. A new compact struct was added
> that only contains the rstat list pointers. When the css is associated
> with an actual subsystem, this compact struct is allocated. With this
> conditional allocation, the change in memory overhead on a per-cpu basis
> before/after is shown below.
>
> before:
>     sizeof(struct cgroup_rstat_cpu) =~ 176 bytes /* can vary based on config */
>
>     nr_cgroups * sizeof(struct cgroup_rstat_cpu)
>     nr_cgroups * 176 bytes
>
> after:
>     sizeof(struct cgroup_rstat_cpu) == 16 bytes
>     sizeof(struct cgroup_rstat_base_cpu) =~ 176 bytes
>
>     nr_cgroups * (
>         sizeof(struct cgroup_rstat_base_cpu) +
>             sizeof(struct cgroup_rstat_cpu) * nr_rstat_controllers
>         )
>
>     nr_cgroups * (176 + 16 * nr_rstat_controllers)
>
> ... where nr_rstat_controllers is the number of enabled cgroup
> controllers that implement css_rstat_flush(). On a host where both
> memory and io are enabled:
>
>     nr_cgroups * (176 + 16 * 2)
>     nr_cgroups * 208 bytes
>
> With regard to validation, there is a measurable benefit when reading
> stats with this series. A test program was made to loop 1M times while
> reading all four of the files cgroup.stat, cpu.stat, io.stat,
> memory.stat of a given parent cgroup each iteration. This test program
> has been run in the experiments that follow.
>
> The first experiment consisted of a parent cgroup with memory.swap.max=0
> and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created
> and within each child cgroup a process was spawned to frequently update
> the memory cgroup stats by creating and then reading a file of size 1T
> (encouraging reclaim). The test program was run alongside these 26 tasks
> in parallel. The results showed a benefit in both time elapsed and perf
> data of the test program.
>
> time before:
>     real    0m44.612s
>     user    0m0.567s
>     sys     0m43.887s
>
> perf before:
>     27.02% mem_cgroup_css_rstat_flush
>      6.35% __blkcg_rstat_flush
>      0.06% cgroup_base_stat_cputime_show
>
> time after:
>     real    0m27.125s
>     user    0m0.544s
>     sys     0m26.491s
>
> perf after:
>     6.03% mem_cgroup_css_rstat_flush
>     0.37% blkcg_print_stat
>     0.11% cgroup_base_stat_cputime_show
>
> Another experiment was set up on the same host using a parent cgroup with
> two child cgroups. The same swap and memory max were used as the
> previous experiment. In the two child cgroups, kernel builds were done
> in parallel, each using "-j 20". The perf comparison of the test program
> was very similar to the values in the previous experiment. The time
> comparison is shown below.
>
> before:
>     real    1m2.077s
>     user    0m0.784s
>     sys     1m0.895s
>
> after:
>     real    0m32.216s
>     user    0m0.709s
>     sys     0m31.256s

Great results, and I am glad that the series went down from 11 patches
to 4 once we simplified the BPF handling. The added memory overhead
doesn't seem to be concerning (~320KB on a system with 100 cgroups and
100 CPUs).

Nice work.
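The ~320KB figure follows from the cover letter's formula: with memory and io enabled, each cgroup gains 2 * 16 = 32 bytes per CPU. A minimal sketch of that arithmetic, assuming the struct sizes quoted above (176 and 16 bytes, which can vary with kernel config); the helper names are hypothetical and not part of the series:

```c
#include <assert.h>
#include <stddef.h>

/* Sizes quoted in the cover letter; the base size "can vary based on config". */
enum {
	RSTAT_BASE_CPU_BYTES = 176,	/* ~sizeof(struct cgroup_rstat_base_cpu) */
	RSTAT_CPU_BYTES = 16,		/* sizeof(struct cgroup_rstat_cpu) after the series */
};

/* Per-cgroup, per-cpu footprint before the series. */
size_t rstat_bytes_before(void)
{
	return RSTAT_BASE_CPU_BYTES;
}

/* Per-cgroup, per-cpu footprint after the series. */
size_t rstat_bytes_after(size_t nr_rstat_controllers)
{
	return RSTAT_BASE_CPU_BYTES + RSTAT_CPU_BYTES * nr_rstat_controllers;
}

/* Total added memory on a host (hypothetical helper for illustration). */
size_t rstat_added_bytes(size_t nr_cgroups, size_t nr_cpus,
			 size_t nr_rstat_controllers)
{
	size_t before = rstat_bytes_before();
	size_t after = rstat_bytes_after(nr_rstat_controllers);

	assert(after >= before);
	return nr_cgroups * nr_cpus * (after - before);
}
```

With 100 cgroups, 100 CPUs and two rstat controllers, rstat_added_bytes() yields 320000 bytes, which matches the ~320KB estimate.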
Hello JP.

On Thu, Feb 27, 2025 at 01:55:39PM -0800, inwardvessel <inwardvessel@gmail.com> wrote:
> From: JP Kobryn <inwardvessel@gmail.com>
>
> The current design of rstat takes the approach that if one subsystem is
> to be flushed, all other subsystems with pending updates should also be
> flushed. It seems that over time, the stat-keeping of some subsystems
> has grown in size to the extent that they are noticeably slowing down
> others. This has been most observable in situations where the memory
> controller is enabled. One big area where the issue comes up is system
> telemetry, where programs periodically sample cpu stats. It would be a
> benefit for programs like this if the overhead of having to flush memory
> stats (and others) could be eliminated. It would save cpu cycles for
> existing cpu-based telemetry programs and improve scalability in terms
> of sampling frequency and volume of hosts.

> This series changes the approach of "flush all subsystems" to "flush
> only the requested subsystem".
...

> before:
>     sizeof(struct cgroup_rstat_cpu) =~ 176 bytes /* can vary based on config */
>
>     nr_cgroups * sizeof(struct cgroup_rstat_cpu)
>     nr_cgroups * 176 bytes
>
> after:
...
>     nr_cgroups * (176 + 16 * 2)
>     nr_cgroups * 208 bytes

~ 32B/cgroup/cpu

> With regard to validation, there is a measurable benefit when reading
> stats with this series. A test program was made to loop 1M times while
> reading all four of the files cgroup.stat, cpu.stat, io.stat,
> memory.stat of a given parent cgroup each iteration. This test program
> has been run in the experiments that follow.

Thanks for looking into this and running experiments on the behavior of
split rstat trees.

> The first experiment consisted of a parent cgroup with memory.swap.max=0
> and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created
> and within each child cgroup a process was spawned to frequently update
> the memory cgroup stats by creating and then reading a file of size 1T
> (encouraging reclaim). The test program was run alongside these 26 tasks
> in parallel. The results showed a benefit in both time elapsed and perf
> data of the test program.
>
> time before:
>     real    0m44.612s
>     user    0m0.567s
>     sys     0m43.887s
>
> perf before:
>     27.02% mem_cgroup_css_rstat_flush
>      6.35% __blkcg_rstat_flush
>      0.06% cgroup_base_stat_cputime_show
>
> time after:
>     real    0m27.125s
>     user    0m0.544s
>     sys     0m26.491s

So this shows that flushing rstat trees one by one (as the test program
reads *.stat) is quicker than flushing all at once (+idle reads of
*.stat).
Interesting, I'd not bet on that at first but that is convincing to
favor the separate trees approach.

> perf after:
>     6.03% mem_cgroup_css_rstat_flush
>     0.37% blkcg_print_stat
>     0.11% cgroup_base_stat_cputime_show

I'd understand why the series reduces time spent in
mem_cgroup_flush_stats() but what does the lower proportion of
mem_cgroup_css_rstat_flush() show?

> Another experiment was set up on the same host using a parent cgroup with
> two child cgroups. The same swap and memory max were used as the
> previous experiment. In the two child cgroups, kernel builds were done
> in parallel, each using "-j 20". The perf comparison of the test program
> was very similar to the values in the previous experiment. The time
> comparison is shown below.
>
> before:
>     real    1m2.077s
>     user    0m0.784s
>     sys     1m0.895s

This is 1M loops of stats reading program like before? I.e. if this
should be analogous to 0m44.612s above why isn't it the same? (I'm
thinking of more frequent updates in the latter test.)

> after:
>     real    0m32.216s
>     user    0m0.709s
>     sys     0m31.256s

What was the impact on the kernel build workloads (cgroup_rstat_updated)?

(Perhaps the saved 30s of CPU work (if potentially moved from readers to
writers) would be spread too thin in all of two 20-parallel kernel
builds, right?)

...
> For the final experiment, perf events were recorded during a kernel
> build with the same host and cgroup setup. The builds took place in the
> child node. Control and experimental sides both showed similar cycles
> spent on cgroup_rstat_updated(), which appeared insignificant among
> the events recorded with the workload.

What's the change between control vs experiment? Running in root cg vs
nested? Or running without *.stat readers vs with them against the
kernel build?
(This clarification would likely answer my question above.)

Michal
On 3/3/25 7:19 AM, Michal Koutný wrote:
> Hello JP.
>
> On Thu, Feb 27, 2025 at 01:55:39PM -0800, inwardvessel <inwardvessel@gmail.com> wrote:
>> From: JP Kobryn <inwardvessel@gmail.com>
>>
>> The current design of rstat takes the approach that if one subsystem is
>> to be flushed, all other subsystems with pending updates should also be
>> flushed. It seems that over time, the stat-keeping of some subsystems
>> has grown in size to the extent that they are noticeably slowing down
>> others. This has been most observable in situations where the memory
>> controller is enabled. One big area where the issue comes up is system
>> telemetry, where programs periodically sample cpu stats. It would be a
>> benefit for programs like this if the overhead of having to flush memory
>> stats (and others) could be eliminated. It would save cpu cycles for
>> existing cpu-based telemetry programs and improve scalability in terms
>> of sampling frequency and volume of hosts.
>
>> This series changes the approach of "flush all subsystems" to "flush
>> only the requested subsystem".
> ...
>
>> before:
>>     sizeof(struct cgroup_rstat_cpu) =~ 176 bytes /* can vary based on config */
>>
>>     nr_cgroups * sizeof(struct cgroup_rstat_cpu)
>>     nr_cgroups * 176 bytes
>>
>> after:
> ...
>>     nr_cgroups * (176 + 16 * 2)
>>     nr_cgroups * 208 bytes
>
> ~ 32B/cgroup/cpu

Thanks. I'll make this clear in the cover letter next rev.

>
>> With regard to validation, there is a measurable benefit when reading
>> stats with this series. A test program was made to loop 1M times while
>> reading all four of the files cgroup.stat, cpu.stat, io.stat,
>> memory.stat of a given parent cgroup each iteration. This test program
>> has been run in the experiments that follow.
>
> Thanks for looking into this and running experiments on the behavior of
> split rstat trees.

And thank you for reviewing along with the good questions.

>
>> The first experiment consisted of a parent cgroup with memory.swap.max=0
>> and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created
>> and within each child cgroup a process was spawned to frequently update
>> the memory cgroup stats by creating and then reading a file of size 1T
>> (encouraging reclaim). The test program was run alongside these 26 tasks
>> in parallel. The results showed a benefit in both time elapsed and perf
>> data of the test program.
>>
>> time before:
>>     real    0m44.612s
>>     user    0m0.567s
>>     sys     0m43.887s
>>
>> perf before:
>>     27.02% mem_cgroup_css_rstat_flush
>>      6.35% __blkcg_rstat_flush
>>      0.06% cgroup_base_stat_cputime_show
>>
>> time after:
>>     real    0m27.125s
>>     user    0m0.544s
>>     sys     0m26.491s
>
> So this shows that flushing rstat trees one by one (as the test program
> reads *.stat) is quicker than flushing all at once (+idle reads of
> *.stat).
> Interesting, I'd not bet on that at first but that is convincing to
> favor the separate trees approach.
>
>> perf after:
>>     6.03% mem_cgroup_css_rstat_flush
>>     0.37% blkcg_print_stat
>>     0.11% cgroup_base_stat_cputime_show
>
> I'd understand why the series reduces time spent in
> mem_cgroup_flush_stats() but what does the lower proportion of
> mem_cgroup_css_rstat_flush() show?

When the entry point for flushing is reading the file memory.stat,
memory_stat_show() is called which leads to __mem_cgroup_flush_stats(). In
this function, there is an early return when (!force && !needs_flush) is
true. This opportunity to "skip" a flush is not reached when another
subsystem has initiated the flush and entry point for flushing memory is
css->css_rstat_flush().

To verify the above, I made use of a tracepoint previously added [0] to
get info on the number of memcg flushes performed vs skipped. In a
comparison between reading only the memory.stat file vs reading
{memory,io,cpu}.stat files under the same test, the flush count
increased by about the same value the skip count decreased.

Reading memory.stat
non-forced flushes: 5781
flushes skipped: 995826

Reading {memory,io,cpu}.stat
non-forced flushes: 12047
flushes skipped: 990857

If the flushes were not skipped, I think we would see a similar
proportion of mem_cgroup_css_rstat_flush() when reading memory.stat.

[0] https://lore.kernel.org/all/20241029021106.25587-1-inwardvessel@gmail.com/

>
>
>> Another experiment was set up on the same host using a parent cgroup with
>> two child cgroups. The same swap and memory max were used as the
>> previous experiment. In the two child cgroups, kernel builds were done
>> in parallel, each using "-j 20". The perf comparison of the test program
>> was very similar to the values in the previous experiment. The time
>> comparison is shown below.
>>
>> before:
>>     real    1m2.077s
>>     user    0m0.784s
>>     sys     1m0.895s
>
> This is 1M loops of stats reading program like before? I.e. if this
> should be analogous to 0m44.612s above why isn't it same? (I'm thinking
> of more frequent updates in the latter test.)

Yes. One notable difference on this test is there are more threads in
the workload (40 vs 26) which are doing the updates.

>
>> after:
>>     real    0m32.216s
>>     user    0m0.709s
>>     sys     0m31.256s
>
> What was impact on the kernel build workloads (cgroup_rstat_updated)?

You can now find some workload timing results further down. If you're
asking specifically about time spent in cgroup_rstat_updated(), perf
reports show fractional values on both sides.

>
> (Perhaps the saved 30s of CPU work (if potentially moved from readers to
> writers) would be spread too thin in all of two 20-parallel kernel
> builds, right?)

Are you suggesting a workload with fewer threads?

>
> ...
>> For the final experiment, perf events were recorded during a kernel
>> build with the same host and cgroup setup. The builds took place in the
>> child node. Control and experimental sides both showed similar cycles
>> spent on cgroup_rstat_updated(), which appeared insignificant among
>> the events recorded with the workload.
>
> What's the change between control vs experiment? Running in root cg vs
> nested? Or running without *.stat readers vs with them against the
> kernel build?
> (This clarification would likely answer my question above.)
>

workload control with no readers:
real    6m54.818s
user    117m3.122s
sys     5m4.996s

workload experiment with no readers:
real    6m54.862s
user    117m12.812s
sys     5m0.943s

workload control with constant readers {memory,io,cpu,cgroup}.stat:
real    6m59.468s
user    118m26.981s
sys     5m20.163s

workload experiment with constant readers {memory,io,cpu,cgroup}.stat:
real    6m57.031s
user    118m13.833s
sys     5m3.454s

These tests were done in a child (nested) cgroup. Were you also asking for a
root vs nested experiment or were you just needing clarification on the test
details?

>
> Michal
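The early return described in this reply can be modeled in a few lines of userspace C. This is a simplified sketch, not the memcg code: the struct, the function names, and the threshold-based definition of needs_flush are assumptions made for illustration. Only the (!force && !needs_flush) check is taken from the explanation above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state; the counters mirror "flushes" vs "skipped". */
struct flush_model {
	long pending_updates;	/* stat updates accumulated since the last flush */
	long threshold;		/* assumed cutoff below which flushing is skipped */
	long flushes;
	long skipped;
};

/* Stand-in for the real flush work: count it and clear pending updates. */
void model_do_flush(struct flush_model *m)
{
	assert(m);
	m->flushes++;
	m->pending_updates = 0;
}

/* Entry via reading memory.stat: the heuristic may skip the flush. */
void model_flush_from_memory_stat(struct flush_model *m, bool force)
{
	bool needs_flush;

	assert(m);
	needs_flush = m->pending_updates > m->threshold;
	if (!force && !needs_flush) {
		m->skipped++;
		return;
	}
	model_do_flush(m);
}

/* Entry via a shared tree flushed on behalf of another controller:
 * the early return above is never reached, so the flush always runs. */
void model_flush_from_shared_tree(struct flush_model *m)
{
	model_do_flush(m);
}
```

In this model, a reader that only goes through model_flush_from_memory_stat() with few pending updates increments skipped, while every call through model_flush_from_shared_tree() increments flushes regardless of how little there is to flush.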
On Wed, Mar 05, 2025 at 05:07:04PM -0800, JP Kobryn <inwardvessel@gmail.com> wrote:
> When the entry point for flushing is reading the file memory.stat,
> memory_stat_show() is called which leads to __mem_cgroup_flush_stats(). In
> this function, there is an early return when (!force && !needs_flush) is
> true. This opportunity to "skip" a flush is not reached when another
> subsystem has initiated the flush and entry point for flushing memory is
> css->css_rstat_flush().

That sounds spot on, I'd say that explains the savings observed.
Could you add a note to the next version along these lines:

	memcg flushing uses heuristics to optimize flushing but this is
	bypassed when memcg is flushed as a consequence of sharing the
	update tree with another controller.

IOW, other controllers did flushing work instead of memcg but it was
inefficient (effective though).

> Are you suggesting a workload with fewer threads?

No, no, I only roughly wondered where the work disappeared (but I've
understood it from the flushing heuristics above).

> > What's the change between control vs experiment? Running in root cg vs
> > nested? Or running without *.stat readers vs with them against the
> > kernel build?
> > (This clarification would likely answer my question above.)
> >
>

(reordered by me, hopefully we're on the same page)

before split:
> workload control with no readers:
> real    6m54.818s
> user    117m3.122s
> sys     5m4.996s
>
> workload control with constant readers {memory,io,cpu,cgroup}.stat:
> real    6m59.468s
> user    118m26.981s
> sys     5m20.163s

after split:
> workload experiment with no readers:
> real    6m54.862s
> user    117m12.812s
> sys     5m0.943s
>
> workload experiment with constant readers {memory,io,cpu,cgroup}.stat:
> real    6m57.031s
> user    118m13.833s
> sys     5m3.454s

I reckon this is a positive effect* of the utilized heuristics (no
unnecessary flushes, therefore no unnecessary tree updates on the writer
side either).

*) Not statistical but it doesn't look worse.

> These tests were done in a child (nested) cgroup. Were you also asking for a
> root vs nested experiment or were you just needing clarification on the test
> details?

No, I don't think the root vs nested would be that much interesting in
this case.

Thanks,
Michal
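For reference, the cost that constant readers add to the build can be read off the reordered numbers by subtracting the no-reader run from the with-reader run on each side. A small sketch of that subtraction in milliseconds follows; the helper names are hypothetical, and as noted above these are single runs, not a statistical comparison.

```c
#include <assert.h>

/* Convert a time(1)-style "XmY.ZZZs" value to milliseconds (hypothetical helper). */
long time_to_ms(long minutes, long seconds, long millis)
{
	assert(minutes >= 0 && seconds >= 0 && seconds < 60);
	assert(millis >= 0 && millis < 1000);
	return (minutes * 60 + seconds) * 1000 + millis;
}

/* Extra sys time with readers running, before the split (control):
 * 5m20.163s with readers minus 5m4.996s without. */
long control_sys_delta_ms(void)
{
	return time_to_ms(5, 20, 163) - time_to_ms(5, 4, 996);
}

/* Extra sys time with readers running, after the split (experiment):
 * 5m3.454s with readers minus 5m0.943s without. */
long experiment_sys_delta_ms(void)
{
	return time_to_ms(5, 3, 454) - time_to_ms(5, 0, 943);
}
```

With these inputs the reader-induced sys time on the workload is about 15.2s before the split and about 2.5s after it.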
From: JP Kobryn <inwardvessel@gmail.com>

The current design of rstat takes the approach that if one subsystem is
to be flushed, all other subsystems with pending updates should also be
flushed. It seems that over time, the stat-keeping of some subsystems
has grown in size to the extent that they are noticeably slowing down
others. This has been most observable in situations where the memory
controller is enabled. One big area where the issue comes up is system
telemetry, where programs periodically sample cpu stats. It would be a
benefit for programs like this if the overhead of having to flush memory
stats (and others) could be eliminated. It would save cpu cycles for
existing cpu-based telemetry programs and improve scalability in terms
of sampling frequency and volume of hosts.

This series changes the approach of "flush all subsystems" to "flush
only the requested subsystem". The core design change is moving from a
single unified rstat tree of cgroups to having separate trees made up of
cgroup_subsys_state's. There will be one (per-cpu) tree for the base
stats (cgroup::self) and one for each enabled subsystem (if it
implements css_rstat_flush()). In order to do this, the rstat list
pointers were moved off of the cgroup and onto the css. In the
transition, these list pointer types were changed to
cgroup_subsys_state. This allows for rstat trees to now be made up of
css nodes, where a given tree will only contain css nodes associated
with a specific subsystem. The rstat APIs were changed to accept a
reference to a cgroup_subsys_state instead of a cgroup. This allows for
callers to be specific about which stats are being updated/flushed.
Since separate trees will be in use, the locking scheme was adjusted.
The global locks were split up in such a way that there are separate
locks for the base stats (cgroup::self) and each subsystem (memory, io,
etc). This allows different subsystems (including base stats) to use
rstat in parallel with no contention.

Breaking up the unified tree into separate trees eliminates the overhead
and scalability issue explained in the first section, but comes at the
expense of using additional memory. In an effort to minimize this
overhead, a conditional allocation is performed. The cgroup_rstat_cpu
originally contained the rstat list pointers and the base stat entities.
This struct was renamed to cgroup_rstat_base_cpu and is only allocated
when the associated css is cgroup::self. A new compact struct was added
that only contains the rstat list pointers. When the css is associated
with an actual subsystem, this compact struct is allocated. With this
conditional allocation, the change in memory overhead on a per-cpu basis
before/after is shown below.

before:
    sizeof(struct cgroup_rstat_cpu) =~ 176 bytes /* can vary based on config */

    nr_cgroups * sizeof(struct cgroup_rstat_cpu)
    nr_cgroups * 176 bytes

after:
    sizeof(struct cgroup_rstat_cpu) == 16 bytes
    sizeof(struct cgroup_rstat_base_cpu) =~ 176 bytes

    nr_cgroups * (
        sizeof(struct cgroup_rstat_base_cpu) +
            sizeof(struct cgroup_rstat_cpu) * nr_rstat_controllers
        )

    nr_cgroups * (176 + 16 * nr_rstat_controllers)

... where nr_rstat_controllers is the number of enabled cgroup
controllers that implement css_rstat_flush(). On a host where both
memory and io are enabled:

    nr_cgroups * (176 + 16 * 2)
    nr_cgroups * 208 bytes

With regard to validation, there is a measurable benefit when reading
stats with this series. A test program was made to loop 1M times while
reading all four of the files cgroup.stat, cpu.stat, io.stat,
memory.stat of a given parent cgroup each iteration. This test program
has been run in the experiments that follow.

The first experiment consisted of a parent cgroup with memory.swap.max=0
and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created
and within each child cgroup a process was spawned to frequently update
the memory cgroup stats by creating and then reading a file of size 1T
(encouraging reclaim). The test program was run alongside these 26 tasks
in parallel. The results showed a benefit in both time elapsed and perf
data of the test program.

time before:
    real    0m44.612s
    user    0m0.567s
    sys     0m43.887s

perf before:
    27.02% mem_cgroup_css_rstat_flush
     6.35% __blkcg_rstat_flush
     0.06% cgroup_base_stat_cputime_show

time after:
    real    0m27.125s
    user    0m0.544s
    sys     0m26.491s

perf after:
    6.03% mem_cgroup_css_rstat_flush
    0.37% blkcg_print_stat
    0.11% cgroup_base_stat_cputime_show

Another experiment was set up on the same host using a parent cgroup with
two child cgroups. The same swap and memory max were used as the
previous experiment. In the two child cgroups, kernel builds were done
in parallel, each using "-j 20". The perf comparison of the test program
was very similar to the values in the previous experiment. The time
comparison is shown below.

before:
    real    1m2.077s
    user    0m0.784s
    sys     1m0.895s

after:
    real    0m32.216s
    user    0m0.709s
    sys     0m31.256s

Note that the above two experiments were also done with a modified test
program that only reads the cpu.stat file and none of the other three
files previously mentioned. The results were similar to what was seen in
the v1 email for this series. See changelog for link to v1 if needed.

For the final experiment, perf events were recorded during a kernel
build with the same host and cgroup setup. The builds took place in the
child node. Control and experimental sides both showed similar cycles
spent on cgroup_rstat_updated(), which appeared insignificant among
the events recorded with the workload.

changelog
v2:
    drop the patch creating a new cgroup_rstat struct and related code
    drop bpf-specific patches. instead just use cgroup::self in bpf progs
    drop the cpu lock patches. instead select cpu lock in updated_list func
    relocate the cgroup_rstat_init() call to inside css_create()
    relocate the cgroup_rstat_exit() cleanup from apply_control_enable()
    to css_free_rwork_fn()
v1:
    https://lore.kernel.org/all/20250218031448.46951-1-inwardvessel@gmail.com/

JP Kobryn (4):
  cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
  cgroup: rstat lock indirection
  cgroup: separate rstat locks for subsystems
  cgroup: separate rstat list pointers from base stats

 block/blk-cgroup.c                            |   4 +-
 include/linux/cgroup-defs.h                   |  67 ++--
 include/linux/cgroup.h                        |   8 +-
 kernel/cgroup/cgroup-internal.h               |   4 +-
 kernel/cgroup/cgroup.c                        |  53 +--
 kernel/cgroup/rstat.c                         | 318 +++++++++++-------
 mm/memcontrol.c                               |   4 +-
 .../selftests/bpf/progs/btf_type_tag_percpu.c |   5 +-
 .../bpf/progs/cgroup_hierarchical_stats.c     |   8 +-
 9 files changed, 276 insertions(+), 195 deletions(-)
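The test program itself is not included in this posting. A rough sketch of what such a reader loop could look like is given here, assuming a cgroup v2 directory path supplied by the caller; the function names are hypothetical and error handling is minimal.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* The four files named in the cover letter. */
static const char *const stat_files[] = {
	"cgroup.stat", "cpu.stat", "io.stat", "memory.stat",
};

/* Read one file to EOF and discard the contents.
 * Returns the number of bytes read, or -1 on error. */
long read_and_discard(const char *dir, const char *name)
{
	char path[4096];
	char buf[8192];
	long total = 0;
	ssize_t n;
	int len, fd;

	assert(dir && name);
	len = snprintf(path, sizeof(path), "%s/%s", dir, name);
	if (len < 0 || (size_t)len >= sizeof(path))
		return -1;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* Each full read of a *.stat file is what triggers a stats flush. */
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		total += n;

	close(fd);
	return n < 0 ? -1 : total;
}

/* Read all four stat files `iterations` times.
 * Returns 0 on success, -1 on the first failure. */
int sample_cgroup_stats(const char *cgroup_dir, long iterations)
{
	size_t nr_files = sizeof(stat_files) / sizeof(stat_files[0]);

	for (long i = 0; i < iterations; i++)
		for (size_t f = 0; f < nr_files; f++)
			if (read_and_discard(cgroup_dir, stat_files[f]) < 0)
				return -1;
	return 0;
}
```

A caller would pass the parent cgroup's directory and an iteration count of 1000000 to approximate the runs timed above. The cpu.stat-only variant mentioned in the cover letter would shrink stat_files to that single entry.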