Message ID | 20250319071330.898763-1-gthelen@google.com (mailing list archive)
---|---
State | New
Series | cgroup/rstat: avoid disabling irqs for O(num_cpu)
(fix mistyped CC address to Eric)

On Wed, Mar 19, 2025 at 12:13 AM Greg Thelen <gthelen@google.com> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
>
> cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while
> iterating all possible cpus. It only drops the lock if there is
> scheduler or spin lock contention. If neither, then interrupts can be
> disabled for a long time. On large machines this can disable interrupts
> for a long enough time to drop network packets. On 400+ CPU machines
> I've seen interrupts disabled for over 40 msec.
>
> Prevent rstat from disabling interrupts while processing all possible
> cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. This
> approach was previously discussed in
> https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/,
> though this was in the context of a non-irq rstat spin lock.
>
> Benchmark this change with:
> 1) a single stat_reader process with 400 threads, each reading a test
>    memcg's memory.stat repeatedly for 10 seconds.
> 2) 400 memory hog processes running in the test memcg and repeatedly
>    charging memory until oom killed. Then they repeat charging and oom
>    killing.
>
> v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs finds
> interrupts are disabled by rstat for 45341 usec:
>  # => started at: _raw_spin_lock_irq
>  # => ended at:   cgroup_rstat_flush
>  #
>  #
>  #                  _------=> CPU#
>  #                 / _-----=> irqs-off/BH-disabled
>  #                | / _----=> need-resched
>  #                || / _---=> hardirq/softirq
>  #                ||| / _--=> preempt-depth
>  #                |||| / _-=> migrate-disable
>  #                ||||| /     delay
>  #  cmd     pid   |||||| time  |   caller
>  #     \   /      ||||||  \    |    /
>  stat_rea-96532   52d....    0us*: _raw_spin_lock_irq
>  stat_rea-96532   52d.... 45342us : cgroup_rstat_flush
>  stat_rea-96532   52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush
>  stat_rea-96532   52d.... 45343us : <stack trace>
>   => memcg1_stat_format
>   => memory_stat_format
>   => memory_stat_show
>   => seq_read_iter
>   => vfs_read
>   => ksys_read
>   => do_syscall_64
>   => entry_SYSCALL_64_after_hwframe
>
> With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the
> longest holder. The longest irqs-off holder has irqs disabled for
> 4142 usec, a huge reduction from the previous 45341 usec rstat finding.
>
> Running the stat_reader memory.stat reader for 10 seconds:
> - without memory hogs: 9.84M accesses => 12.7M accesses
> - with memory hogs:    9.46M accesses => 11.1M accesses
> The throughput of memory.stat access improves.
>
> The mode of memory.stat access latency after grouping by power-of-2
> buckets:
> - without memory hogs: 64 usec => 16 usec
> - with memory hogs:    64 usec => 8 usec
> The memory.stat latency improves.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Tested-by: Greg Thelen <gthelen@google.com>
> ---
>  kernel/cgroup/rstat.c | 12 +++++-------
>  1 file changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index aac91466279f..976c24b3671a 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -323,13 +323,11 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>  			rcu_read_unlock();
>  		}
>
> -		/* play nice and yield if necessary */
> -		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
> -			__cgroup_rstat_unlock(cgrp, cpu);
> -			if (!cond_resched())
> -				cpu_relax();
> -			__cgroup_rstat_lock(cgrp, cpu);
> -		}
> +		/* play nice and avoid disabling interrupts for a long time */
> +		__cgroup_rstat_unlock(cgrp, cpu);
> +		if (!cond_resched())
> +			cpu_relax();
> +		__cgroup_rstat_lock(cgrp, cpu);
>  	}
>  }
>
> --
> 2.49.0.rc1.451.g8f38331e32-goog
Hello.

On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen <gthelen@google.com> wrote:
> cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while
> iterating all possible cpus. It only drops the lock if there is
> scheduler or spin lock contention. If neither, then interrupts can be
> disabled for a long time. On large machines this can disable interrupts
> for a long enough time to drop network packets. On 400+ CPU machines
> I've seen interrupt disabled for over 40 msec.

This is peanuts, watchdog_thresh defaults to 10000 msec. (Tongue-in-cheek,
to put that threshold into relation, but I see the problem.)

> The mode of memory.stat access latency after grouping by of 2 buckets:

power

> - without memory hogs: 64 usec => 16 usec
> - with memory hogs: 64 usec => 8 usec
> The memory.stat latency improves.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Tested-by: Greg Thelen <gthelen@google.com>
> ---
>  kernel/cgroup/rstat.c | 12 +++++-------
>  1 file changed, 5 insertions(+), 7 deletions(-)

FTR, the lock may end up split per-subsys [1] but this would still make
sense for memcg's one. (I wonder if Tejun would consider it small enough
then to avoid interrupt disabling. Then this could be converted to the
more widely used cond_resched_lock().)

[1] https://lore.kernel.org/r/20250227215543.49928-4-inwardvessel@gmail.com/

But all in all, thanks for this and

Acked-by: Michal Koutný <mkoutny@suse.com>
On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while
> iterating all possible cpus. It only drops the lock if there is
> scheduler or spin lock contention. If neither, then interrupts can be
> disabled for a long time. [...]
>
> -		/* play nice and yield if necessary */
> -		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
> -			__cgroup_rstat_unlock(cgrp, cpu);
> -			if (!cond_resched())
> -				cpu_relax();
> -			__cgroup_rstat_lock(cgrp, cpu);
> -		}
> +		/* play nice and avoid disabling interrupts for a long time */
> +		__cgroup_rstat_unlock(cgrp, cpu);
> +		if (!cond_resched())
> +			cpu_relax();
> +		__cgroup_rstat_lock(cgrp, cpu);
>  	}
>  }

Is not this going a little too far?

The lock + irq trip is quite expensive in its own right and is now going
to be paid for each cpu, as in the total time spent executing
cgroup_rstat_flush_locked is going to go up.

Would your problem go away toggling this every -- say -- 8 cpus?

Just a suggestion.
On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while
> iterating all possible cpus. [...]
>
> -		/* play nice and yield if necessary */
> -		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
> -			__cgroup_rstat_unlock(cgrp, cpu);
> -			if (!cond_resched())
> -				cpu_relax();
> -			__cgroup_rstat_lock(cgrp, cpu);
> -		}
> +		/* play nice and avoid disabling interrupts for a long time */
> +		__cgroup_rstat_unlock(cgrp, cpu);
> +		if (!cond_resched())
> +			cpu_relax();
> +		__cgroup_rstat_lock(cgrp, cpu);

This patch looks good as-is, so feel free to add:

Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>

That being said, I think we can do further cleanups here now.
We should probably inline cgroup_rstat_flush_locked() into
cgroup_rstat_flush(), and move the lock hold and release into the loop.
cgroup_rstat_flush_hold() can simply call cgroup_rstat_flush() then hold
the lock. Something like this on top:

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 976c24b3671a7..4f4b8d22555d7 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -299,17 +299,29 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
 	spin_unlock_irq(&cgroup_rstat_lock);
 }
 
-/* see cgroup_rstat_flush() */
-static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
-	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
+/**
+ * cgroup_rstat_flush - flush stats in @cgrp's subtree
+ * @cgrp: target cgroup
+ *
+ * Collect all per-cpu stats in @cgrp's subtree into the global counters
+ * and propagate them upwards. After this function returns, all cgroups in
+ * the subtree have up-to-date ->stat.
+ *
+ * This also gets all cgroups in the subtree including @cgrp off the
+ * ->updated_children lists.
+ *
+ * This function may block.
+ */
+__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
 {
 	int cpu;
 
-	lockdep_assert_held(&cgroup_rstat_lock);
-
+	might_sleep();
 	for_each_possible_cpu(cpu) {
 		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
 
+		/* Reacquire for every CPU to avoid disabling IRQs too long */
+		__cgroup_rstat_lock(cgrp, cpu);
 		for (; pos; pos = pos->rstat_flush_next) {
 			struct cgroup_subsys_state *css;
 
@@ -322,37 +334,12 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 				css->ss->css_rstat_flush(css, cpu);
 			rcu_read_unlock();
 		}
-
-		/* play nice and avoid disabling interrupts for a long time */
 		__cgroup_rstat_unlock(cgrp, cpu);
 		if (!cond_resched())
 			cpu_relax();
-		__cgroup_rstat_lock(cgrp, cpu);
 	}
 }
 
-/**
- * cgroup_rstat_flush - flush stats in @cgrp's subtree
- * @cgrp: target cgroup
- *
- * Collect all per-cpu stats in @cgrp's subtree into the global counters
- * and propagate them upwards. After this function returns, all cgroups in
- * the subtree have up-to-date ->stat.
- *
- * This also gets all cgroups in the subtree including @cgrp off the
- * ->updated_children lists.
- *
- * This function may block.
- */
-__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
-{
-	might_sleep();
-
-	__cgroup_rstat_lock(cgrp, -1);
-	cgroup_rstat_flush_locked(cgrp);
-	__cgroup_rstat_unlock(cgrp, -1);
-}
-
 /**
  * cgroup_rstat_flush_hold - flush stats in @cgrp's subtree and hold
  * @cgrp: target cgroup
@@ -365,9 +352,8 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
 void cgroup_rstat_flush_hold(struct cgroup *cgrp)
 	__acquires(&cgroup_rstat_lock)
 {
-	might_sleep();
+	cgroup_rstat_flush(cgrp);
 	__cgroup_rstat_lock(cgrp, -1);
-	cgroup_rstat_flush_locked(cgrp);
 }
 
 /**
On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote:
> On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > [...]
>
> Is not this going a little too far?
>
> the lock + irq trip is quite expensive in its own right and now is
> going to be paid for each cpu, as in the total time spent executing
> cgroup_rstat_flush_locked is going to go up.
>
> Would your problem go away toggling this every -- say -- 8 cpus?

I was concerned about this too, and about more lock bouncing, but the
testing suggests that this actually improves the overall latency of
cgroup_rstat_flush_locked() (at least on tested HW). So I don't think we
need to do something like this unless a regression is observed.

> Just a suggestion.
On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while
> iterating all possible cpus. [...]
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Tested-by: Greg Thelen <gthelen@google.com>

Applied to cgroup/for-6.15.

Thanks.
On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote:
...
> Is not this going a little too far?
>
> the lock + irq trip is quite expensive in its own right and now is
> going to be paid for each cpu, as in the total time spent executing
> cgroup_rstat_flush_locked is going to go up.

Lock/unlock when the cacheline is already on the cpu is pretty cheap, and
on modern x86 cpus at least irq on/off are also only a few cycles, so you
probably wouldn't be able to tell the difference.

Thanks.
On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while
> iterating all possible cpus. [...]
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Tested-by: Greg Thelen <gthelen@google.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while
> iterating all possible cpus. It only drops the lock if there is
> scheduler or spin lock contention. If neither, then interrupts can be
> disabled for a long time. On large machines this can disable interrupts
> for a long enough time to drop network packets. On 400+ CPU machines
> I've seen interrupt disabled for over 40 msec.

Which kernel was this observed on in production?

> Prevent rstat from disabling interrupts while processing all possible
> cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu.

Doing it for each cpu might be too extreme. Have you tried with some
batching?

> This approach was previously discussed in
> https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/,
> though this was in the context of an non-irq rstat spin lock.
>
> Benchmark this change with:
> 1) a single stat_reader process with 400 threads, each reading a test
>    memcg's memory.stat repeatedly for 10 seconds.
> 2) 400 memory hog processes running in the test memcg and repeatedly
>    charging memory until oom killed. Then they repeat charging and oom
>    killing.

Though this benchmark seems too extreme, userspace holding off irqs for
that long a time is bad. BTW are these memory hoggers creating anon
memory or file memory? Is [z]swap enabled?

For the long term, I think we can make this work without disabling irqs,
similar to how networking manages the sock lock.

> [...irqsoff trace output snipped...]
>
> With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the
> longest holder. The longest irqs-off holder has irqs disabled for
> 4142 usec, a huge reduction from previous 45341 usec rstat finding.
>
> Running stat_reader memory.stat reader for 10 seconds:
> - without memory hogs: 9.84M accesses => 12.7M accesses
> - with memory hogs: 9.46M accesses => 11.1M accesses
> The throughput of memory.stat access improves.
>
> The mode of memory.stat access latency after grouping by of 2 buckets:
> - without memory hogs: 64 usec => 16 usec
> - with memory hogs: 64 usec => 8 usec
> The memory.stat latency improves.

So, things are improving even without batching. I wonder how this will
look with fewer readers. Can you try with a single reader as well?

> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Tested-by: Greg Thelen <gthelen@google.com>

Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
On Wed, Mar 19, 2025 at 05:16:02PM +0000, Yosry Ahmed wrote:
> @@ -365,9 +352,8 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
>  void cgroup_rstat_flush_hold(struct cgroup *cgrp)
>  	__acquires(&cgroup_rstat_lock)
>  {
> -	might_sleep();
> +	cgroup_rstat_flush(cgrp);
>  	__cgroup_rstat_lock(cgrp, -1);
> -	cgroup_rstat_flush_locked(cgrp);
>  }

Might as well remove cgroup_rstat_flush_hold/release entirely? There are
no external users, and the concept seems moot when the lock is dropped
per default. cgroup_base_stat_cputime_show() can open-code the
lock/unlock to stabilize the counts while reading.

(btw, why do we not have any locking around the root stats in
cgroup_base_stat_cputime_show()? There isn't anything preventing a
reader from seeing all zeroes if another reader runs the memset() on
cgrp->bstat, is there? Or double times...)
On Wed, Mar 19, 2025 at 02:06:43PM -0400, Johannes Weiner wrote:
> On Wed, Mar 19, 2025 at 05:16:02PM +0000, Yosry Ahmed wrote:
> > [...]
>
> Might as well remove cgroup_rstat_flush_hold/release entirely? There
> are no external users, and the concept seems moot when the lock is
> dropped per default. cgroup_base_stat_cputime_show() can open-code the
> lock/unlock to stabilize the counts while reading.

Yeah, I missed the fact that the users are internal because the functions
are not static. I also don't see the point of keeping them. Tejun/Greg,
should I send a patch on top of this one, or do you prefer sending a new
version?

> (btw, why do we not have any locking around the root stats in
> cgroup_base_stat_cputime_show()? There isn't anything preventing a
> reader from seeing all zeroes if another reader runs the memset() on
> cgrp->bstat, is there? Or double times...)

(I think root_cgroup_cputime() operates on a stack-allocated bstat, not
cgrp->bstat.)
On Wed, Mar 19, 2025 at 6:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote: > > From: Eric Dumazet <edumazet@google.com> > > > > cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while > > iterating all possible cpus. It only drops the lock if there is > > scheduler or spin lock contention. If neither, then interrupts can be > > disabled for a long time. On large machines this can disable interrupts > > for a long enough time to drop network packets. On 400+ CPU machines > > I've seen interrupt disabled for over 40 msec. > > Which kernel was this observed on in production? > I had these issues with the latest upstream kernels, and have seen a 80 ms delay. ./trace-cgroup_rstat_flush_locked.sh Attaching 3 probes... ^C @cgroup_rstat_flush_locked_us: [64, 128) 2 |@ | [128, 256) 3 |@ | [256, 512) 2 |@ | [512, 1K) 6 |@@@ | [1K, 2K) 8 |@@@@@ | [2K, 4K) 3 |@ | [4K, 8K) 0 | | [8K, 16K) 0 | | [16K, 32K) 4 |@@ | [32K, 64K) 83 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [64K, 128K) 74 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | @max_us: 79214 Greg in his tests was able to get 45 ms "only" > > > > > Prevent rstat from disabling interrupts while processing all possible > > cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. > > Doing for each cpu might be too extreme. Have you tried with some > batching? > Not really, modern platforms are doing the unlock/lock pair faster than all the cache line misses done in the per-cpu loop. > > > This > > approach was previously discussed in > > https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/, > > though this was in the context of an non-irq rstat spin lock. > > > > Benchmark this change with: > > 1) a single stat_reader process with 400 threads, each reading a test > > memcg's memory.stat repeatedly for 10 seconds. 
> > 2) 400 memory hog processes running in the test memcg and repeatedly
> >    charging memory until oom killed. Then they repeat charging and oom
> >    killing.
>
> Though this benchmark seems too extreme, userspace holding off irqs for
> that long is bad. BTW are these memory hoggers creating anon memory or
> file memory? Is [z]swap enabled?
>
> For the long term, I think we can make this work without disabling
> irqs, similar to how networking manages the sock lock.
>
> > v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds
> > interrupts are disabled by rstat for 45341 usec:
> >  # => started at: _raw_spin_lock_irq
> >  # => ended at:   cgroup_rstat_flush
> >  #
> >  #
> >  #                    _------=> CPU#
> >  #                   / _-----=> irqs-off/BH-disabled
> >  #                  | / _----=> need-resched
> >  #                  || / _---=> hardirq/softirq
> >  #                  ||| / _--=> preempt-depth
> >  #                  |||| / _-=> migrate-disable
> >  #                  ||||| /     delay
> >  #  cmd     pid     |||||| time  |   caller
> >  #     \   /        ||||||  \    |    /
> >  stat_rea-96532    52d....    0us*: _raw_spin_lock_irq
> >  stat_rea-96532    52d.... 45342us : cgroup_rstat_flush
> >  stat_rea-96532    52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush
> >  stat_rea-96532    52d.... 45343us : <stack trace>
> >  => memcg1_stat_format
> >  => memory_stat_format
> >  => memory_stat_show
> >  => seq_read_iter
> >  => vfs_read
> >  => ksys_read
> >  => do_syscall_64
> >  => entry_SYSCALL_64_after_hwframe
> >
> > With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the
> > longest holder. The longest irqs-off holder has irqs disabled for
> > 4142 usec, a huge reduction from the previous 45341 usec rstat finding.
> >
> > Running the stat_reader memory.stat reader for 10 seconds:
> > - without memory hogs: 9.84M accesses => 12.7M accesses
> > - with memory hogs:    9.46M accesses => 11.1M accesses
> > The throughput of memory.stat access improves.
> > The mode of memory.stat access latency after grouping by power-of-2
> > buckets:
> > - without memory hogs: 64 usec => 16 usec
> > - with memory hogs:    64 usec => 8 usec
> > The memory.stat latency improves.
>
> So, things are improving even without batching. I wonder how this will
> look with fewer readers. Can you try with a single reader as well?
>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Greg Thelen <gthelen@google.com>
> > Tested-by: Greg Thelen <gthelen@google.com>
>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
On Wed, Mar 19, 2025 at 06:35:02PM +0000, Yosry Ahmed wrote:
> On Wed, Mar 19, 2025 at 02:06:43PM -0400, Johannes Weiner wrote:
> > On Wed, Mar 19, 2025 at 05:16:02PM +0000, Yosry Ahmed wrote:
> > > @@ -365,9 +352,8 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
> > >  void cgroup_rstat_flush_hold(struct cgroup *cgrp)
> > >  	__acquires(&cgroup_rstat_lock)
> > >  {
> > > -	might_sleep();
> > > +	cgroup_rstat_flush(cgrp);
> > >  	__cgroup_rstat_lock(cgrp, -1);
> > > -	cgroup_rstat_flush_locked(cgrp);
> > >  }
> >
> > Might as well remove cgroup_rstat_flush_hold/release entirely? There
> > are no external users, and the concept seems moot when the lock is
> > dropped per default. cgroup_base_stat_cputime_show() can open-code the
> > lock/unlock to stabilize the counts while reading.
>
> Yeah I missed the fact that the users are internal because the functions
> are not static. I also don't see the point of keeping them.
>
> Tejun/Greg, should I send a patch on top of this one or do you prefer
> sending a new version?

Please send a patch on top.

Thanks.
On Wed, Mar 19, 2025 at 06:35:02PM +0000, Yosry Ahmed wrote:
> On Wed, Mar 19, 2025 at 02:06:43PM -0400, Johannes Weiner wrote:
> > (btw, why do we not have any locking around the root stats in
> > cgroup_base_stat_cputime_show()? There isn't anything preventing a
> > reader from seeing all zeroes if another reader runs the memset() on
> > cgrp->bstat, is there? Or double times...)
>
> (I think root_cgroup_cputime() operates on a stack allocated bstat, not
> cgrp->bstat)

That was the case until:

commit b824766504e49f3fdcbb8c722e70996a78c3636e
Author: Chen Ridong <chenridong@huawei.com>
Date:   Thu Jul 4 14:01:19 2024 +0000

    cgroup/rstat: add force idle show helper

Now it's doing this:

void cgroup_base_stat_cputime_show(struct seq_file *seq)
{
	struct cgroup *cgrp = seq_css(seq)->cgroup;

	if (cgroup_parent(cgrp)) {
		...
	} else {
		/* cgrp->bstat of root is not actually used, reuse it */
		root_cgroup_cputime(&cgrp->bstat);
		usage = cgrp->bstat.cputime.sum_exec_runtime;
		utime = cgrp->bstat.cputime.utime;
		stime = cgrp->bstat.cputime.stime;
		ntime = cgrp->bstat.ntime;
	}
}

and then:

static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
{
	struct task_cputime *cputime = &bstat->cputime;
	int i;

	memset(bstat, 0, sizeof(*bstat));

	/* various += on bstat and cputime members */
}

So all readers are mucking with the root cgroup's bstat.
On Wed, Mar 19, 2025 at 03:16:42PM -0400, Johannes Weiner wrote:
> On Wed, Mar 19, 2025 at 06:35:02PM +0000, Yosry Ahmed wrote:
> > On Wed, Mar 19, 2025 at 02:06:43PM -0400, Johannes Weiner wrote:
> > > (btw, why do we not have any locking around the root stats in
> > > cgroup_base_stat_cputime_show()? There isn't anything preventing a
> > > reader from seeing all zeroes if another reader runs the memset() on
> > > cgrp->bstat, is there? Or double times...)
> >
> > (I think root_cgroup_cputime() operates on a stack allocated bstat, not
> > cgrp->bstat)
>
> That was the case until:
>
> commit b824766504e49f3fdcbb8c722e70996a78c3636e
> Author: Chen Ridong <chenridong@huawei.com>
> Date:   Thu Jul 4 14:01:19 2024 +0000

Nevermind, Tejun pointed me to the follow-up fix he's got already queued
up:

https://web.git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/commit/?id=c4af66a95aa3bc1d4f607ebd4eea524fb58946e3

That brings it all back on the stack.
On Wed, Mar 19, 2025 at 10:53 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while
> > iterating all possible cpus. It only drops the lock if there is
> > scheduler or spin lock contention. If neither, then interrupts can be
> > disabled for a long time. On large machines this can disable interrupts
> > for a long enough time to drop network packets. On 400+ CPU machines
> > I've seen interrupts disabled for over 40 msec.
>
> Which kernel was this observed on in production?
>
> > Prevent rstat from disabling interrupts while processing all possible
> > cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu.
>
> Doing for each cpu might be too extreme. Have you tried with some
> batching?
>
> > This approach was previously discussed in
> > https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/,
> > though this was in the context of a non-irq rstat spin lock.
> >
> > Benchmark this change with:
> > 1) a single stat_reader process with 400 threads, each reading a test
> >    memcg's memory.stat repeatedly for 10 seconds.
> > 2) 400 memory hog processes running in the test memcg and repeatedly
> >    charging memory until oom killed. Then they repeat charging and oom
> >    killing.
>
> Though this benchmark seems too extreme, userspace holding off irqs for
> that long is bad. BTW are these memory hoggers creating anon memory or
> file memory? Is [z]swap enabled?

The memory hoggers were anon, without any form of swap. I think the
other questions were answered in other replies, but feel free to re-ask
and I'll provide details.

> For the long term, I think we can make this work without disabling
> irqs, similar to how networking manages the sock lock.
> > v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds
> > interrupts are disabled by rstat for 45341 usec:
> >  # => started at: _raw_spin_lock_irq
> >  # => ended at:   cgroup_rstat_flush
> >  #
> >  #
> >  #                    _------=> CPU#
> >  #                   / _-----=> irqs-off/BH-disabled
> >  #                  | / _----=> need-resched
> >  #                  || / _---=> hardirq/softirq
> >  #                  ||| / _--=> preempt-depth
> >  #                  |||| / _-=> migrate-disable
> >  #                  ||||| /     delay
> >  #  cmd     pid     |||||| time  |   caller
> >  #     \   /        ||||||  \    |    /
> >  stat_rea-96532    52d....    0us*: _raw_spin_lock_irq
> >  stat_rea-96532    52d.... 45342us : cgroup_rstat_flush
> >  stat_rea-96532    52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush
> >  stat_rea-96532    52d.... 45343us : <stack trace>
> >  => memcg1_stat_format
> >  => memory_stat_format
> >  => memory_stat_show
> >  => seq_read_iter
> >  => vfs_read
> >  => ksys_read
> >  => do_syscall_64
> >  => entry_SYSCALL_64_after_hwframe
> >
> > With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the
> > longest holder. The longest irqs-off holder has irqs disabled for
> > 4142 usec, a huge reduction from the previous 45341 usec rstat finding.
> >
> > Running the stat_reader memory.stat reader for 10 seconds:
> > - without memory hogs: 9.84M accesses => 12.7M accesses
> > - with memory hogs:    9.46M accesses => 11.1M accesses
> > The throughput of memory.stat access improves.
> >
> > The mode of memory.stat access latency after grouping by power-of-2
> > buckets:
> > - without memory hogs: 64 usec => 16 usec
> > - with memory hogs:    64 usec => 8 usec
> > The memory.stat latency improves.
>
> So, things are improving even without batching. I wonder how this will
> look with fewer readers. Can you try with a single reader as well?
>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Greg Thelen <gthelen@google.com>
> > Tested-by: Greg Thelen <gthelen@google.com>
>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
On Wed, Mar 19, 2025 at 6:26 PM Tejun Heo <tj@kernel.org> wrote:
>
> On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote:
> ...
> > Is not this going a little too far?
> >
> > the lock + irq trip is quite expensive in its own right and now is
> > going to be paid for each cpu, as in the total time spent executing
> > cgroup_rstat_flush_locked is going to go up.
>
> Lock/unlock when the cacheline is already on the cpu is pretty cheap and on
> modern x86 cpus at least irq on/off are also only a few cycles, so you
> probably wouldn't be able to tell the difference.

This does not line up with what I have seen on x86-64 uarchs up to this
point, including Sapphire Rapids, which I think still counts as
reasonably new.

Most notably I see smp_mb() as a factor on my profiles, which compiles
to a lock-prefixed add of $0 to a spot on the stack -- a very much
cache hot variable.

However, the cost of an atomic op has a lot to do with the state
accumulated around it -- the more atomics in short succession, the
lesser the impact. That is to say, it may be that given code is slow
enough that adding a lock-prefixed instruction does not add notable
overhead.

In order to not just handwave, here is the overhead of __legitimize_mnt()
while performing a path lookup for access() on my test box. smp_mb() is
the stock variant. irq() merely toggles interrupts; it does not provide
any sanity for the routine, but lets me see the cost if it got planted
there for some magic reason. Finally, the last bit is what I expect the
final routine to have -- merely a branch to account for bad mounts
(taking advantage of synchronize_rcu() on unmount).

smp_mb:  3.31%
irq:     1.44%
nothing: 0.63%

These would be higher if it was not for the fact that memory allocation
(for the path buffer) is dog slow.
And indeed I get over a 3% speedup in access() rate by not suffering
that smp_mb(). [I note I don't have a viable patch for it; rather I
whacked it for benchmarking purposes to see if pursuing proper removal
is worth it.]

I'll note though that profiling open()+close() shows virtually no
difference (less than 0.5%) -- but that's not an argument for atomics
being cheap; instead it is an argument for open() in particular being
slow (which it is; I'm working on it though).

If you want I can ship you the test case, diffs and kernel config to
reproduce on your own.

--
Mateusz Guzik <mjguzik gmail.com>
On Wed, Mar 19, 2025 at 05:18:05PM +0000, Yosry Ahmed wrote:
> On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote:
> > Is not this going a little too far?
> >
> > the lock + irq trip is quite expensive in its own right and now is
> > going to be paid for each cpu, as in the total time spent executing
> > cgroup_rstat_flush_locked is going to go up.
> >
> > Would your problem go away toggling this every -- say -- 8 cpus?
>
> I was concerned about this too, and about more lock bouncing, but the
> testing suggests that this actually overall improves the latency of
> cgroup_rstat_flush_locked() (at least on tested HW).
>
> So I don't think we need to do something like this unless a regression
> is observed.

To my reading it reduces the max time spent with irqs disabled, which of
course it does -- after all it toggles them for every CPU.

Per my other e-mail in the thread the irq + lock trips remain not cheap,
at least on Sapphire Rapids.

In my testing outlined below I see an 11% increase in total execution
time with the irq + lock trip for every CPU in a 24-way vm.

So I stand by instead doing this every n CPUs, call it 8 or whatever.

How to repro:

I employed a poor-man's profiler like so:

bpftrace -e 'kprobe:cgroup_rstat_flush_locked { @start[tid] = nsecs; } kretprobe:cgroup_rstat_flush_locked /@start[tid]/ { print(nsecs - @start[tid]); delete(@start[tid]); } interval:s:60 { exit(); }'

This patch or not, execution time varies wildly even while the box is
idle.

The above runs for a minute, collecting 23 samples (you may get "lucky"
and get one extra; in that case remove it for comparison).

A sysctl was added to toggle the new behavior vs the old one. Patch at
the end.

"enabled" (1) means the new behavior, "disabled" (0) means the old one.
Sum of nsecs (results piped to: awk '{ sum += $1 } END { print sum }'):
disabled: 903610
enabled:  1006833 (+11.4%)

Toggle at runtime with:
sysctl fs.magic_tunable=0 # disabled, no mandatory relocks
sysctl fs.magic_tunable=1 # enabled, relock for every CPU

I attached the stats I got for reference.

I patched v6.14 with the following:

diff --git a/fs/file_table.c b/fs/file_table.c
index c04ed94cdc4b..441f89421413 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -106,6 +106,8 @@ static int proc_nr_files(const struct ctl_table *table, int write, void *buffer,
 	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
 
+unsigned long magic_tunable;
+
 static const struct ctl_table fs_stat_sysctls[] = {
 	{
 		.procname	= "file-nr",
@@ -123,6 +125,16 @@ static const struct ctl_table fs_stat_sysctls[] = {
 		.extra1		= SYSCTL_LONG_ZERO,
 		.extra2		= SYSCTL_LONG_MAX,
 	},
+	{
+		.procname	= "magic_tunable",
+		.data		= &magic_tunable,
+		.maxlen		= sizeof(magic_tunable),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+		.extra1		= SYSCTL_LONG_ZERO,
+		.extra2		= SYSCTL_LONG_MAX,
+	},
+
 	{
 		.procname	= "nr_open",
 		.data		= &sysctl_nr_open,
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 3e01781aeb7b..f6444bf25b2f 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -299,6 +299,8 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
 	spin_unlock_irq(&cgroup_rstat_lock);
 }
 
+extern unsigned long magic_tunable;
+
 /* see cgroup_rstat_flush() */
 static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
@@ -323,12 +325,18 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 			rcu_read_unlock();
 		}
 
-		/* play nice and yield if necessary */
-		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+		if (READ_ONCE(magic_tunable)) {
 			__cgroup_rstat_unlock(cgrp, cpu);
 			if (!cond_resched())
 				cpu_relax();
 			__cgroup_rstat_lock(cgrp, cpu);
+		} else {
+			if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+				__cgroup_rstat_unlock(cgrp, cpu);
+				if (!cond_resched())
+					cpu_relax();
+				__cgroup_rstat_lock(cgrp, cpu);
+			}
 		}
 	}
 }

The attached samples (nsecs per cgroup_rstat_flush_locked() call, 23
each):

disabled:
69869
30473
64670
30544
30950
36445
36235
29920
51179
35760
33424
42426
30177
31211
44974
34450
37871
72642
33016
29518
31800
35730
30326

enabled:
63507
50113
36280
35148
63329
41232
51265
41341
41418
42824
35200
35550
54684
41597
55325
36120
48675
41179
39339
35794
38826
37411
40676
On Thu, Mar 27, 2025 at 03:38:50PM +0100, Mateusz Guzik wrote:
> On Wed, Mar 19, 2025 at 05:18:05PM +0000, Yosry Ahmed wrote:
> > On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote:
> > > Is not this going a little too far?
> > >
> > > the lock + irq trip is quite expensive in its own right and now is
> > > going to be paid for each cpu, as in the total time spent executing
> > > cgroup_rstat_flush_locked is going to go up.
> > >
> > > Would your problem go away toggling this every -- say -- 8 cpus?
> >
> > I was concerned about this too, and about more lock bouncing, but the
> > testing suggests that this actually overall improves the latency of
> > cgroup_rstat_flush_locked() (at least on tested HW).
> >
> > So I don't think we need to do something like this unless a regression
> > is observed.
>
> To my reading it reduces max time spent with irq disabled, which of
> course it does -- after all it toggles it for every CPU.
>
> Per my other e-mail in the thread the irq + lock trips remain not cheap
> at least on Sapphire Rapids.
>
> In my testing outlined below I see 11% increase in total execution time
> with the irq + lock trip for every CPU in a 24-way vm.
>
> So I stand by instead doing this every n CPUs, call it 8 or whatever.
>
> How to repro:
>
> I employed a poor-man's profiler like so:
>
> bpftrace -e 'kprobe:cgroup_rstat_flush_locked { @start[tid] = nsecs; } kretprobe:cgroup_rstat_flush_locked /@start[tid]/ { print(nsecs - @start[tid]); delete(@start[tid]); } interval:s:60 { exit(); }'
>
> This patch or not, execution time varies wildly even while the box is idle.
>
> The above runs for a minute, collecting 23 samples (you may get
> "lucky" and get one extra, in that case remove it for comparison).
>
> A sysctl was added to toggle the new behavior vs old one. Patch at the
> end.
>
> "enabled"(1) means new behavior, "disabled"(0) means the old one.
> > Sum of nsecs (results piped to: awk '{ sum += $1 } END { print sum }'):
> > disabled: 903610
> > enabled:  1006833 (+11.4%)

IIUC this calculates the amount of elapsed time between start and
finish, not necessarily the function's own execution time. Is it
possible that the increase in time is due to more interrupts arriving
during the function execution (which is what we want), rather than more
time being spent on disabling/enabling IRQs?
On Thu, Mar 27, 2025 at 6:17 PM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
>
> On Thu, Mar 27, 2025 at 03:38:50PM +0100, Mateusz Guzik wrote:
> > On Wed, Mar 19, 2025 at 05:18:05PM +0000, Yosry Ahmed wrote:
> > > On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote:
> > > > Is not this going a little too far?
> > > >
> > > > the lock + irq trip is quite expensive in its own right and now is
> > > > going to be paid for each cpu, as in the total time spent executing
> > > > cgroup_rstat_flush_locked is going to go up.
> > > >
> > > > Would your problem go away toggling this every -- say -- 8 cpus?
> > >
> > > I was concerned about this too, and about more lock bouncing, but the
> > > testing suggests that this actually overall improves the latency of
> > > cgroup_rstat_flush_locked() (at least on tested HW).
> > >
> > > So I don't think we need to do something like this unless a regression
> > > is observed.
> >
> > To my reading it reduces max time spent with irq disabled, which of
> > course it does -- after all it toggles it for every CPU.
> >
> > Per my other e-mail in the thread the irq + lock trips remain not cheap
> > at least on Sapphire Rapids.
> >
> > In my testing outlined below I see 11% increase in total execution time
> > with the irq + lock trip for every CPU in a 24-way vm.
> >
> > So I stand by instead doing this every n CPUs, call it 8 or whatever.
> >
> > How to repro:
> >
> > I employed a poor-man's profiler like so:
> >
> > bpftrace -e 'kprobe:cgroup_rstat_flush_locked { @start[tid] = nsecs; } kretprobe:cgroup_rstat_flush_locked /@start[tid]/ { print(nsecs - @start[tid]); delete(@start[tid]); } interval:s:60 { exit(); }'
> >
> > This patch or not, execution time varies wildly even while the box is idle.
> >
> > The above runs for a minute, collecting 23 samples (you may get
> > "lucky" and get one extra, in that case remove it for comparison).
> >
> > A sysctl was added to toggle the new behavior vs old one. Patch at the
> > end.
> > > > "enabled"(1) means new behavior, "disabled"(0) means the old one. > > > > Sum of nsecs (results piped to: awk '{ sum += $1 } END { print sum }'): > > disabled: 903610 > > enabled: 1006833 (+11.4%) > > IIUC this calculates the amount of elapsed time between start and > finish, not necessarily the function's own execution time. Is it > possible that the increase in time is due to more interrupts arriving > during the function execution (which is what we want), rather than more > time being spent on disabling/enabling IRQs? I can agree irq handlers have more opportunities to execute in the toggling case and that the time accounted in the way above will include them. I don't think explains it, but why not, let's test without this problem. I feel compelled to note atomics on x86-64 were expensive for as long as the architecture was around so I'm confused what's up with the resistance to the notion that they remain costly even with modern uarchs. If anything, imo claims that they are cheap require strong evidence. That said, I modified the patch to add a section which issues conditional relock if needed and smp_mb otherwise -- irqs remain disabled, but we are still paying for a full fence. smp_mb is a lock add $0 on the stack pointer. Note this has less work to do than what was added in your patch. 
It looks like this:

	switch (READ_ONCE(magic_tunable)) {
	case 1:
		__cgroup_rstat_unlock(cgrp, cpu);
		if (!cond_resched())
			cpu_relax();
		__cgroup_rstat_lock(cgrp, cpu);
		break;
	case 2:
		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
			__cgroup_rstat_unlock(cgrp, cpu);
			if (!cond_resched())
				cpu_relax();
			__cgroup_rstat_lock(cgrp, cpu);
		} else {
			smp_mb();
		}
		break;
	default:
		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
			__cgroup_rstat_unlock(cgrp, cpu);
			if (!cond_resched())
				cpu_relax();
			__cgroup_rstat_lock(cgrp, cpu);
		}
		break;
	}

With this in place I'm seeing about a 4% increase in execution time
measured the same way, so irq handlers sneaking in don't explain it.
Note smp_mb() alone is a smaller cost than the locked instruction +
func calls + irq trips.

I also note I'm running this in a VM (24-way), where paravirt spinlocks
also issue a lock-prefixed instruction to release the lock.

I would say this very much justifies the original claim of 11% with the
patch as proposed.
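For readers unfamiliar with the instruction under discussion: on x86-64 the kernel's smp_mb() is exactly the lock-prefixed add of zero that Mateusz profiles above. Below is a userspace sketch of that instruction; it is illustrative only (the function names are made up, and the non-x86 fallback is an assumption), not kernel code.

```c
#include <assert.h>

/*
 * Userspace sketch of what the kernel's smp_mb() expands to on x86-64:
 * a lock-prefixed "addl $0" to a dword just below the stack pointer.
 * The lock prefix makes it a full memory barrier, and the stack slot
 * is used because that cacheline is almost certainly already exclusive
 * to the executing CPU.
 */
static inline void demo_smp_mb(void)
{
#if defined(__x86_64__)
	__asm__ __volatile__("lock; addl $0,-4(%%rsp)" ::: "memory", "cc");
#else
	__sync_synchronize();	/* portable full-fence fallback */
#endif
}

/*
 * Store, full fence, load. The fence does not change the value,
 * only the ordering guarantees around the accesses.
 */
static int barrier_roundtrip(int v)
{
	volatile int x = v;	/* force a real store */

	demo_smp_mb();
	return x;		/* load issued after the fence */
}
```

The cost being debated in the thread is purely that of the lock-prefixed instruction itself; the value semantics are unchanged, which is why it only shows up in profiles, not in behavior.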
Hello Mateusz.

On Thu, Mar 27, 2025 at 06:47:56PM +0100, Mateusz Guzik <mjguzik@gmail.com> wrote:
> I feel compelled to note atomics on x86-64 were expensive for as long
> as the architecture was around so I'm confused what's up with the
> resistance to the notion that they remain costly even with modern
> uarchs. If anything, imo claims that they are cheap require strong
> evidence.

I don't think there's strong resistance; your measurements show that
it's not negligible under the given conditions.

The question is -- how much benefit would flushers have in practice
with coalesced unlock-locks. The approach now, releasing for each CPU,
is simple and benefits the latency of irq dependants.

If you see practical issues with the limited throughput of stat readers
(or flushers in general) because of this, please send a patch for
discussion that resolves it while preserving (some of) the irq freedom.

Also there is ongoing work on splitting up flushing per controller --
I'd like to see whether the given locks become "small" enough to require
no _irq exclusion at all during flushing.

Michal
On Tue, Apr 1, 2025 at 5:00 PM Michal Koutný <mkoutny@suse.com> wrote:
> On Thu, Mar 27, 2025 at 06:47:56PM +0100, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > I feel compelled to note atomics on x86-64 were expensive for as long
> > as the architecture was around so I'm confused what's up with the
> > resistance to the notion that they remain costly even with modern
> > uarchs. If anything, imo claims that they are cheap require strong
> > evidence.
>
> I don't think there's strong resistance; your measurements show that
> it's not negligible under the given conditions.
>
> The question is -- how much benefit would flushers have in practice
> with coalesced unlock-locks. The approach now, releasing for each CPU,
> is simple and benefits the latency of irq dependants.

Toggling every n cpus instead of every single time is trivial to do. I'm
trying to avoid sending a patch in hopes of not getting CC'ed for
unrelated stuff later.

> If you see practical issues with the limited throughput of stat readers
> (or flushers in general) because of this, please send a patch for
> discussion that resolves it while preserving (some of) the irq freedom.

This is some background maintenance work and it should do what's
feasible to not eat CPU. The stock loop was behaving poorly in the face
of a high CPU count and it makes excellent sense to make it toggle the
lock in *some* capacity. I just don't believe going from 400+ CPUs
straight to literally 1 every time is warranted.

It seems the author felt justified by the notion that it does not add
overhead on contemporary hardware, but per your own e-mail I
demonstrated this does not hold.

Is this really going to suffer for toggling every 8 CPUs? That's a 50x
factor reduction.

I would not be mailing here if the change was hard to do, but it really
is not: it's literally a counter in a loop getting checked.
> Also there is ongoing work on splitting up flushing per controller --
> I'd like to see whether the given locks become "small" enough to
> require no _irq exclusion at all during flushing.

Temp changes like this tend to stay for a long time.

tl;dr I don't believe going straight from 400 to 1 was properly
justified, and I demonstrated how it hurts on a rather modest box. At
the same time a (likely) more than sufficient improvement over the
stock state can be trivially achieved while adding only a small
fraction of the overhead.

That said, there is bigger fish to fry elsewhere and I have no stake in
this code, so I'm not going to mail any further about this.
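Mateusz's "counter in a loop" point can be made concrete with a userspace sketch of the relock cadence. The `FLUSH_RELOCK_BATCH` knob and the simulation itself are illustrative assumptions, not code from this thread; the posted patch corresponds to a batch of 1.

```c
#include <assert.h>

/* Hypothetical batching knob; not part of the posted patch. */
#define FLUSH_RELOCK_BATCH 8

/*
 * Count how many unlock/reacquire trips a flush over nr_cpus CPUs
 * would perform if the lock were dropped every `batch` CPUs, instead
 * of after every single CPU (batch == 1, the patch as posted).
 */
static int count_relocks(int nr_cpus, int batch)
{
	int cpu, relocks = 0;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		/* ...per-cpu stat flushing would happen here... */
		if ((cpu + 1) % batch == 0)
			relocks++;	/* unlock + cond_resched + lock */
	}
	return relocks;
}
```

On the 400-CPU machine from the patch description, batch == 1 pays 400 lock round-trips per flush while batch == 8 pays 50, which is the 50x factor reduction Mateusz refers to.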
On Tue, Apr 01, 2025 at 05:46:41PM +0200, Mateusz Guzik <mjguzik@gmail.com> wrote:
> Is this really going to suffer for toggling every 8 CPUs? That's a 50x
> factor reduction.

As per the original patch, there's a ~10x saving in max irqs-off
holding time. Naively, aggregating the flushing by 8 CPUs could reduce
that to only a (10/8) ~1.25x saving. (OTOH, it's not a 400x effect, so
that doesn't explain it completely; not all CPUs are the same.)

I can imagine the balanced value with this information would be around
20 CPUs (sqrt(400)). But the issue is it could as well be 4 or 32 or 8.
Starting with 1 is the simplest approach without introducing magic
constants or heuristics.

> Temp changes like this tend to stay for a long time.

That'd mean that no one notices the performance impact there :-) It can
be easily changed later too.

> That said, there is bigger fish to fry elsewhere and I have no stake
> in this code, so I'm not going to mail any further about this.

Thank you for spending your effort on this, it's a useful reference for
the future!

Michal
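Michal's back-of-the-envelope scaling argument can be written down as a tiny model. This is purely illustrative: the function name is made up, the numbers are the ones reported in the patch description (45341 usec over 400 CPUs), and the model deliberately ignores per-CPU variance, which the thread notes is real.

```c
#include <assert.h>

/*
 * Naive model: if flushing all nr_cpus CPUs with the lock held takes
 * total_us of irqs-off time, then relocking every `batch` CPUs bounds
 * each irqs-off window by roughly batch/nr_cpus of the total. Integer
 * division is fine for an order-of-magnitude estimate.
 */
static int max_irqsoff_window_us(int total_us, int nr_cpus, int batch)
{
	return total_us * batch / nr_cpus;
}
```

Under this model the per-CPU relock (batch 1) bounds each window at ~113 usec while batch 8 bounds it at ~906 usec; the observed post-patch maximum of 4142 usec being well above 113 usec is the "not all CPUs are the same" discrepancy Michal points out.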
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index aac91466279f..976c24b3671a 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -323,13 +323,11 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 			rcu_read_unlock();
 		}
 
-		/* play nice and yield if necessary */
-		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
-			__cgroup_rstat_unlock(cgrp, cpu);
-			if (!cond_resched())
-				cpu_relax();
-			__cgroup_rstat_lock(cgrp, cpu);
-		}
+		/* play nice and avoid disabling interrupts for a long time */
+		__cgroup_rstat_unlock(cgrp, cpu);
+		if (!cond_resched())
+			cpu_relax();
+		__cgroup_rstat_lock(cgrp, cpu);
 	}
 }