
cgroup/rstat: avoid disabling irqs for O(num_cpu)

Message ID 20250319071330.898763-1-gthelen@google.com (mailing list archive)
State New
Series cgroup/rstat: avoid disabling irqs for O(num_cpu)

Commit Message

Greg Thelen March 19, 2025, 7:13 a.m. UTC
From: Eric Dumazet <edumazet@google.com>

cgroup_rstat_flush_locked() grabs the irq-safe cgroup_rstat_lock while
iterating all possible cpus. It only drops the lock if there is
scheduler or spin lock contention. If there is neither, interrupts can
stay disabled for a long time. On large machines this is long enough to
drop network packets. On 400+ CPU machines I've seen interrupts
disabled for over 40 msec.

Prevent rstat from disabling interrupts while processing all possible
cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. This
approach was previously discussed in
https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/,
though this was in the context of a non-irq rstat spin lock.

Benchmark this change with:
1) a single stat_reader process with 400 threads, each reading a test
   memcg's memory.stat repeatedly for 10 seconds.
2) 400 memory hog processes running in the test memcg and repeatedly
   charging memory until oom killed. Then they repeat charging and oom
   killing.
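
A rough sketch of the stat_reader side of this workload (the cgroup
path, thread count and reporting are illustrative; the hog processes
and the actual benchmark tooling are not part of this patch):

/* stat_reader sketch: one process, many threads re-reading memory.stat */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define NR_THREADS	400
#define STAT_PATH	"/sys/fs/cgroup/test/memory.stat"	/* hypothetical path */

static void *reader(void *arg)
{
	char buf[16384];
	time_t stop = time(NULL) + 10;
	unsigned long reads = 0;

	while (time(NULL) < stop) {
		int fd = open(STAT_PATH, O_RDONLY);

		if (fd < 0)
			break;
		while (read(fd, buf, sizeof(buf)) > 0)
			;
		close(fd);
		reads++;
	}
	return (void *)reads;
}

int main(void)
{
	pthread_t tid[NR_THREADS];
	unsigned long total = 0;

	for (int i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, reader, NULL);
	for (int i = 0; i < NR_THREADS; i++) {
		void *ret;

		pthread_join(tid[i], &ret);
		total += (unsigned long)ret;
	}
	printf("memory.stat reads in 10s: %lu\n", total);
	return 0;
}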

v6.14-rc6 with CONFIG_IRQSOFF_TRACER, running stat_reader and the hogs,
finds interrupts disabled by rstat for 45341 usec:
  #  => started at: _raw_spin_lock_irq
  #  => ended at:   cgroup_rstat_flush
  #
  #
  #                    _------=> CPU#
  #                   / _-----=> irqs-off/BH-disabled
  #                  | / _----=> need-resched
  #                  || / _---=> hardirq/softirq
  #                  ||| / _--=> preempt-depth
  #                  |||| / _-=> migrate-disable
  #                  ||||| /     delay
  #  cmd     pid     |||||| time  |   caller
  #     \   /        ||||||  \    |    /
  stat_rea-96532    52d....    0us*: _raw_spin_lock_irq
  stat_rea-96532    52d.... 45342us : cgroup_rstat_flush
  stat_rea-96532    52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush
  stat_rea-96532    52d.... 45343us : <stack trace>
   => memcg1_stat_format
   => memory_stat_format
   => memory_stat_show
   => seq_read_iter
   => vfs_read
   => ksys_read
   => do_syscall_64
   => entry_SYSCALL_64_after_hwframe

With this patch CONFIG_IRQSOFF_TRACER no longer finds rstat to be the
longest irqs-off holder. The longest irqs-off holder has irqs disabled
for 4142 usec, a huge reduction from the 45341 usec previously
attributed to rstat.

Running the stat_reader memory.stat benchmark for 10 seconds:
- without memory hogs: 9.84M accesses => 12.7M accesses
-    with memory hogs: 9.46M accesses => 11.1M accesses
The throughput of memory.stat access improves.

The mode of memory.stat access latency, after grouping into power-of-2 buckets:
- without memory hogs: 64 usec => 16 usec
-    with memory hogs: 64 usec =>  8 usec
The memory.stat latency improves.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Tested-by: Greg Thelen <gthelen@google.com>
---
 kernel/cgroup/rstat.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

Comments

Greg Thelen March 19, 2025, 7:17 a.m. UTC | #1
(fix mistyped CC address to Eric)

On Wed, Mar 19, 2025 at 12:13 AM Greg Thelen <gthelen@google.com> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
>
> cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while
> iterating all possible cpus. It only drops the lock if there is
> scheduler or spin lock contention. If neither, then interrupts can be
> disabled for a long time. On large machines this can disable interrupts
> for a long enough time to drop network packets. On 400+ CPU machines
> I've seen interrupt disabled for over 40 msec.
>
> Prevent rstat from disabling interrupts while processing all possible
> cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. This
> approach was previously discussed in
> https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/,
> though this was in the context of an non-irq rstat spin lock.
>
> Benchmark this change with:
> 1) a single stat_reader process with 400 threads, each reading a test
>    memcg's memory.stat repeatedly for 10 seconds.
> 2) 400 memory hog processes running in the test memcg and repeatedly
>    charging memory until oom killed. Then they repeat charging and oom
>    killing.
>
> v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds
> interrupts are disabled by rstat for 45341 usec:
>   #  => started at: _raw_spin_lock_irq
>   #  => ended at:   cgroup_rstat_flush
>   #
>   #
>   #                    _------=> CPU#
>   #                   / _-----=> irqs-off/BH-disabled
>   #                  | / _----=> need-resched
>   #                  || / _---=> hardirq/softirq
>   #                  ||| / _--=> preempt-depth
>   #                  |||| / _-=> migrate-disable
>   #                  ||||| /     delay
>   #  cmd     pid     |||||| time  |   caller
>   #     \   /        ||||||  \    |    /
>   stat_rea-96532    52d....    0us*: _raw_spin_lock_irq
>   stat_rea-96532    52d.... 45342us : cgroup_rstat_flush
>   stat_rea-96532    52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush
>   stat_rea-96532    52d.... 45343us : <stack trace>
>    => memcg1_stat_format
>    => memory_stat_format
>    => memory_stat_show
>    => seq_read_iter
>    => vfs_read
>    => ksys_read
>    => do_syscall_64
>    => entry_SYSCALL_64_after_hwframe
>
> With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the
> longest holder. The longest irqs-off holder has irqs disabled for
> 4142 usec, a huge reduction from previous 45341 usec rstat finding.
>
> Running stat_reader memory.stat reader for 10 seconds:
> - without memory hogs: 9.84M accesses => 12.7M accesses
> -    with memory hogs: 9.46M accesses => 11.1M accesses
> The throughput of memory.stat access improves.
>
> The mode of memory.stat access latency after grouping by of 2 buckets:
> - without memory hogs: 64 usec => 16 usec
> -    with memory hogs: 64 usec =>  8 usec
> The memory.stat latency improves.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Tested-by: Greg Thelen <gthelen@google.com>
> ---
>  kernel/cgroup/rstat.c | 12 +++++-------
>  1 file changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index aac91466279f..976c24b3671a 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -323,13 +323,11 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>                         rcu_read_unlock();
>                 }
>
> -               /* play nice and yield if necessary */
> -               if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
> -                       __cgroup_rstat_unlock(cgrp, cpu);
> -                       if (!cond_resched())
> -                               cpu_relax();
> -                       __cgroup_rstat_lock(cgrp, cpu);
> -               }
> +               /* play nice and avoid disabling interrupts for a long time */
> +               __cgroup_rstat_unlock(cgrp, cpu);
> +               if (!cond_resched())
> +                       cpu_relax();
> +               __cgroup_rstat_lock(cgrp, cpu);
>         }
>  }
>
> --
> 2.49.0.rc1.451.g8f38331e32-goog
>
Michal Koutný March 19, 2025, 10:20 a.m. UTC | #2
Hello.

On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen <gthelen@google.com> wrote:
> cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while
> iterating all possible cpus. It only drops the lock if there is
> scheduler or spin lock contention. If neither, then interrupts can be
> disabled for a long time. On large machines this can disable interrupts
> for a long enough time to drop network packets. On 400+ CPU machines
> I've seen interrupt disabled for over 40 msec.

This is peanuts, watchdog_thresh defaults to 10000 msec.
(Tongue-in-cheek, just to put that threshold into perspective -- I do
see the problem.)


> The mode of memory.stat access latency after grouping by of 2 buckets:
                                                        power

> - without memory hogs: 64 usec => 16 usec
> -    with memory hogs: 64 usec =>  8 usec
> The memory.stat latency improves.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Tested-by: Greg Thelen <gthelen@google.com>
> ---
>  kernel/cgroup/rstat.c | 12 +++++-------
>  1 file changed, 5 insertions(+), 7 deletions(-)

FTR, the lock may end up split per-subsys [1] but this would still make
sense for memcg's one. (I wonder if Tejun would consider it small enough
then to avoid interrupt disabling. Then this could be converted to the
more widely used cond_resched_lock().)

[1] https://lore.kernel.org/r/20250227215543.49928-4-inwardvessel@gmail.com/
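
For illustration, with a plain (non-irq) spinlock the per-cpu loop
could lean on the stock helper, roughly like this (a sketch of the
idea, not current upstream code):

	for_each_possible_cpu(cpu) {
		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);

		for (; pos; pos = pos->rstat_flush_next) {
			/* flush the css stats for this cpu, as today */
		}

		/*
		 * cond_resched_lock() drops cgroup_rstat_lock and yields
		 * only when a reschedule or lock break is needed, so
		 * uncontended flushes keep the lock across cpus.
		 */
		cond_resched_lock(&cgroup_rstat_lock);
	}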

But all in all, thanks for this and

Acked-by: Michal Koutný <mkoutny@suse.com>
Mateusz Guzik March 19, 2025, 10:47 a.m. UTC | #3
On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while
> iterating all possible cpus. It only drops the lock if there is
> scheduler or spin lock contention. If neither, then interrupts can be
> disabled for a long time. On large machines this can disable interrupts
> for a long enough time to drop network packets. On 400+ CPU machines
> I've seen interrupt disabled for over 40 msec.
> 
> Prevent rstat from disabling interrupts while processing all possible
> cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. This
> approach was previously discussed in
> https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/,
> though this was in the context of an non-irq rstat spin lock.
> 
> Benchmark this change with:
> 1) a single stat_reader process with 400 threads, each reading a test
>    memcg's memory.stat repeatedly for 10 seconds.
> 2) 400 memory hog processes running in the test memcg and repeatedly
>    charging memory until oom killed. Then they repeat charging and oom
>    killing.
> 
> v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds
> interrupts are disabled by rstat for 45341 usec:
>   #  => started at: _raw_spin_lock_irq
>   #  => ended at:   cgroup_rstat_flush
>   #
>   #
>   #                    _------=> CPU#
>   #                   / _-----=> irqs-off/BH-disabled
>   #                  | / _----=> need-resched
>   #                  || / _---=> hardirq/softirq
>   #                  ||| / _--=> preempt-depth
>   #                  |||| / _-=> migrate-disable
>   #                  ||||| /     delay
>   #  cmd     pid     |||||| time  |   caller
>   #     \   /        ||||||  \    |    /
>   stat_rea-96532    52d....    0us*: _raw_spin_lock_irq
>   stat_rea-96532    52d.... 45342us : cgroup_rstat_flush
>   stat_rea-96532    52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush
>   stat_rea-96532    52d.... 45343us : <stack trace>
>    => memcg1_stat_format
>    => memory_stat_format
>    => memory_stat_show
>    => seq_read_iter
>    => vfs_read
>    => ksys_read
>    => do_syscall_64
>    => entry_SYSCALL_64_after_hwframe
> 
> With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the
> longest holder. The longest irqs-off holder has irqs disabled for
> 4142 usec, a huge reduction from previous 45341 usec rstat finding.
> 
> Running stat_reader memory.stat reader for 10 seconds:
> - without memory hogs: 9.84M accesses => 12.7M accesses
> -    with memory hogs: 9.46M accesses => 11.1M accesses
> The throughput of memory.stat access improves.
> 
> The mode of memory.stat access latency after grouping by of 2 buckets:
> - without memory hogs: 64 usec => 16 usec
> -    with memory hogs: 64 usec =>  8 usec
> The memory.stat latency improves.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Tested-by: Greg Thelen <gthelen@google.com>
> ---
>  kernel/cgroup/rstat.c | 12 +++++-------
>  1 file changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index aac91466279f..976c24b3671a 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -323,13 +323,11 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>  			rcu_read_unlock();
>  		}
>  
> -		/* play nice and yield if necessary */
> -		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
> -			__cgroup_rstat_unlock(cgrp, cpu);
> -			if (!cond_resched())
> -				cpu_relax();
> -			__cgroup_rstat_lock(cgrp, cpu);
> -		}
> +		/* play nice and avoid disabling interrupts for a long time */
> +		__cgroup_rstat_unlock(cgrp, cpu);
> +		if (!cond_resched())
> +			cpu_relax();
> +		__cgroup_rstat_lock(cgrp, cpu);
>  	}
>  }

Is this not going a little too far?

The lock + irq trip is quite expensive in its own right and is now
going to be paid for each cpu, so the total time spent executing
cgroup_rstat_flush_locked() is going to go up.

Would your problem go away toggling this every -- say -- 8 cpus?

Just a suggestion.
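
For reference, a sketch of that batching on top of the patch (the
threshold of 8 and the local counter are purely illustrative, not a
tested proposal):

	int batch = 0;

	for_each_possible_cpu(cpu) {
		/* ... per-cpu flush as in the patch ... */

		/* play nice, but only pay the unlock/irq trip every 8 cpus */
		if (++batch >= 8 || need_resched() ||
		    spin_needbreak(&cgroup_rstat_lock)) {
			batch = 0;
			__cgroup_rstat_unlock(cgrp, cpu);
			if (!cond_resched())
				cpu_relax();
			__cgroup_rstat_lock(cgrp, cpu);
		}
	}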
Yosry Ahmed March 19, 2025, 5:16 p.m. UTC | #4
On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while
> iterating all possible cpus. It only drops the lock if there is
> scheduler or spin lock contention. If neither, then interrupts can be
> disabled for a long time. On large machines this can disable interrupts
> for a long enough time to drop network packets. On 400+ CPU machines
> I've seen interrupt disabled for over 40 msec.
> 
> Prevent rstat from disabling interrupts while processing all possible
> cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. This
> approach was previously discussed in
> https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/,
> though this was in the context of an non-irq rstat spin lock.
> 
> Benchmark this change with:
> 1) a single stat_reader process with 400 threads, each reading a test
>    memcg's memory.stat repeatedly for 10 seconds.
> 2) 400 memory hog processes running in the test memcg and repeatedly
>    charging memory until oom killed. Then they repeat charging and oom
>    killing.
> 
> v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds
> interrupts are disabled by rstat for 45341 usec:
>   #  => started at: _raw_spin_lock_irq
>   #  => ended at:   cgroup_rstat_flush
>   #
>   #
>   #                    _------=> CPU#
>   #                   / _-----=> irqs-off/BH-disabled
>   #                  | / _----=> need-resched
>   #                  || / _---=> hardirq/softirq
>   #                  ||| / _--=> preempt-depth
>   #                  |||| / _-=> migrate-disable
>   #                  ||||| /     delay
>   #  cmd     pid     |||||| time  |   caller
>   #     \   /        ||||||  \    |    /
>   stat_rea-96532    52d....    0us*: _raw_spin_lock_irq
>   stat_rea-96532    52d.... 45342us : cgroup_rstat_flush
>   stat_rea-96532    52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush
>   stat_rea-96532    52d.... 45343us : <stack trace>
>    => memcg1_stat_format
>    => memory_stat_format
>    => memory_stat_show
>    => seq_read_iter
>    => vfs_read
>    => ksys_read
>    => do_syscall_64
>    => entry_SYSCALL_64_after_hwframe
> 
> With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the
> longest holder. The longest irqs-off holder has irqs disabled for
> 4142 usec, a huge reduction from previous 45341 usec rstat finding.
> 
> Running stat_reader memory.stat reader for 10 seconds:
> - without memory hogs: 9.84M accesses => 12.7M accesses
> -    with memory hogs: 9.46M accesses => 11.1M accesses
> The throughput of memory.stat access improves.
> 
> The mode of memory.stat access latency after grouping by of 2 buckets:
> - without memory hogs: 64 usec => 16 usec
> -    with memory hogs: 64 usec =>  8 usec
> The memory.stat latency improves.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Tested-by: Greg Thelen <gthelen@google.com>
> ---
>  kernel/cgroup/rstat.c | 12 +++++-------
>  1 file changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index aac91466279f..976c24b3671a 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -323,13 +323,11 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>  			rcu_read_unlock();
>  		}
>  
> -		/* play nice and yield if necessary */
> -		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
> -			__cgroup_rstat_unlock(cgrp, cpu);
> -			if (!cond_resched())
> -				cpu_relax();
> -			__cgroup_rstat_lock(cgrp, cpu);
> -		}
> +		/* play nice and avoid disabling interrupts for a long time */
> +		__cgroup_rstat_unlock(cgrp, cpu);
> +		if (!cond_resched())
> +			cpu_relax();
> +		__cgroup_rstat_lock(cgrp, cpu);

This patch looks good as-is, so feel free to add:

Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>

That being said I think we can do further cleanups here now. We should
probably inline cgroup_rstat_flush_locked() into cgroup_rstat_flush(),
and move the lock hold and release into the loop.
cgroup_rstat_flush_hold() can simply call cgroup_rstat_flush() then hold
the lock.

Something like this on top:

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 976c24b3671a7..4f4b8d22555d7 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -299,17 +299,29 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
 	spin_unlock_irq(&cgroup_rstat_lock);
 }
 
-/* see cgroup_rstat_flush() */
-static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
-	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
+/**
+ * cgroup_rstat_flush - flush stats in @cgrp's subtree
+ * @cgrp: target cgroup
+ *
+ * Collect all per-cpu stats in @cgrp's subtree into the global counters
+ * and propagate them upwards.  After this function returns, all cgroups in
+ * the subtree have up-to-date ->stat.
+ *
+ * This also gets all cgroups in the subtree including @cgrp off the
+ * ->updated_children lists.
+ *
+ * This function may block.
+ */
+__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
 {
 	int cpu;
 
-	lockdep_assert_held(&cgroup_rstat_lock);
-
+	might_sleep();
 	for_each_possible_cpu(cpu) {
 		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
 
+		/* Reacquire for every CPU to avoid disabling IRQs for too long */
+		__cgroup_rstat_lock(cgrp, cpu);
 		for (; pos; pos = pos->rstat_flush_next) {
 			struct cgroup_subsys_state *css;
 
@@ -322,37 +334,12 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 				css->ss->css_rstat_flush(css, cpu);
 			rcu_read_unlock();
 		}
-
-		/* play nice and avoid disabling interrupts for a long time */
 		__cgroup_rstat_unlock(cgrp, cpu);
 		if (!cond_resched())
 			cpu_relax();
-		__cgroup_rstat_lock(cgrp, cpu);
 	}
 }
 
-/**
- * cgroup_rstat_flush - flush stats in @cgrp's subtree
- * @cgrp: target cgroup
- *
- * Collect all per-cpu stats in @cgrp's subtree into the global counters
- * and propagate them upwards.  After this function returns, all cgroups in
- * the subtree have up-to-date ->stat.
- *
- * This also gets all cgroups in the subtree including @cgrp off the
- * ->updated_children lists.
- *
- * This function may block.
- */
-__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
-{
-	might_sleep();
-
-	__cgroup_rstat_lock(cgrp, -1);
-	cgroup_rstat_flush_locked(cgrp);
-	__cgroup_rstat_unlock(cgrp, -1);
-}
-
 /**
  * cgroup_rstat_flush_hold - flush stats in @cgrp's subtree and hold
  * @cgrp: target cgroup
@@ -365,9 +352,8 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
 void cgroup_rstat_flush_hold(struct cgroup *cgrp)
 	__acquires(&cgroup_rstat_lock)
 {
-	might_sleep();
+	cgroup_rstat_flush(cgrp);
 	__cgroup_rstat_lock(cgrp, -1);
-	cgroup_rstat_flush_locked(cgrp);
 }
 
 /**
Yosry Ahmed March 19, 2025, 5:18 p.m. UTC | #5
On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote:
> On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while
> > iterating all possible cpus. It only drops the lock if there is
> > scheduler or spin lock contention. If neither, then interrupts can be
> > disabled for a long time. On large machines this can disable interrupts
> > for a long enough time to drop network packets. On 400+ CPU machines
> > I've seen interrupt disabled for over 40 msec.
> > 
> > Prevent rstat from disabling interrupts while processing all possible
> > cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. This
> > approach was previously discussed in
> > https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/,
> > though this was in the context of an non-irq rstat spin lock.
> > 
> > Benchmark this change with:
> > 1) a single stat_reader process with 400 threads, each reading a test
> >    memcg's memory.stat repeatedly for 10 seconds.
> > 2) 400 memory hog processes running in the test memcg and repeatedly
> >    charging memory until oom killed. Then they repeat charging and oom
> >    killing.
> > 
> > v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds
> > interrupts are disabled by rstat for 45341 usec:
> >   #  => started at: _raw_spin_lock_irq
> >   #  => ended at:   cgroup_rstat_flush
> >   #
> >   #
> >   #                    _------=> CPU#
> >   #                   / _-----=> irqs-off/BH-disabled
> >   #                  | / _----=> need-resched
> >   #                  || / _---=> hardirq/softirq
> >   #                  ||| / _--=> preempt-depth
> >   #                  |||| / _-=> migrate-disable
> >   #                  ||||| /     delay
> >   #  cmd     pid     |||||| time  |   caller
> >   #     \   /        ||||||  \    |    /
> >   stat_rea-96532    52d....    0us*: _raw_spin_lock_irq
> >   stat_rea-96532    52d.... 45342us : cgroup_rstat_flush
> >   stat_rea-96532    52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush
> >   stat_rea-96532    52d.... 45343us : <stack trace>
> >    => memcg1_stat_format
> >    => memory_stat_format
> >    => memory_stat_show
> >    => seq_read_iter
> >    => vfs_read
> >    => ksys_read
> >    => do_syscall_64
> >    => entry_SYSCALL_64_after_hwframe
> > 
> > With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the
> > longest holder. The longest irqs-off holder has irqs disabled for
> > 4142 usec, a huge reduction from previous 45341 usec rstat finding.
> > 
> > Running stat_reader memory.stat reader for 10 seconds:
> > - without memory hogs: 9.84M accesses => 12.7M accesses
> > -    with memory hogs: 9.46M accesses => 11.1M accesses
> > The throughput of memory.stat access improves.
> > 
> > The mode of memory.stat access latency after grouping by of 2 buckets:
> > - without memory hogs: 64 usec => 16 usec
> > -    with memory hogs: 64 usec =>  8 usec
> > The memory.stat latency improves.
> > 
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Greg Thelen <gthelen@google.com>
> > Tested-by: Greg Thelen <gthelen@google.com>
> > ---
> >  kernel/cgroup/rstat.c | 12 +++++-------
> >  1 file changed, 5 insertions(+), 7 deletions(-)
> > 
> > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> > index aac91466279f..976c24b3671a 100644
> > --- a/kernel/cgroup/rstat.c
> > +++ b/kernel/cgroup/rstat.c
> > @@ -323,13 +323,11 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
> >  			rcu_read_unlock();
> >  		}
> >  
> > -		/* play nice and yield if necessary */
> > -		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
> > -			__cgroup_rstat_unlock(cgrp, cpu);
> > -			if (!cond_resched())
> > -				cpu_relax();
> > -			__cgroup_rstat_lock(cgrp, cpu);
> > -		}
> > +		/* play nice and avoid disabling interrupts for a long time */
> > +		__cgroup_rstat_unlock(cgrp, cpu);
> > +		if (!cond_resched())
> > +			cpu_relax();
> > +		__cgroup_rstat_lock(cgrp, cpu);
> >  	}
> >  }
> 
> Is not this going a little too far?
> 
> the lock + irq trip is quite expensive in its own right and now is
> going to be paid for each cpu, as in the total time spent executing
> cgroup_rstat_flush_locked is going to go up.
> 
> Would your problem go away toggling this every -- say -- 8 cpus?

I was concerned about this too, and about more lock bouncing, but the
testing suggests that this actually overall improves the latency of
cgroup_rstat_flush_locked() (at least on tested HW).

So I don't think we need to do something like this unless a regression
is observed.

> 
> Just a suggestion.
>
Tejun Heo March 19, 2025, 5:26 p.m. UTC | #6
On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while
> iterating all possible cpus. It only drops the lock if there is
> scheduler or spin lock contention. If neither, then interrupts can be
> disabled for a long time. On large machines this can disable interrupts
> for a long enough time to drop network packets. On 400+ CPU machines
> I've seen interrupt disabled for over 40 msec.
> 
> Prevent rstat from disabling interrupts while processing all possible
> cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. This
> approach was previously discussed in
> https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/,
> though this was in the context of an non-irq rstat spin lock.
> 
> Benchmark this change with:
> 1) a single stat_reader process with 400 threads, each reading a test
>    memcg's memory.stat repeatedly for 10 seconds.
> 2) 400 memory hog processes running in the test memcg and repeatedly
>    charging memory until oom killed. Then they repeat charging and oom
>    killing.
> 
> v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds
> interrupts are disabled by rstat for 45341 usec:
>   #  => started at: _raw_spin_lock_irq
>   #  => ended at:   cgroup_rstat_flush
>   #
>   #
>   #                    _------=> CPU#
>   #                   / _-----=> irqs-off/BH-disabled
>   #                  | / _----=> need-resched
>   #                  || / _---=> hardirq/softirq
>   #                  ||| / _--=> preempt-depth
>   #                  |||| / _-=> migrate-disable
>   #                  ||||| /     delay
>   #  cmd     pid     |||||| time  |   caller
>   #     \   /        ||||||  \    |    /
>   stat_rea-96532    52d....    0us*: _raw_spin_lock_irq
>   stat_rea-96532    52d.... 45342us : cgroup_rstat_flush
>   stat_rea-96532    52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush
>   stat_rea-96532    52d.... 45343us : <stack trace>
>    => memcg1_stat_format
>    => memory_stat_format
>    => memory_stat_show
>    => seq_read_iter
>    => vfs_read
>    => ksys_read
>    => do_syscall_64
>    => entry_SYSCALL_64_after_hwframe
> 
> With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the
> longest holder. The longest irqs-off holder has irqs disabled for
> 4142 usec, a huge reduction from previous 45341 usec rstat finding.
> 
> Running stat_reader memory.stat reader for 10 seconds:
> - without memory hogs: 9.84M accesses => 12.7M accesses
> -    with memory hogs: 9.46M accesses => 11.1M accesses
> The throughput of memory.stat access improves.
> 
> The mode of memory.stat access latency after grouping by of 2 buckets:
> - without memory hogs: 64 usec => 16 usec
> -    with memory hogs: 64 usec =>  8 usec
> The memory.stat latency improves.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Tested-by: Greg Thelen <gthelen@google.com>

Applied to cgroup/for-6.15.

Thanks.
Tejun Heo March 19, 2025, 5:26 p.m. UTC | #7
On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote:
...
> Is not this going a little too far?
> 
> the lock + irq trip is quite expensive in its own right and now is
> going to be paid for each cpu, as in the total time spent executing
> cgroup_rstat_flush_locked is going to go up.

Lock/unlock when the cacheline is already on the cpu is pretty cheap and on
modern x86 cpus at least irq on/off are also only a few cycles, so you
probably wouldn't be able to tell the difference.

Thanks.
Johannes Weiner March 19, 2025, 5:35 p.m. UTC | #8
On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while
> iterating all possible cpus. It only drops the lock if there is
> scheduler or spin lock contention. If neither, then interrupts can be
> disabled for a long time. On large machines this can disable interrupts
> for a long enough time to drop network packets. On 400+ CPU machines
> I've seen interrupt disabled for over 40 msec.
> 
> Prevent rstat from disabling interrupts while processing all possible
> cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu. This
> approach was previously discussed in
> https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/,
> though this was in the context of an non-irq rstat spin lock.
> 
> Benchmark this change with:
> 1) a single stat_reader process with 400 threads, each reading a test
>    memcg's memory.stat repeatedly for 10 seconds.
> 2) 400 memory hog processes running in the test memcg and repeatedly
>    charging memory until oom killed. Then they repeat charging and oom
>    killing.
> 
> v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds
> interrupts are disabled by rstat for 45341 usec:
>   #  => started at: _raw_spin_lock_irq
>   #  => ended at:   cgroup_rstat_flush
>   #
>   #
>   #                    _------=> CPU#
>   #                   / _-----=> irqs-off/BH-disabled
>   #                  | / _----=> need-resched
>   #                  || / _---=> hardirq/softirq
>   #                  ||| / _--=> preempt-depth
>   #                  |||| / _-=> migrate-disable
>   #                  ||||| /     delay
>   #  cmd     pid     |||||| time  |   caller
>   #     \   /        ||||||  \    |    /
>   stat_rea-96532    52d....    0us*: _raw_spin_lock_irq
>   stat_rea-96532    52d.... 45342us : cgroup_rstat_flush
>   stat_rea-96532    52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush
>   stat_rea-96532    52d.... 45343us : <stack trace>
>    => memcg1_stat_format
>    => memory_stat_format
>    => memory_stat_show
>    => seq_read_iter
>    => vfs_read
>    => ksys_read
>    => do_syscall_64
>    => entry_SYSCALL_64_after_hwframe
> 
> With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the
> longest holder. The longest irqs-off holder has irqs disabled for
> 4142 usec, a huge reduction from previous 45341 usec rstat finding.
> 
> Running stat_reader memory.stat reader for 10 seconds:
> - without memory hogs: 9.84M accesses => 12.7M accesses
> -    with memory hogs: 9.46M accesses => 11.1M accesses
> The throughput of memory.stat access improves.
> 
> The mode of memory.stat access latency after grouping by of 2 buckets:
> - without memory hogs: 64 usec => 16 usec
> -    with memory hogs: 64 usec =>  8 usec
> The memory.stat latency improves.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Tested-by: Greg Thelen <gthelen@google.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Shakeel Butt March 19, 2025, 5:53 p.m. UTC | #9
On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while
> iterating all possible cpus. It only drops the lock if there is
> scheduler or spin lock contention. If neither, then interrupts can be
> disabled for a long time. On large machines this can disable interrupts
> for a long enough time to drop network packets. On 400+ CPU machines
> I've seen interrupt disabled for over 40 msec.

Which kernel was this observed on in production?

> 
> Prevent rstat from disabling interrupts while processing all possible
> cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu.

Doing this for each cpu might be too extreme. Have you tried some
batching?

> This
> approach was previously discussed in
> https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/,
> though this was in the context of an non-irq rstat spin lock.
> 
> Benchmark this change with:
> 1) a single stat_reader process with 400 threads, each reading a test
>    memcg's memory.stat repeatedly for 10 seconds.
> 2) 400 memory hog processes running in the test memcg and repeatedly
>    charging memory until oom killed. Then they repeat charging and oom
>    killing.
> 

Though this benchmark seems extreme, userspace holding off irqs for
that long is bad. BTW are these memory hoggers creating anon memory or
file memory? Is [z]swap enabled?

For the long term, I think we can make this work without disabling
irqs, similar to how networking manages the sock lock.
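
(Purely to illustrate that direction -- hypothetical names, not an
existing API: the sock-lock style keeps interrupts disabled only around
a tiny ownership flag, while the expensive flush work runs with
interrupts enabled.)

struct rstat_flush_lock {
	spinlock_t		lock;	/* protects @owned, held briefly */
	bool			owned;	/* a flusher currently owns the stats */
	wait_queue_head_t	wq;
};

static void rstat_flush_begin(struct rstat_flush_lock *l)
{
	spin_lock_irq(&l->lock);
	while (l->owned) {
		spin_unlock_irq(&l->lock);
		wait_event(l->wq, !READ_ONCE(l->owned));
		spin_lock_irq(&l->lock);
	}
	l->owned = true;
	/* irqs were off only for this short window */
	spin_unlock_irq(&l->lock);
}

static void rstat_flush_end(struct rstat_flush_lock *l)
{
	spin_lock_irq(&l->lock);
	l->owned = false;
	spin_unlock_irq(&l->lock);
	wake_up(&l->wq);
}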

> v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds
> interrupts are disabled by rstat for 45341 usec:
>   #  => started at: _raw_spin_lock_irq
>   #  => ended at:   cgroup_rstat_flush
>   #
>   #
>   #                    _------=> CPU#
>   #                   / _-----=> irqs-off/BH-disabled
>   #                  | / _----=> need-resched
>   #                  || / _---=> hardirq/softirq
>   #                  ||| / _--=> preempt-depth
>   #                  |||| / _-=> migrate-disable
>   #                  ||||| /     delay
>   #  cmd     pid     |||||| time  |   caller
>   #     \   /        ||||||  \    |    /
>   stat_rea-96532    52d....    0us*: _raw_spin_lock_irq
>   stat_rea-96532    52d.... 45342us : cgroup_rstat_flush
>   stat_rea-96532    52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush
>   stat_rea-96532    52d.... 45343us : <stack trace>
>    => memcg1_stat_format
>    => memory_stat_format
>    => memory_stat_show
>    => seq_read_iter
>    => vfs_read
>    => ksys_read
>    => do_syscall_64
>    => entry_SYSCALL_64_after_hwframe
> 
> With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the
> longest holder. The longest irqs-off holder has irqs disabled for
> 4142 usec, a huge reduction from previous 45341 usec rstat finding.
> 
> Running stat_reader memory.stat reader for 10 seconds:
> - without memory hogs: 9.84M accesses => 12.7M accesses
> -    with memory hogs: 9.46M accesses => 11.1M accesses
> The throughput of memory.stat access improves.
> 
> The mode of memory.stat access latency after grouping by of 2 buckets:
> - without memory hogs: 64 usec => 16 usec
> -    with memory hogs: 64 usec =>  8 usec
> The memory.stat latency improves.

So things are improving even without batching. I wonder how this will
look with fewer readers. Can you try with a single reader as well?

> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Tested-by: Greg Thelen <gthelen@google.com>

Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Johannes Weiner March 19, 2025, 6:06 p.m. UTC | #10
On Wed, Mar 19, 2025 at 05:16:02PM +0000, Yosry Ahmed wrote:
> @@ -365,9 +352,8 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
>  void cgroup_rstat_flush_hold(struct cgroup *cgrp)
>  	__acquires(&cgroup_rstat_lock)
>  {
> -	might_sleep();
> +	cgroup_rstat_flush(cgrp);
>  	__cgroup_rstat_lock(cgrp, -1);
> -	cgroup_rstat_flush_locked(cgrp);
>  }

Might as well remove cgroup_rstat_flush_hold/release entirely? There
are no external users, and the concept seems moot when the lock is
dropped by default. cgroup_base_stat_cputime_show() can open-code the
lock/unlock to stabilize the counts while reading.
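
Roughly what that open-coding could look like (just a sketch, with the
actual field copying and printing elided):

void cgroup_base_stat_cputime_show(struct seq_file *seq)
{
	struct cgroup *cgrp = seq_css(seq)->cgroup;

	if (cgroup_parent(cgrp)) {
		cgroup_rstat_flush(cgrp);	/* drops the lock per cpu */
		__cgroup_rstat_lock(cgrp, -1);	/* stabilize while copying */
		/* ... copy usage/utime/stime/ntime from cgrp->bstat ... */
		__cgroup_rstat_unlock(cgrp, -1);
	} else {
		/* ... root handling as today ... */
	}
	/* ... cputime_adjust() and seq_printf() of the copied values ... */
}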

(btw, why do we not have any locking around the root stats in
cgroup_base_stat_cputime_show()? There isn't anything preventing a
reader from seeing all zeroes if another reader runs the memset() on
cgrp->bstat, is there? Or double times...)
Yosry Ahmed March 19, 2025, 6:35 p.m. UTC | #11
On Wed, Mar 19, 2025 at 02:06:43PM -0400, Johannes Weiner wrote:
> On Wed, Mar 19, 2025 at 05:16:02PM +0000, Yosry Ahmed wrote:
> > @@ -365,9 +352,8 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
> >  void cgroup_rstat_flush_hold(struct cgroup *cgrp)
> >  	__acquires(&cgroup_rstat_lock)
> >  {
> > -	might_sleep();
> > +	cgroup_rstat_flush(cgrp);
> >  	__cgroup_rstat_lock(cgrp, -1);
> > -	cgroup_rstat_flush_locked(cgrp);
> >  }
> 
> Might as well remove cgroup_rstat_flush_hold/release entirely? There
> are no external users, and the concept seems moot when the lock is
> dropped per default. cgroup_base_stat_cputime_show() can open-code the
> lock/unlock to stabilize the counts while reading.
 
Yeah I missed the fact that the users are internal because the functions
are not static. I also don't see the point of keeping them.

Tejun/Greg, should I send a patch on top of this one or do you prefer
sending a new version?


> (btw, why do we not have any locking around the root stats in
> cgroup_base_stat_cputime_show()? There isn't anything preventing a
> reader from seeing all zeroes if another reader runs the memset() on
> cgrp->bstat, is there? Or double times...)


(I think root_cgroup_cputime() operates on a stack allocated bstat, not
cgrp->bstat)
Eric Dumazet March 19, 2025, 6:37 p.m. UTC | #12
On Wed, Mar 19, 2025 at 6:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:

> On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while
> > iterating all possible cpus. It only drops the lock if there is
> > scheduler or spin lock contention. If neither, then interrupts can be
> > disabled for a long time. On large machines this can disable interrupts
> > for a long enough time to drop network packets. On 400+ CPU machines
> > I've seen interrupt disabled for over 40 msec.
>
> Which kernel was this observed on in production?
>

I had these issues with the latest upstream kernels, and have seen an
80 ms delay.

./trace-cgroup_rstat_flush_locked.sh
Attaching 3 probes...
^C

@cgroup_rstat_flush_locked_us:
[64, 128)              2 |@                                                   |
[128, 256)             3 |@                                                   |
[256, 512)             2 |@                                                   |
[512, 1K)              6 |@@@                                                 |
[1K, 2K)               8 |@@@@@                                               |
[2K, 4K)               3 |@                                                   |
[4K, 8K)               0 |                                                    |
[8K, 16K)              0 |                                                    |
[16K, 32K)             4 |@@                                                  |
[32K, 64K)            83 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64K, 128K)           74 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |

@max_us: 79214




Greg in his tests was "only" able to get 45 ms.



>
> >
> > Prevent rstat from disabling interrupts while processing all possible
> > cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu.
>
> Doing for each cpu might be too extreme. Have you tried with some
> batching?
>

Not really, modern platforms are doing the unlock/lock pair faster than
all the cache line misses done in the per-cpu loop.



>
> > This
> > approach was previously discussed in
> > https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/,
> > though this was in the context of an non-irq rstat spin lock.
> >
> > Benchmark this change with:
> > 1) a single stat_reader process with 400 threads, each reading a test
> >    memcg's memory.stat repeatedly for 10 seconds.
> > 2) 400 memory hog processes running in the test memcg and repeatedly
> >    charging memory until oom killed. Then they repeat charging and oom
> >    killing.
> >
>
> Though this benchmark seems too extreme but userspace holding off irqs
> for that long time is bad. BTW are these memory hoggers, creating anon
> memory or file memory? Is [z]swap enabled?
>
> For the long term, I think we can use make this work without disabling
> irqs, similar to how networking manages sock lock.
>
> > v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds
> > interrupts are disabled by rstat for 45341 usec:
> >   #  => started at: _raw_spin_lock_irq
> >   #  => ended at:   cgroup_rstat_flush
> >   #
> >   #
> >   #                    _------=> CPU#
> >   #                   / _-----=> irqs-off/BH-disabled
> >   #                  | / _----=> need-resched
> >   #                  || / _---=> hardirq/softirq
> >   #                  ||| / _--=> preempt-depth
> >   #                  |||| / _-=> migrate-disable
> >   #                  ||||| /     delay
> >   #  cmd     pid     |||||| time  |   caller
> >   #     \   /        ||||||  \    |    /
> >   stat_rea-96532    52d....    0us*: _raw_spin_lock_irq
> >   stat_rea-96532    52d.... 45342us : cgroup_rstat_flush
> >   stat_rea-96532    52d.... 45342us : tracer_hardirqs_on
> <-cgroup_rstat_flush
> >   stat_rea-96532    52d.... 45343us : <stack trace>
> >    => memcg1_stat_format
> >    => memory_stat_format
> >    => memory_stat_show
> >    => seq_read_iter
> >    => vfs_read
> >    => ksys_read
> >    => do_syscall_64
> >    => entry_SYSCALL_64_after_hwframe
> >
> > With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the
> > longest holder. The longest irqs-off holder has irqs disabled for
> > 4142 usec, a huge reduction from previous 45341 usec rstat finding.
> >
> > Running stat_reader memory.stat reader for 10 seconds:
> > - without memory hogs: 9.84M accesses => 12.7M accesses
> > -    with memory hogs: 9.46M accesses => 11.1M accesses
> > The throughput of memory.stat access improves.
> >
> > The mode of memory.stat access latency after grouping by of 2 buckets:
> > - without memory hogs: 64 usec => 16 usec
> > -    with memory hogs: 64 usec =>  8 usec
> > The memory.stat latency improves.
>
> So, things are improving even without batching. I wonder if there are
> less readers then how will this look like. Can you try with single
> reader as well?
>
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Greg Thelen <gthelen@google.com>
> > Tested-by: Greg Thelen <gthelen@google.com>
>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>
Tejun Heo March 19, 2025, 7:10 p.m. UTC | #13
On Wed, Mar 19, 2025 at 06:35:02PM +0000, Yosry Ahmed wrote:
> On Wed, Mar 19, 2025 at 02:06:43PM -0400, Johannes Weiner wrote:
> > On Wed, Mar 19, 2025 at 05:16:02PM +0000, Yosry Ahmed wrote:
> > > @@ -365,9 +352,8 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
> > >  void cgroup_rstat_flush_hold(struct cgroup *cgrp)
> > >  	__acquires(&cgroup_rstat_lock)
> > >  {
> > > -	might_sleep();
> > > +	cgroup_rstat_flush(cgrp);
> > >  	__cgroup_rstat_lock(cgrp, -1);
> > > -	cgroup_rstat_flush_locked(cgrp);
> > >  }
> > 
> > Might as well remove cgroup_rstat_flush_hold/release entirely? There
> > are no external users, and the concept seems moot when the lock is
> > dropped per default. cgroup_base_stat_cputime_show() can open-code the
> > lock/unlock to stabilize the counts while reading.
>  
> Yeah I missed the fact that the users are internal because the functions
> are not static. I also don't see the point of keeping them.
> 
> Tejun/Greg, should I send a patch on top of this one or do you prefer
> sending a new version?

Please send a patch on top.

Thanks.
Johannes Weiner March 19, 2025, 7:16 p.m. UTC | #14
On Wed, Mar 19, 2025 at 06:35:02PM +0000, Yosry Ahmed wrote:
> On Wed, Mar 19, 2025 at 02:06:43PM -0400, Johannes Weiner wrote:
> > (btw, why do we not have any locking around the root stats in
> > cgroup_base_stat_cputime_show()? There isn't anything preventing a
> > reader from seeing all zeroes if another reader runs the memset() on
> > cgrp->bstat, is there? Or double times...)
> 
> (I think root_cgroup_cputime() operates on a stack allocated bstat, not
> cgrp->bstat)

That was the case until:

commit b824766504e49f3fdcbb8c722e70996a78c3636e
Author: Chen Ridong <chenridong@huawei.com>
Date:   Thu Jul 4 14:01:19 2024 +0000

    cgroup/rstat: add force idle show helper

Now it's doing this:

void cgroup_base_stat_cputime_show(struct seq_file *seq)
{
	struct cgroup *cgrp = seq_css(seq)->cgroup;

	if (cgroup_parent(cgrp)) {
		...
	} else {
		/* cgrp->bstat of root is not actually used, reuse it */
		root_cgroup_cputime(&cgrp->bstat);
		usage = cgrp->bstat.cputime.sum_exec_runtime;
		utime = cgrp->bstat.cputime.utime;
		stime = cgrp->bstat.cputime.stime;
		ntime = cgrp->bstat.ntime;
	}
}

and then:

static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
{
	struct task_cputime *cputime = &bstat->cputime;
	int i;

	memset(bstat, 0, sizeof(*bstat));

	/* various += on bstat and cputime members */
}

So all readers are mucking with the root cgroup's bstat.
Johannes Weiner March 19, 2025, 7:46 p.m. UTC | #15
On Wed, Mar 19, 2025 at 03:16:42PM -0400, Johannes Weiner wrote:
> On Wed, Mar 19, 2025 at 06:35:02PM +0000, Yosry Ahmed wrote:
> > On Wed, Mar 19, 2025 at 02:06:43PM -0400, Johannes Weiner wrote:
> > > (btw, why do we not have any locking around the root stats in
> > > cgroup_base_stat_cputime_show()? There isn't anything preventing a
> > > reader from seeing all zeroes if another reader runs the memset() on
> > > cgrp->bstat, is there? Or double times...)
> > 
> > (I think root_cgroup_cputime() operates on a stack allocated bstat, not
> > cgrp->bstat)
> 
> That was the case until:
> 
> commit b824766504e49f3fdcbb8c722e70996a78c3636e
> Author: Chen Ridong <chenridong@huawei.com>
> Date:   Thu Jul 4 14:01:19 2024 +0000

Nevermind, Tejun pointed me to the follow-up fix he's got already
queued up:

https://web.git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/commit/?id=c4af66a95aa3bc1d4f607ebd4eea524fb58946e3

That brings it all back on the stack.
Greg Thelen March 20, 2025, 2:43 p.m. UTC | #16
On Wed, Mar 19, 2025 at 10:53 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Wed, Mar 19, 2025 at 12:13:30AM -0700, Greg Thelen wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > cgroup_rstat_flush_locked() grabs the irq safe cgroup_rstat_lock while
> > iterating all possible cpus. It only drops the lock if there is
> > scheduler or spin lock contention. If neither, then interrupts can be
> > disabled for a long time. On large machines this can disable interrupts
> > for a long enough time to drop network packets. On 400+ CPU machines
> > I've seen interrupt disabled for over 40 msec.
>
> Which kernel was this observed on in production?
>
> >
> > Prevent rstat from disabling interrupts while processing all possible
> > cpus. Instead drop and reacquire cgroup_rstat_lock for each cpu.
>
> Doing for each cpu might be too extreme. Have you tried with some
> batching?
>
> > This
> > approach was previously discussed in
> > https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/,
> > though this was in the context of an non-irq rstat spin lock.
> >
> > Benchmark this change with:
> > 1) a single stat_reader process with 400 threads, each reading a test
> >    memcg's memory.stat repeatedly for 10 seconds.
> > 2) 400 memory hog processes running in the test memcg and repeatedly
> >    charging memory until oom killed. Then they repeat charging and oom
> >    killing.
> >
>
> Though this benchmark seems too extreme but userspace holding off irqs
> for that long time is bad. BTW are these memory hoggers, creating anon
> memory or file memory? Is [z]swap enabled?

The memory hoggers were anon, without any form of swap.

I think the other questions were answered in other replies, but feel free to
re-ask and I'll provide details.


> For the long term, I think we can use make this work without disabling
> irqs, similar to how networking manages sock lock.
>
> > v6.14-rc6 with CONFIG_IRQSOFF_TRACER with stat_reader and hogs, finds
> > interrupts are disabled by rstat for 45341 usec:
> >   #  => started at: _raw_spin_lock_irq
> >   #  => ended at:   cgroup_rstat_flush
> >   #
> >   #
> >   #                    _------=> CPU#
> >   #                   / _-----=> irqs-off/BH-disabled
> >   #                  | / _----=> need-resched
> >   #                  || / _---=> hardirq/softirq
> >   #                  ||| / _--=> preempt-depth
> >   #                  |||| / _-=> migrate-disable
> >   #                  ||||| /     delay
> >   #  cmd     pid     |||||| time  |   caller
> >   #     \   /        ||||||  \    |    /
> >   stat_rea-96532    52d....    0us*: _raw_spin_lock_irq
> >   stat_rea-96532    52d.... 45342us : cgroup_rstat_flush
> >   stat_rea-96532    52d.... 45342us : tracer_hardirqs_on <-cgroup_rstat_flush
> >   stat_rea-96532    52d.... 45343us : <stack trace>
> >    => memcg1_stat_format
> >    => memory_stat_format
> >    => memory_stat_show
> >    => seq_read_iter
> >    => vfs_read
> >    => ksys_read
> >    => do_syscall_64
> >    => entry_SYSCALL_64_after_hwframe
> >
> > With this patch the CONFIG_IRQSOFF_TRACER doesn't find rstat to be the
> > longest holder. The longest irqs-off holder has irqs disabled for
> > 4142 usec, a huge reduction from previous 45341 usec rstat finding.
> >
> > Running stat_reader memory.stat reader for 10 seconds:
> > - without memory hogs: 9.84M accesses => 12.7M accesses
> > -    with memory hogs: 9.46M accesses => 11.1M accesses
> > The throughput of memory.stat access improves.
> >
> > The mode of memory.stat access latency after grouping by of 2 buckets:
> > - without memory hogs: 64 usec => 16 usec
> > -    with memory hogs: 64 usec =>  8 usec
> > The memory.stat latency improves.
>
> So, things are improving even without batching. I wonder if there are
> less readers then how will this look like. Can you try with single
> reader as well?
>
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Greg Thelen <gthelen@google.com>
> > Tested-by: Greg Thelen <gthelen@google.com>
>
> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
>
Mateusz Guzik March 26, 2025, 11:57 p.m. UTC | #17
On Wed, Mar 19, 2025 at 6:26 PM Tejun Heo <tj@kernel.org> wrote:
>
> On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote:
> ...
> > Is not this going a little too far?
> >
> > the lock + irq trip is quite expensive in its own right and now is
> > going to be paid for each cpu, as in the total time spent executing
> > cgroup_rstat_flush_locked is going to go up.
>
> Lock/unlock when the cacheline is already on the cpu is pretty cheap and on
> modern x86 cpus at least irq on/off are also only a few cycles, so you
> probably wouldn't be able to tell the difference.
>

This does not line up with what I had seen on x86-64 uarchs up to this
point, including Sapphire Rapids, which I think still counts as
reasonably new.

Most notably I see smp_mb() as a factor in my profiles; it compiles to
a lock add $0 on the stack -- a very much cache-hot location.
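
(For reference, the x86-64 definition is roughly the following --
paraphrased from arch/x86/include/asm/barrier.h, check your tree for
the exact form:)

/* a locked RMW on the local stack: the line is cache hot, but the
 * lock prefix still acts as a full barrier and has to drain the
 * store buffer, which is where the cost shows up */
#define __smp_mb()	asm volatile("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")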

However, the cost of an atomic op has a lot to do with the state
accumulated around it -- the more atomics in short succession, the
lesser the impact. That is to say, a given piece of code may already be
slow enough that adding a lock-prefixed instruction does not add
notable overhead.

In order to not just handwave, here is the overhead of
__legitimize_mnt() while performing a path lookup for access() on my
test box. smp_mb() is the stock variant. irq() merely toggles
interrupts; it does not provide any sanity for the routine, but lets me
see the cost if it were planted there for some reason. Finally, the
last variant is what I expect the final routine to have -- merely a
branch to account for bad mounts (taking advantage of synchronize_rcu()
on unmount).
smp_mb:         3.31%
irq:            1.44%
nothing:        0.63%

These would be higher if it were not for the fact that memory
allocation (for the path buffer) is dog slow.

And indeed I get over a 3% speed up in access() rate by not suffering
that smp_mb() [I note I don't have a viable patch for it, rather I
whacked it for benchmarking purposes to see if pursuing proper removal
is worth it]

I'll note though that profiling open()+close() shows virtually no
difference (less than 0.5%) -- but that's not an argument for atomics
being cheap, instead it is an argument for open() in particular being
slow (which it is, I'm working on it though).

If you want I can ship you the test case, diffs and kernel config to
reproduce on your own.

--
Mateusz Guzik <mjguzik gmail.com>
Mateusz Guzik March 27, 2025, 2:38 p.m. UTC | #18
On Wed, Mar 19, 2025 at 05:18:05PM +0000, Yosry Ahmed wrote:
> On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote:
> > Is not this going a little too far?
> > 
> > the lock + irq trip is quite expensive in its own right and now is
> > going to be paid for each cpu, as in the total time spent executing
> > cgroup_rstat_flush_locked is going to go up.
> > 
> > Would your problem go away toggling this every -- say -- 8 cpus?
> 
> I was concerned about this too, and about more lock bouncing, but the
> testing suggests that this actually overall improves the latency of
> cgroup_rstat_flush_locked() (at least on tested HW).
> 
> So I don't think we need to do something like this unless a regression
> is observed.
> 

To my reading it reduces the max time spent with irqs disabled, which
of course it does -- after all, it toggles them for every CPU.

Per my other e-mail in the thread the irq + lock trips remain not cheap
at least on Sapphire Rapids.

In my testing outlined below I see an 11% increase in total execution
time with the irq + lock trip for every CPU in a 24-way vm.

So I stand by instead doing this every n CPUs, call it 8 or whatever.

How to repro:

I employed a poor-man's profiler like so:

bpftrace -e 'kprobe:cgroup_rstat_flush_locked { @start[tid] = nsecs; } kretprobe:cgroup_rstat_flush_locked /@start[tid]/ { print(nsecs - @start[tid]); delete(@start[tid]); } interval:s:60 { exit(); }'

With or without this patch, execution time varies wildly even while the box is idle.

The above runs for a minute, collecting 23 samples (you may get
"lucky" and get one extra, in that case remove it for comparison).

A sysctl was added to toggle the new behavior vs old one. Patch at the
end.

"enabled"(1) means new behavior, "disabled"(0) means the old one.

Sum of nsecs (results piped to: awk '{ sum += $1 } END { print sum }'):
disabled:	903610
enabled:	1006833 (+11.4%)

Toggle at runtime with:
sysctl fs.magic_tunable=0 # disabled, no mandatory relocks
sysctl fs.magic_tunable=1 # enabled, relock for every CPU

I attached the stats I got for reference.

I patched v6.14 with the following:
diff --git a/fs/file_table.c b/fs/file_table.c
index c04ed94cdc4b..441f89421413 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -106,6 +106,8 @@ static int proc_nr_files(const struct ctl_table *table, int write, void *buffer,
 	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
 
+unsigned long magic_tunable;
+
 static const struct ctl_table fs_stat_sysctls[] = {
 	{
 		.procname	= "file-nr",
@@ -123,6 +125,16 @@ static const struct ctl_table fs_stat_sysctls[] = {
 		.extra1		= SYSCTL_LONG_ZERO,
 		.extra2		= SYSCTL_LONG_MAX,
 	},
+	{
+		.procname	= "magic_tunable",
+		.data		= &magic_tunable,
+		.maxlen		= sizeof(magic_tunable),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+		.extra1		= SYSCTL_LONG_ZERO,
+		.extra2		= SYSCTL_LONG_MAX,
+	},
+
 	{
 		.procname	= "nr_open",
 		.data		= &sysctl_nr_open,
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 3e01781aeb7b..f6444bf25b2f 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -299,6 +299,8 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
 	spin_unlock_irq(&cgroup_rstat_lock);
 }
 
+extern unsigned long magic_tunable;
+
 /* see cgroup_rstat_flush() */
 static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
@@ -323,12 +325,18 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 			rcu_read_unlock();
 		}
 
-		/* play nice and yield if necessary */
-		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+		if (READ_ONCE(magic_tunable)) {
 			__cgroup_rstat_unlock(cgrp, cpu);
 			if (!cond_resched())
 				cpu_relax();
 			__cgroup_rstat_lock(cgrp, cpu);
+		} else {
+			if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+				__cgroup_rstat_unlock(cgrp, cpu);
+				if (!cond_resched())
+					cpu_relax();
+				__cgroup_rstat_lock(cgrp, cpu);
+			}
 		}
 	}
 }

disabled (23 samples, nsecs; sum 903610):
69869
30473
64670
30544
30950
36445
36235
29920
51179
35760
33424
42426
30177
31211
44974
34450
37871
72642
33016
29518
31800
35730
30326

enabled samples (nsecs):
63507
50113
36280
35148
63329
41232
51265
41341
41418
42824
35200
35550
54684
41597
55325
36120
48675
41179
39339
35794
38826
37411
40676
Yosry Ahmed March 27, 2025, 5:17 p.m. UTC | #19
On Thu, Mar 27, 2025 at 03:38:50PM +0100, Mateusz Guzik wrote:
> On Wed, Mar 19, 2025 at 05:18:05PM +0000, Yosry Ahmed wrote:
> > On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote:
> > > Is not this going a little too far?
> > > 
> > > the lock + irq trip is quite expensive in its own right and now is
> > > going to be paid for each cpu, as in the total time spent executing
> > > cgroup_rstat_flush_locked is going to go up.
> > > 
> > > Would your problem go away toggling this every -- say -- 8 cpus?
> > 
> > I was concerned about this too, and about more lock bouncing, but the
> > testing suggests that this actually overall improves the latency of
> > cgroup_rstat_flush_locked() (at least on tested HW).
> > 
> > So I don't think we need to do something like this unless a regression
> > is observed.
> > 
> 
> To my reading it reduces max time spent with irq disabled, which of
> course it does -- after all it toggles it for every CPU.
> 
> Per my other e-mail in the thread the irq + lock trips remain not cheap
> at least on Sapphire Rapids.
> 
> In my testing outlined below I see 11% increase in total execution time
> with the irq + lock trip for every CPU in a 24-way vm.
> 
> So I stand by instead doing this every n CPUs, call it 8 or whatever.
> 
> How to repro:
> 
> I employed a poor-man's profiler like so:
> 
> bpftrace -e 'kprobe:cgroup_rstat_flush_locked { @start[tid] = nsecs; } kretprobe:cgroup_rstat_flush_locked /@start[tid]/ { print(nsecs - @start[tid]); delete(@start[tid]); } interval:s:60 { exit(); }'
> 
> This patch or not, execution time varies wildly even while the box is idle.
> 
> The above runs for a minute, collecting 23 samples (you may get
> "lucky" and get one extra, in that case remove it for comparison).
> 
> A sysctl was added to toggle the new behavior vs old one. Patch at the
> end.
> 
> "enabled"(1) means new behavior, "disabled"(0) means the old one.
> 
> Sum of nsecs (results piped to: awk '{ sum += $1 } END { print sum }'):
> disabled:	903610
> enabled:	1006833 (+11.4%)

IIUC this calculates the amount of elapsed time between start and
finish, not necessarily the function's own execution time. Is it
possible that the increase in time is due to more interrupts arriving
during the function execution (which is what we want), rather than more
time being spent on disabling/enabling IRQs?
Mateusz Guzik March 27, 2025, 5:47 p.m. UTC | #20
On Thu, Mar 27, 2025 at 6:17 PM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
>
> On Thu, Mar 27, 2025 at 03:38:50PM +0100, Mateusz Guzik wrote:
> > On Wed, Mar 19, 2025 at 05:18:05PM +0000, Yosry Ahmed wrote:
> > > On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote:
> > > > Is not this going a little too far?
> > > >
> > > > the lock + irq trip is quite expensive in its own right and now is
> > > > going to be paid for each cpu, as in the total time spent executing
> > > > cgroup_rstat_flush_locked is going to go up.
> > > >
> > > > Would your problem go away toggling this every -- say -- 8 cpus?
> > >
> > > I was concerned about this too, and about more lock bouncing, but the
> > > testing suggests that this actually overall improves the latency of
> > > cgroup_rstat_flush_locked() (at least on tested HW).
> > >
> > > So I don't think we need to do something like this unless a regression
> > > is observed.
> > >
> >
> > To my reading it reduces max time spent with irq disabled, which of
> > course it does -- after all it toggles it for every CPU.
> >
> > Per my other e-mail in the thread the irq + lock trips remain not cheap
> > at least on Sapphire Rapids.
> >
> > In my testing outlined below I see 11% increase in total execution time
> > with the irq + lock trip for every CPU in a 24-way vm.
> >
> > So I stand by instead doing this every n CPUs, call it 8 or whatever.
> >
> > How to repro:
> >
> > I employed a poor-man's profiler like so:
> >
> > bpftrace -e 'kprobe:cgroup_rstat_flush_locked { @start[tid] = nsecs; } kretprobe:cgroup_rstat_flush_locked /@start[tid]/ { print(nsecs - @start[tid]); delete(@start[tid]); } interval:s:60 { exit(); }'
> >
> > This patch or not, execution time varies wildly even while the box is idle.
> >
> > The above runs for a minute, collecting 23 samples (you may get
> > "lucky" and get one extra, in that case remove it for comparison).
> >
> > A sysctl was added to toggle the new behavior vs old one. Patch at the
> > end.
> >
> > "enabled"(1) means new behavior, "disabled"(0) means the old one.
> >
> > Sum of nsecs (results piped to: awk '{ sum += $1 } END { print sum }'):
> > disabled:     903610
> > enabled:      1006833 (+11.4%)
>
> IIUC this calculates the amount of elapsed time between start and
> finish, not necessarily the function's own execution time. Is it
> possible that the increase in time is due to more interrupts arriving
> during the function execution (which is what we want), rather than more
> time being spent on disabling/enabling IRQs?

I can agree that irq handlers have more opportunities to execute in the
toggling case and that the time accounted in the way above will
include them. I don't think that explains it, but why not, let's test
in a way that avoids this problem.

I feel compelled to note atomics on x86-64 were expensive for as long
as the architecture was around so I'm confused what's up with the
resistance to the notion that they remain costly even with modern
uarchs. If anything, imo claims that they are cheap require strong
evidence.

That said, I modified the patch to add a section which issues a
conditional relock if needed and an smp_mb() otherwise -- irqs remain
disabled, but we are still paying for a full fence. smp_mb() is a lock
add $0 on the stack pointer. Note this has less work to do than what
was added in your patch.

It looks like this:
switch (READ_ONCE(magic_tunable)) {
case 1:
        /* "enabled": unconditional unlock/relock for every CPU */
        __cgroup_rstat_unlock(cgrp, cpu);
        if (!cond_resched())
                cpu_relax();
        __cgroup_rstat_lock(cgrp, cpu);
        break;
case 2:
        /* old yield-on-contention behavior, plus a full fence when not relocking */
        if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
                __cgroup_rstat_unlock(cgrp, cpu);
                if (!cond_resched())
                        cpu_relax();
                __cgroup_rstat_lock(cgrp, cpu);
        } else {
                smp_mb();
        }
        break;
default:
        /* "disabled": stock behavior, relock only on contention or pending resched */
        if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
                __cgroup_rstat_unlock(cgrp, cpu);
                if (!cond_resched())
                        cpu_relax();
                __cgroup_rstat_lock(cgrp, cpu);
        }
        break;
}

With this in place I'm seeing about a 4% increase in execution time
measured the same way, so irq handlers sneaking in don't explain it.
Note smp_mb() alone is a smaller cost than the locked instruction +
func calls + irq trips. I should also note I'm running this in a VM
(24-way), where paravirt spinlocks also issue a lock-prefixed
instruction to release the lock. I would say this very much justifies
the original claim of 11% with the patch as proposed.
Michal Koutný April 1, 2025, 3 p.m. UTC | #21
Hello Mateusz.

On Thu, Mar 27, 2025 at 06:47:56PM +0100, Mateusz Guzik <mjguzik@gmail.com> wrote:
> I feel compelled to note atomics on x86-64 were expensive for as long
> as the architecture was around so I'm confused what's up with the
> resistance to the notion that they remain costly even with modern
> uarchs. If anything, imo claims that they are cheap require strong
> evidence.

I don't think there's strong resistance; your measurements show that it's not
negligible under the given conditions.

The question is -- how much benefit would flushers see in practice from
coalescing the unlock/lock pairs. The current approach of releasing for
each CPU is simple and benefits the latency of irq-dependent work.

If you see practical issues with the limited throughputs of stat readers
(or flushers in general) because of this, please send a patch for
discussion that resolves it while preserving (some of) the irq freedom.

Also there is ongoing work of splitting up flushing per controller --
I'd like to see whether the given locks become "small" enough to require
no _irq exclusion at all during flushing.

Michal
Mateusz Guzik April 1, 2025, 3:46 p.m. UTC | #22
On Tue, Apr 1, 2025 at 5:00 PM Michal Koutný <mkoutny@suse.com> wrote:
> On Thu, Mar 27, 2025 at 06:47:56PM +0100, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > I feel compelled to note atomics on x86-64 were expensive for as long
> > as the architecture was around so I'm confused what's up with the
> > resistance to the notion that they remain costly even with modern
> > uarchs. If anything, imo claims that they are cheap require strong
> > evidence.
>
> I don't think there's strong resistance; your measurements show that it's not
> negligible under the given conditions.
>
> The question is -- how much benefit would flushers see in practice from
> coalescing the unlock/lock pairs. The current approach of releasing for
> each CPU is simple and benefits the latency of irq-dependent work.
>

Toggling every n cpus instead of every single time is trivial to do.

I'm trying to avoid sending a patch in hopes of not getting CC'ed for
unrelated stuff later.

> If you see practical issues with the limited throughputs of stat readers
> (or flushers in general) because of this, please send a patch for
> discussion that resolves it while preserving (some of) the irq freedom.
>

This is some background maintenance work and it should do what's
feasible to not eat CPU.

The stock loop was behaving poorly in the face of a high CPU count and it
makes excellent sense to make it toggle the lock in *some* capacity.

I just don't believe going from 400+ CPUs straight to literally 1
every time is warranted. It seems the author felt justified by the
notion that it does not add overhead on contemporary hardware, but per
your own e-mail I demonstrated this does not hold.

Is this really going to suffer from toggling every 8 CPUs? That's a 50x
reduction.

I would not be mailing here if the change were hard to do, but it
really is not. It's literally a counter in a loop getting checked.

> Also there is ongoing work of splitting up flushing per controller --
> I'd like to see whether the given locks become "small" enough to require
> no _irq exclusion at all during flushing.

the temp changes like this tend to stay for a long time.

tl;dr I don't believe going straight from 400 to 1 was properly
justified and I demonstrated how it hurts on a rather modest box. At
the same time a (likely) more than sufficient improvement over the stock
state can be trivially achieved while adding only a small fraction of
the overhead.

that said, there is bigger fish to fry elsewhere and I have no stake
in this code, so I'm not going to mail any further about this.
Michal Koutný April 1, 2025, 4:59 p.m. UTC | #23
On Tue, Apr 01, 2025 at 05:46:41PM +0200, Mateusz Guzik <mjguzik@gmail.com> wrote:
> Is this really going to suffer from toggling every 8 CPUs? That's a 50x
> reduction.

As per the original patch, there's a ~10x saving in the max time spent
holding the lock with irqs off. Naïvely, aggregating the flushing into
batches of 8 CPUs could shrink that to only a (10/8) ~1.25x saving, since
the irqs-off window grows with the batch size.
(OTOH, it's not a 400x effect, so the saving isn't fully explained by the
relock frequency alone; not all CPUs cost the same to flush.) With this
information I can imagine the balanced value would be around 20 CPUs
(sqrt(400)).
But the issue is that it could just as well be 4 or 32 or 8. Starting with
1 is the simplest approach without introducing magic constants or
heuristics.


> the temp changes like this tend to stay for a long time.

That'd mean that no one notices the performance impact there :-)
It can be easily changed later too.

> that said, there is bigger fish to fry elsewhere and I have no stake
> in this code, so I'm not going to mail any further about this.

Thank you for spending your effort on this, it's useful reference for
the future!

Michal
diff mbox series

Patch

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index aac91466279f..976c24b3671a 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -323,13 +323,11 @@  static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 			rcu_read_unlock();
 		}
 
-		/* play nice and yield if necessary */
-		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
-			__cgroup_rstat_unlock(cgrp, cpu);
-			if (!cond_resched())
-				cpu_relax();
-			__cgroup_rstat_lock(cgrp, cpu);
-		}
+		/* play nice and avoid disabling interrupts for a long time */
+		__cgroup_rstat_unlock(cgrp, cpu);
+		if (!cond_resched())
+			cpu_relax();
+		__cgroup_rstat_lock(cgrp, cpu);
 	}
 }