[v2,0/2] Exposing nice CPU usage to userspace

Message ID 20240830141939.723729-1-joshua.hahnjy@gmail.com (mailing list archive)

Message

Joshua Hahn Aug. 30, 2024, 2:19 p.m. UTC
From: Joshua Hahn <joshua.hahn6@gmail.com>

v1 -> v2: Edited commit messages for clarity.

Niced CPU usage is a metric reported in host-level /proc/stat, but is
not reported in cgroup-level statistics in cpu.stat. However, when a
host contains multiple tasks across different workloads, it becomes
difficult to gauge how much CPU time is being spent on niced
processes based on /proc/stat alone, since host-level metrics do not
provide this cgroup-level granularity.

Exposing this metric will allow users to accurately probe the niced CPU
metric for each workload, and make more informed decisions when
directing higher priority tasks.
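
[Editor's illustration] For reference, the host-level figure mentioned above is the second value on the aggregate "cpu" line of /proc/stat, counted in USER_HZ ticks. The following is a minimal sketch of reading it, not part of the series; the helper names parse_proc_stat_cpu and nice_share are made up for this example, and dividing by user + nice + system is just one possible choice of denominator.

```python
# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait",
              "irq", "softirq", "steal", "guest", "guest_nice")


def parse_proc_stat_cpu(text):
    """Return the aggregate 'cpu' line of /proc/stat as a dict of tick counts."""
    for line in text.splitlines():
        parts = line.split()
        # "cpu" is the host-wide total; "cpu0", "cpu1", ... are per-CPU rows.
        if parts and parts[0] == "cpu":
            return dict(zip(CPU_FIELDS, map(int, parts[1:])))
    raise ValueError("no aggregate 'cpu' line found")


def nice_share(sample):
    """Niced time as a fraction of user + nice + system time (hypothetical metric)."""
    busy = sample["user"] + sample["nice"] + sample["system"]
    return sample["nice"] / busy if busy else 0.0
```

On a Linux host this could be called as nice_share(parse_proc_stat_cpu(open("/proc/stat").read())). The result is a single host-wide number, which is exactly the limitation the cover letter describes: it cannot be attributed to an individual cgroup.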

Joshua Hahn (2):
  Tracking cgroup-level niced CPU time
  Selftests for niced CPU statistics

 include/linux/cgroup-defs.h               |  1 +
 kernel/cgroup/rstat.c                     | 16 ++++-
 tools/testing/selftests/cgroup/test_cpu.c | 72 +++++++++++++++++++++++
 3 files changed, 86 insertions(+), 3 deletions(-)

Comments

Michal Koutný Sept. 2, 2024, 3:45 p.m. UTC | #1
Hello Joshua.

On Fri, Aug 30, 2024 at 07:19:37AM GMT, Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> Exposing this metric will allow users to accurately probe the niced CPU
> metric for each workload, and make more informed decisions when
> directing higher priority tasks.

I'm afraid I still can't appreciate exposing this value:

- It makes (some) sense only on leaf cgroups (where variously nice'd
  tasks are competing against each other). Not so much on inner node
  cgroups (where it's a mere sum but sibling cgroups could have different
  weights, so the absolute times would contribute differently).

- When all tasks have nice > 0 (or nice <= 0), it loses any information
  it could have had.

(Thus I don't know whether to commit to exposing that value via cgroups.)

I wonder, wouldn't your use case be equally served by some
post-processing [1] of /sys/kernel/debug/sched/debug info which is
already available?

Regards,
Michal

[1]
Your metric is supposed to represent
	Σ_i^tasks ∫_t is_nice(i, t) dt

If I try to address the second remark by looking at
	Σ_i^tasks ∫_t nice(i, t) dt

and that resembles (nice=0 ~ weight=1024)
	Σ_i^tasks ∫_t weight(i, t) dt

swap sum and int
	∫_t Σ_i^tasks weight(i, t) dt

where
	Σ_i^tasks weight(i, t) 

can be taken from
	/sys/kernel/debug/sched/debug:cfs_rq[0].load_avg

above is only for CPU nr=0. So processing would mean sampling that file
over all CPUs and time.
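
[Editor's illustration] The post-processing described in [1] could be sketched roughly as follows: take the root cfs_rq load_avg for every CPU from each snapshot of the debug file, sum over CPUs, and integrate the samples over time. This is only a sketch under assumptions. The text layout it parses (a "cfs_rq[N]:/" header followed by an indented ".load_avg : value" line) is assumed rather than taken from the thread, the function names are hypothetical, and since load_avg is itself a decayed average the result can only approximate the integral above.

```python
import re

# Assumed layout: section headers such as "cfs_rq[0]:/" or "rt_rq[0]:",
# followed by indented "  .name : value" lines.
SECTION_HEADER = re.compile(r"^(\w+)\[(\d+)\]:(\S*)")
LOAD_AVG_LINE = re.compile(r"^\s*\.load_avg\s*:\s*(\d+)")


def root_load_avg_per_cpu(debug_text):
    """Map CPU number -> load_avg of that CPU's root cfs_rq."""
    loads = {}
    current_cpu = None
    for line in debug_text.splitlines():
        header = SECTION_HEADER.match(line)
        if header:
            name, cpu, path = header.groups()
            # Track only the root CFS runqueue; skip child groups and rt/dl.
            is_root_cfs = name == "cfs_rq" and path == "/"
            current_cpu = int(cpu) if is_root_cfs else None
            continue
        value = LOAD_AVG_LINE.match(line)
        if value and current_cpu is not None:
            loads[current_cpu] = int(value.group(1))
    return loads


def integrate_load(samples):
    """Left Riemann sum over a list of (timestamp_seconds, {cpu: load_avg})."""
    total = 0.0
    for (t0, loads), (t1, _) in zip(samples, samples[1:]):
        total += sum(loads.values()) * (t1 - t0)
    return total
```

A collector would read the debug file periodically (debugfs must be mounted and access usually requires root), append (time.monotonic(), root_load_avg_per_cpu(text)) to a list, and call integrate_load on it.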

Tejun Heo Sept. 10, 2024, 9:01 p.m. UTC | #2
Hello, Michal.

On Mon, Sep 02, 2024 at 05:45:39PM +0200, Michal Koutný wrote:
> - It makes (some) sense only on leaf cgroups (where variously nice'd
>   tasks are competing against each other). Not so much on inner node
>   cgroups (where it's a mere sum but sibling cgroups could have different
>   weights, so the absolute times would contribute differently).
>
> - When all tasks have nice > 0 (or nice <= 0), it loses any information
>   it could have had.

I think it's as useful as the system-wide nice metric is. It's not a versatile
metric but is widely available and understood and people use it. Maybe a
workload is split across a sub-hierarchy and they wanna collect how much
lowpri threads are consuming. cpu.stat is available without cpu control
being enabled and people use it as a way to just aggregate metrics across a
portion of the system.
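
[Editor's illustration] A minimal sketch of the kind of collection described above: cpu.stat is a flat "key value" file, so a monitoring script can parse it and, if a niced-time field is present, relate it to usage_usec. The field name nice_usec is a placeholder assumed for this example (the thread does not name the new field), and both function names are hypothetical.

```python
def parse_cpu_stat(text):
    """Parse the 'key value' lines of a cgroup v2 cpu.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip():
            stats[key] = int(value)
    return stats


def niced_fraction(stats, nice_key="nice_usec"):
    """Niced share of total CPU time, or None if it cannot be computed.

    nice_key is an assumed field name; kernels without such a field
    simply yield None here.
    """
    usage = stats.get("usage_usec", 0)
    if nice_key not in stats or usage == 0:
        return None
    return stats[nice_key] / usage
```

Because cpu.stat in cgroup v2 covers a cgroup together with its descendants, reading the file at the top of a sub-hierarchy (for example under /sys/fs/cgroup, assuming the usual mount point) would give the aggregate for a workload split across several child cgroups.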

> (Thus I don't know whether to commit to exposing that value via cgroups.)
> 
> I wonder, wouldn't your use case be equally served by some
> post-processing [1] of /sys/kernel/debug/sched/debug info which is
> already available?
...
> above is only for CPU nr=0. So processing would mean sampling that file
> over all CPUs and time.

I think there are benefits to mirroring system-wide metrics, at least ones
as widespread as nice.

Thanks.