[v2,0/2] Exposing nice CPU usage to userspace


Message ID

20240830141939.723729-1-joshua.hahnjy@gmail.com (mailing list archive)

Headers

From: Joshua Hahn <joshua.hahnjy@gmail.com>
To: tj@kernel.org
Cc: cgroups@vger.kernel.org,
	hannes@cmpxchg.org,
	linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org,
	lizefan.x@bytedance.com,
	mkoutny@suse.com,
	shuah@kernel.org
Subject: [PATCH v2 0/2] Exposing nice CPU usage to userspace 
Date: Fri, 30 Aug 2024 07:19:37 -0700
Message-ID: <20240830141939.723729-1-joshua.hahnjy@gmail.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

Exposing nice CPU usage to userspace

Message

Joshua Hahn Aug. 30, 2024, 2:19 p.m. UTC

From: Joshua Hahn <joshua.hahn6@gmail.com>

v1 -> v2: Edited commit messages for clarity.

Niced CPU usage is a metric reported in host-level /proc/stat, but is
not reported in cgroup-level statistics in cpu.stat. However, when a
host contains multiple tasks across different workloads, it becomes
difficult to gauge how much CPU time is being spent on niced
processes based on /proc/stat alone, since host-level metrics do not
provide this cgroup-level granularity.

Exposing this metric will allow users to accurately probe the niced CPU
metric for each workload, and make more informed decisions when
directing higher priority tasks.
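
For illustration, a minimal sketch of how userspace could read the two
sources. The layout of the aggregate "cpu" line in /proc/stat (user, nice,
system, ... in USER_HZ ticks) is the standard one. The "nice_usec" key used
in the usage note is an assumption about what this series would expose in
cpu.stat, and both helper names are hypothetical.

```python
def host_nice_seconds(proc_stat_text, ticks_per_second):
    """Return host-wide niced user time in seconds from /proc/stat contents."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        # The aggregate line reads "cpu user nice system idle ..." in ticks.
        if fields and fields[0] == "cpu":
            return int(fields[2]) / ticks_per_second
    raise ValueError("no aggregate 'cpu' line found")


def parse_cpu_stat(cpu_stat_text):
    """Parse a flat-keyed cgroup file such as cpu.stat into a dict of ints."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        # Flat-keyed files hold one "key value" pair per line.
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats
```

On Linux the tick rate can be obtained with os.sysconf("SC_CLK_TCK"). With
this series applied, parse_cpu_stat(text).get("nice_usec") would presumably
return the per-cgroup value, and None on kernels that lack it.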

Joshua Hahn (2):
  Tracking cgroup-level niced CPU time
  Selftests for niced CPU statistics

 include/linux/cgroup-defs.h               |  1 +
 kernel/cgroup/rstat.c                     | 16 ++++-
 tools/testing/selftests/cgroup/test_cpu.c | 72 +++++++++++++++++++++++
 3 files changed, 86 insertions(+), 3 deletions(-)

Comments

Michal Koutný Sept. 2, 2024, 3:45 p.m. UTC | #1

Hello Joshua.

On Fri, Aug 30, 2024 at 07:19:37AM GMT, Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> Exposing this metric will allow users to accurately probe the niced CPU
> metric for each workload, and make more informed decisions when
> directing higher priority tasks.

I'm afraid I still can't appreciate exposing this value:

- It makes (some) sense only on leaf cgroups (where variously nice'd
  tasks are competing against each other). Not so much on inner node
  cgroups (where it's a mere sum but sibling cgroups could have different
  weights, so the absolute times would contribute differently).

- When all tasks have nice > 0 (or nice <= 0), it loses any information
  it could have had.

(Thus I don't know whether to commit to exposing that value via cgroups.)
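
A small sketch of the second remark, assuming the metric counts user time
of tasks with nice > 0 (as the host-level one does) and that each task keeps
a constant nice value. The task lists and function names are made up for
illustration.

```python
def niced_user_time(tasks):
    """Sum user time over tasks whose nice value is > 0.

    tasks: iterable of (nice, user_seconds) pairs.
    """
    return sum(seconds for nice, seconds in tasks if nice > 0)


def total_user_time(tasks):
    """Sum user time over all tasks regardless of nice value."""
    return sum(seconds for _, seconds in tasks)


# Mixed priorities: the metric separates niced from non-niced time.
mixed = [(0, 4.0), (5, 1.0), (19, 3.0)]

# Every task niced: the metric equals total user time in both cases,
# so it cannot tell a mild nice=1 workload from a widely spread one.
all_niced_mild = [(1, 4.0), (1, 1.0), (1, 3.0)]
all_niced_spread = [(1, 4.0), (10, 1.0), (19, 3.0)]
```

Here niced_user_time(mixed) is 4.0 out of 8.0, while both all-niced lists
give 8.0 out of 8.0.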

I wonder, wouldn't your use case be equally served by some
post-processing [1] of /sys/kernel/debug/sched/debug info which is
already available?

Regards,
Michal

[1]
Your metric is supposed to represent
	Σ_i^tasks ∫_t is_nice(i, t) dt

If I try to address the second remark by looking at
	Σ_i^tasks ∫_t nice(i, t) dt

and that resembles (nice=0 ~ weight=1024)
	Σ_i^tasks ∫_t weight(i, t) dt

swapping the sum and the integral gives
	∫_t Σ_i^tasks weight(i, t) dt

where
	Σ_i^tasks weight(i, t) 

can be taken from
	/sys/kernel/debug/sched/debug:cfs_rq[0].load_avg

The above is only for CPU nr=0, so processing would mean sampling that
file over all CPUs and over time.
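
The sampling described in this footnote could be approximated with a left
Riemann sum. This is only a sketch: the sample format (a timestamp plus one
load_avg value per CPU, already extracted from the debug file by a parser
not shown here) and the function name are assumptions.

```python
def integrate_load(samples):
    """Approximate the integral over time of the per-CPU load sum.

    samples: list of (t_seconds, [load_avg per CPU]) sorted by time.
    Each sample's summed load is held constant until the next sample.
    """
    total = 0.0
    for (t0, loads), (t1, _) in zip(samples, samples[1:]):
        total += sum(loads) * (t1 - t0)
    return total
```

For example, integrate_load([(0.0, [1024, 2048]), (1.0, [1024, 0]),
(3.0, [0, 0])]) adds 3072 * 1 and 1024 * 2. Accuracy depends entirely on the
sampling interval.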

Tejun Heo Sept. 10, 2024, 9:01 p.m. UTC | #2

Hello, Michal.

On Mon, Sep 02, 2024 at 05:45:39PM +0200, Michal Koutný wrote:
> - It makes (some) sense only on leaf cgroups (where variously nice'd
>   tasks are competing against each other). Not so much on inner node
>   cgroups (where it's a mere sum but sibling cgroups could have different
>   weights, so the absolute times would contribute differently).
>
> - When all tasks have nice > 0 (or nice <= 0), it loses any information
>   it could have had.

I think it's as useful as the system-wide nice metric is. It's not a versatile
metric but is widely available and understood and people use it. Maybe a
workload is split across a sub-hierarchy and they wanna collect how much
lowpri threads are consuming. cpu.stat is available without cpu control
being enabled and people use it as a way to just aggregate metrics across a
portion of the system.
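
As a rough sketch of that kind of collection: since cpu.stat is hierarchical,
two snapshots taken at the root of a sub-hierarchy are enough to compute the
niced share over an interval. "usage_usec" is an existing cpu.stat key;
"nice_usec" is an assumed name for the proposed one, and the function name is
hypothetical.

```python
def niced_share(before, after, nice_key="nice_usec"):
    """Fraction of CPU time between two cpu.stat snapshots spent niced.

    before/after: dicts of cpu.stat keys to integer microseconds.
    Returns None if the nice key is missing or no CPU time was used.
    """
    if nice_key not in before or nice_key not in after:
        return None
    usage = after["usage_usec"] - before["usage_usec"]
    if usage <= 0:
        return None
    return (after[nice_key] - before[nice_key]) / usage
```

This needs no cpu controller enabled on the subtree, only two reads of the
same file.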

> (Thus I don't know whether to commit to exposing that value via cgroups.)
> 
> I wonder, wouldn't your use case be equally served by some
> post-processing [1] of /sys/kernel/debug/sched/debug info which is
> already available?
...
> above is only for CPU nr=0. So processing would mean sampling that file
> over all CPUs and time.

I think there are benefits to mirroring system wide metrics, at least ones
as widely spread as nice.

Thanks.