[v6] sched: Consolidate cpufreq updates

Improve the interaction with cpufreq governors by making the
cpufreq_update_util() calls more intentional.

At the moment we send them when load is updated for CFS, when bandwidth
changes for DL, and at enqueue/dequeue for RT. But this can lead to too
many updates being sent in a short period of time, and to some of them
being ignored at a critical moment due to the rate_limit_us in
schedutil.

For example, consider two tasks enqueued simultaneously on a CPU where
the 2nd task is bigger and requires a higher freq. The call to
cpufreq_update_util() triggered by the first task will cause the 2nd
request to be dropped until the next tick, or until another CPU in the
same policy triggers a freq update shortly after.
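
The dropped request can be illustrated with a small user-space model.
This is only a sketch: struct rl_policy and rl_request() are hypothetical
names invented for the example and do not exist in schedutil. It assumes
a request is accepted only once rate_limit_us has elapsed since the last
accepted one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; not the schedutil implementation. */
struct rl_policy {
	uint64_t last_freq_update_ns;	/* time of the last accepted request */
	uint64_t rate_limit_ns;		/* models rate_limit_us, in ns */
	unsigned int cur_freq;		/* frequency currently applied */
};

/* Accept a request only if the rate limit window has elapsed. */
static bool rl_request(struct rl_policy *p, uint64_t now_ns, unsigned int freq)
{
	if (now_ns - p->last_freq_update_ns < p->rate_limit_ns)
		return false;	/* dropped: too soon after the previous update */

	p->cur_freq = freq;
	p->last_freq_update_ns = now_ns;
	return true;
}
```

With a 2ms rate limit, a request from the first task is accepted and
a bigger request arriving 10us later is rejected, so the CPU stays at
the lower frequency until a later request gets through.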

Updates at enqueue for RT are not strictly required, though they do help
to reduce the delay in switching the frequency and the potential
observation of a lower frequency during this delay. But the current
logic doesn't intentionally (at least to my understanding) try to speed
up the request.

To help reduce the number of cpufreq updates and make them more
purposeful, consolidate them into these locations:

1. context_switch()
2. task_tick_fair()
3. update_blocked_averages()
4. on syscall that changes policy or uclamp values
5. on check_preempt_wakeup_fair() if wakeup preemption failed

The update at context switch should help guarantee that DL and RT get
the right frequency straightaway when they're RUNNING. As mentioned,
the update will happen slightly after enqueue_task(); but in an ideal
world these tasks should be RUNNING ASAP and this additional delay
should be negligible. For fair tasks we need to make sure we send
a single update for every decay for the root cfs_rq. Any changes to the
rq will be deferred until the next task is ready to run, or we hit TICK.
But we are guaranteed the task is running at a level that meets its
requirements after enqueue.

To guarantee that updates for RT and DL tasks are never missed, we add
a new SCHED_CPUFREQ_FORCE_UPDATE flag to ignore the rate_limit_us. If we
are already running at the right freq, the governor will end up doing
nothing, but we eliminate the risk of the task ending up accidentally
running at the wrong freq due to rate_limit_us.

Similarly for iowait boost, we ignore rate limits. We also handle the
case of the boost being reset prematurely by adding a guard in
sugov_iowait_apply() so that the boost is only reduced after 1ms. It
seems the iowait boost mechanism relied on rate_limit_us and
cfs_rq.decay preventing any updates from happening soon after an iowait
boost.
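
A rough user-space sketch of such a guard follows. The names iow_cpu,
iow_apply(), MODEL_NSEC_PER_MSEC and MODEL_BOOST_MIN are made up for
illustration, and the floor value of 128 is an assumption. The actual
change in sugov_iowait_apply() may be structured differently.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NSEC_PER_MSEC	1000000ULL
#define MODEL_BOOST_MIN		128U	/* assumed floor below which boost is cleared */

/* Hypothetical per-CPU iowait boost state for this model only. */
struct iow_cpu {
	unsigned int boost;		/* current iowait boost value */
	bool boost_pending;		/* a new iowait wakeup was seen */
	uint64_t last_update_ns;	/* time the boost was last raised */
};

/* Return the boost to apply, decaying it only when the guard allows. */
static unsigned int iow_apply(struct iow_cpu *c, uint64_t now_ns)
{
	if (!c->boost)
		return 0;

	if (!c->boost_pending &&
	    now_ns - c->last_update_ns >= MODEL_NSEC_PER_MSEC) {
		/* No new iowait request and 1ms has passed: halve the boost. */
		c->boost >>= 1;
		if (c->boost < MODEL_BOOST_MIN)
			c->boost = 0;
	}

	c->boost_pending = false;
	return c->boost;
}
```

In this model a call that arrives less than 1ms after the boost was
raised leaves the boost untouched, even when no new iowait request is
pending.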

The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit
time stamps, otherwise we can end up delaying updates for normal
requests.
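
As a sketch of the intended behaviour, a forced request bypasses the
rate limit check but leaves the time stamp alone. Everything here
(struct fu_policy, fu_request(), MODEL_FORCE_UPDATE) is a hypothetical
user-space model written for this description, not code from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_FORCE_UPDATE	(1U << 0)	/* stands in for SCHED_CPUFREQ_FORCE_UPDATE */

/* Hypothetical policy state for this model only. */
struct fu_policy {
	uint64_t last_freq_update_ns;	/* rate limit time stamp */
	uint64_t rate_limit_ns;		/* models rate_limit_us, in ns */
	unsigned int cur_freq;		/* frequency currently applied */
};

/* Return true if the frequency was changed by this request. */
static bool fu_request(struct fu_policy *p, uint64_t now_ns,
		       unsigned int freq, unsigned int flags)
{
	bool forced = flags & MODEL_FORCE_UPDATE;

	/* Normal requests are subject to the rate limit. */
	if (!forced && now_ns - p->last_freq_update_ns < p->rate_limit_ns)
		return false;

	/* Already at the right freq: nothing to do. */
	if (freq == p->cur_freq)
		return false;

	p->cur_freq = freq;

	/* Forced updates must not move the rate limit time stamp. */
	if (!forced)
		p->last_freq_update_ns = now_ns;

	return true;
}
```

Keeping the time stamp untouched on the forced path means a burst of
RT or DL switches cannot keep pushing the window forward for fair
requests.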

As a simple optimization, we avoid sending cpufreq updates when
switching from RT to another RT as RT tasks run at max freq by default.
If CONFIG_UCLAMP_TASK is enabled, we can do a simple check to see if
uclamp_min is different. Most RT tasks are likely to be running at the
same performance level, so this avoids the overhead of forced updates
when there's nothing to do.

We also make sure to ignore cpufreq updates for sugov workers at context
switch. It doesn't make sense for the kworker that applies the frequency
update (which is a DL task) to trigger a frequency update itself.
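
The context switch rules described so far can be summarised in a small
decision function. This is a simplified user-space model with
hypothetical types (model_task, model_action) and a hypothetical
ctx_switch_action() helper. It is not the code added to core.c, and it
ignores details such as CONFIG_UCLAMP_TASK being disabled.

```c
#include <stdbool.h>

enum model_class { MODEL_FAIR, MODEL_RT, MODEL_DL };

/* Hypothetical task description for this model only. */
struct model_task {
	enum model_class cls;
	unsigned int uclamp_min;	/* requested minimum performance level */
	bool sugov_worker;		/* the DL kworker that applies freq changes */
};

enum model_action { MODEL_SKIP, MODEL_UPDATE, MODEL_FORCE };

/* Decide what to do when switching from prev to next. */
static enum model_action ctx_switch_action(const struct model_task *prev,
					   const struct model_task *next,
					   bool cfs_decayed)
{
	/* The sugov worker must not trigger an update itself. */
	if (next->sugov_worker)
		return MODEL_SKIP;

	/* Fair: a single update for every decay of the root cfs_rq. */
	if (next->cls == MODEL_FAIR)
		return cfs_decayed ? MODEL_UPDATE : MODEL_SKIP;

	/* RT to RT at the same performance level: nothing to do. */
	if (prev->cls == MODEL_RT && next->cls == MODEL_RT &&
	    prev->uclamp_min == next->uclamp_min)
		return MODEL_SKIP;

	/* RT and DL must never be rate limited. */
	return MODEL_FORCE;
}
```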

The update at task_tick_fair() will guarantee that the governor will
follow any updates to load for tasks/CPU or due to new enqueues/dequeues
to the rq. Since DL and RT always run at constant frequencies and have
no load tracking, this is only required for fair tasks.

The update at update_blocked_averages() will ensure we decay the
frequency once the CPU has been idle for long enough.

If the currently running task changes its policy or uclamp values, we
follow up with a cpufreq update to account for any potential new perf
requirements resulting from the change.

To handle systems with long TICK where tasks could end up enqueued but
no preemption happens until TICK, we add an update in
check_preempt_wakeup_fair() if wakeup preemption fails. This will send
a special SCHED_CPUFREQ_TASK_ENQUEUED cpufreq update to tell the
governor that the state of the CPU has changed and it can consider an
update if it deems it worthwhile. In schedutil this will do an update if
no update was done within the last sysctl_sched_base_slice, which is our
ideal slice length for context switch.
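
The schedutil side of this check can be pictured as a simple time
comparison. The sketch below is a hypothetical user-space model
(te_policy and te_should_update() are invented names), assuming the
governor compares the time since its last update against the base
slice.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical policy state for this model only. */
struct te_policy {
	uint64_t last_freq_update_ns;	/* time of the last frequency update */
};

/*
 * Model of handling a TASK_ENQUEUED style request: only consider an
 * update if none was done within the last base slice.
 */
static bool te_should_update(const struct te_policy *p, uint64_t now_ns,
			     uint64_t base_slice_ns)
{
	return now_ns - p->last_freq_update_ns >= base_slice_ns;
}
```

With a base slice of 3ms (a value picked only for the example), an
enqueue 1ms after the last update is ignored while one 3ms or more
after it is considered.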

Since DL tasks now always ignore the rate limit, remove the
ignore_dl_rate_limit() function as it's no longer necessary.

Also move the update of sg_cpu->last_update inside sugov_iowait_boost(),
which is the code this variable is associated with.

Results of

	taskset 1 perf stat --repeat 10 -e cycles,instructions,task-clock perf bench sched pipe

on an AMD 3900X, to check for potential overhead from the addition at
context switch, against the v6.8.7 stable kernel:

v6.8.7: schedutil:
------------------

 Performance counter stats for 'perf bench sched pipe' (10 runs):

       814,886,355      cycles:u                  #    0.073 GHz                      ( +-  1.79% )
        82,724,139      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
         11,112.19 msec task-clock:u              #    0.996 CPUs utilized            ( +-  0.18% )

           11.1575 +- 0.0207 seconds time elapsed  ( +-  0.19% )

v6.8.7: performance:
--------------------

 Performance counter stats for 'perf bench sched pipe' (10 runs):

       731,701,038      cycles:u                  #    0.067 GHz                      ( +-  2.27% )
        82,724,255      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
         10,830.95 msec task-clock:u              #    0.992 CPUs utilized            ( +-  0.14% )

           10.9172 +- 0.0150 seconds time elapsed  ( +-  0.14% )

v6.8.7+patch: schedutil:
------------------------

 Performance counter stats for 'perf bench sched pipe' (10 runs):

       814,294,812      cycles:u                  #    0.073 GHz                      ( +-  1.45% )
        82,724,229      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
         11,109.70 msec task-clock:u              #    1.000 CPUs utilized            ( +-  0.11% )

           11.1131 +- 0.0125 seconds time elapsed  ( +-  0.11% )

v6.8.7+patch: performance:
--------------------------

 Performance counter stats for 'perf bench sched pipe' (10 runs):

       849,621,311      cycles:u                  #    0.077 GHz                      ( +-  0.50% )
        82,724,306      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
         11,031.10 msec task-clock:u              #    0.996 CPUs utilized            ( +-  0.14% )

           11.0716 +- 0.0149 seconds time elapsed  ( +-  0.13% )

With the performance governor we seem to be doing slightly worse
(elapsed time is roughly 1.4% longer). A perf diff of the profiles
recorded with

	taskset 1 perf record perf bench sched pipe

before and after the patch shows:

    28.95%     -3.66%  [kernel.kallsyms]     [k] delay_halt_mwaitx
     1.49%     +1.08%  [kernel.kallsyms]     [k] update_load_avg
     1.14%     +0.86%  [kernel.kallsyms]     [k] native_sched_clock
     0.56%     +0.56%  [kernel.kallsyms]     [k] sched_clock_cpu
     7.34%     -0.45%  [kernel.kallsyms]     [k] native_write_msr
     8.54%     -0.42%  [kernel.kallsyms]     [k] native_read_msr
     1.07%     +0.34%  [kernel.kallsyms]     [k] pick_next_task_fair
     2.46%     -0.30%  [kernel.kallsyms]     [k] x86_pmu_disable_all
     1.73%     -0.28%  [kernel.kallsyms]     [k] amd_pmu_check_overflow
     0.36%     +0.27%  [kernel.kallsyms]     [k] sched_clock
     1.05%     -0.25%  [kernel.kallsyms]     [k] try_to_wake_up
     0.70%     -0.24%  [kernel.kallsyms]     [k] enqueue_task_fair
     0.88%     +0.20%  [kernel.kallsyms]     [k] pipe_read
     0.54%     +0.19%  [kernel.kallsyms]     [k] dequeue_entity
     0.37%     +0.17%  [kernel.kallsyms]     [k] aa_file_perm
     0.66%     +0.16%  [kernel.kallsyms]     [k] vfs_write
     0.11%     +0.15%  [kernel.kallsyms]     [k] __x64_sys_write
     0.35%     +0.15%  libc.so.6             [.] __GI___libc_write
     1.32%     +0.15%  [kernel.kallsyms]     [k] update_curr
     1.22%     +0.14%  [kernel.kallsyms]     [k] psi_task_switch
     0.58%     -0.14%  [kernel.kallsyms]     [k] enqueue_entity
     0.83%     +0.14%  perf                  [.] worker_thread
     0.63%     +0.13%  [kernel.kallsyms]     [k] dequeue_task_fair
     1.42%     -0.13%  [kernel.kallsyms]     [k] delay_halt
     0.24%     +0.13%  [kernel.kallsyms]     [k] __wake_up_common

It seems update_load_avg() is slightly worse. Earlier versions of the
patch didn't produce such a difference. I am not sure if this is
a problem or can just be attributed to minor binary differences and
caching effects.

It is worth noting that we still have the following race condition on
systems that have shared policy:

* CPUs with shared policy can end up sending simultaneous cpufreq
  update requests where the 2nd one will be unlucky and get blocked by
  the rate_limit_us (schedutil).

We can potentially address this limitation later, but it is out of the
scope of this patch.

Signed-off-by: Qais Yousef <qyousef@layalina.io>
---

Changes since v5:

	* Fix a bug where switching between RT and sugov tasks triggered an
	  endless cycle of cpufreq updates.
	* Only do cpufreq updates at tick for fair after verifying
	  rq->cfs.decayed
	* Remove the optimization in update_load_avg() that avoided sending an
	  update if util hasn't changed, as it caused a bug when switching
	  from idle
	* Handle systems with long ticks by adding extra update on
	  check_preempt_wakeup_fair(). The idea is to rely on context switch
	  but still consider an update if wakeup preemption failed and no
	  update was sent since sysctl_sched_base_slice
	* Remove ignore_dl_rate_limit() as this function is now redundant
	* Move sg_cpu->last_update = time inside sugov_iowait_boost()
	* Update commit message with new details and with perf diff output

Changes since v4:

	* Fix updating freq when uclamp changes before the dequeue/enqueue
	  dance. (Hongyan)
	* Rebased on top of tip/sched/core 6.10-rc1 and resolved some conflicts
	  due to code shuffling to syscalls.c. Added new function
	  update_cpufreq_current() to be used outside core.c when
	  task_current() requires cpufreq update.

Changes since v3:

	* Omit cpufreq updates at attach/detach_entity_load_avg(). They share
	  the update path from enqueue/dequeue which is not intended to trigger
	  an update. And task_change_group_fair() is not expected to cause the
	  root cfs_rq util to change significantly to warrant an immediate
	  update for enqueued tasks. Better defer for next context switch to
	  sample the state of the cpu taking all changes into account before
	  the next task is due to run.
	  Dietmar also pointed out a bug where we could send more updates than
	  without the patch in this path as I wasn't sending the update for
	  cfs_rq == &rq->cfs.

Changes since v2:

	* Clean up update_cpufreq_ctx_switch() to reduce branches (Peter)
	* Fix issue with cpufreq updates missed on switching from idle (Vincent)
	* perf bench sched pipe regressed after fixing the switch from idle,
	  detect when util_avg has changed when cfs_rq->decayed to fix it
	* Ensure to issue cpufreq updates when task_current() switches
	  policy/uclamp values

Changes since v1:

	* Use taskset and measure with performance governor as Ingo suggested
	* Remove the static key as I found out we always register a function
	  for cpu_dbs in cpufreq_governor.c; and as Christian pointed out it
	  triggers a lock debug warning.
	* Improve detection of sugov workers by using SCHED_FLAG_SUGOV
	* Guard against NSEC_PER_MSEC instead of TICK_USEC to avoid prematurely
	  reducing iowait boost as the latter was a NOP, just like
	  sugov_iowait_reset(), as Christian pointed out.

v1 discussion: https://lore.kernel.org/all/20240324020139.1032473-1-qyousef@layalina.io/
v2 discussion: https://lore.kernel.org/lkml/20240505233103.168766-1-qyousef@layalina.io/
v3 discussion: https://lore.kernel.org/lkml/20240512190018.531820-1-qyousef@layalina.io/
v4 discussion: https://lore.kernel.org/lkml/20240516204802.846520-1-qyousef@layalina.io/
v5 discussion: https://lore.kernel.org/lkml/20240530104653.1234004-1-qyousef@layalina.io/

 include/linux/sched/cpufreq.h    |   4 +-
 kernel/sched/core.c              | 107 +++++++++++++++++++++++++++--
 kernel/sched/cpufreq_schedutil.c | 111 ++++++++++++++++++++-----------
 kernel/sched/deadline.c          |   4 --
 kernel/sched/fair.c              |  79 ++++++++--------------
 kernel/sched/rt.c                |   8 +--
 kernel/sched/sched.h             |   9 ++-
 kernel/sched/syscalls.c          |  26 ++++++--
 8 files changed, 232 insertions(+), 116 deletions(-)

Message ID	20240619201409.2071728-1-qyousef@layalina.io (mailing list archive)
State	New
From: Qais Yousef <qyousef@layalina.io>
To: "Rafael J. Wysocki" <rafael@kernel.org>, Viresh Kumar <viresh.kumar@linaro.org>, Ingo Molnar <mingo@kernel.org>, Peter Zijlstra <peterz@infradead.org>, Vincent Guittot <vincent.guittot@linaro.org>, Juri Lelli <juri.lelli@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>, Dietmar Eggemann <dietmar.eggemann@arm.com>, Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>, Daniel Bristot de Oliveira <bristot@redhat.com>, Valentin Schneider <vschneid@redhat.com>, Christian Loehle <christian.loehle@arm.com>, Hongyan Xia <hongyan.xia2@arm.com>, John Stultz <jstultz@google.com>, linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org, Qais Yousef <qyousef@layalina.io>
Subject: [PATCH v6] sched: Consolidate cpufreq updates
Date: Wed, 19 Jun 2024 21:14:09 +0100