
[RFC,0/7] sched: cpufreq: Remove magic margins

Message ID 20230827233203.1315953-1-qyousef@layalina.io (mailing list archive)

Message

Qais Yousef Aug. 27, 2023, 11:31 p.m. UTC
Since the introduction of EAS and schedutil, we had two magic 0.8 and 1.25
margins applied in fits_capacity() and apply_dvfs_headroom().

As reported two years ago in

	https://lore.kernel.org/lkml/1623855954-6970-1-git-send-email-yt.chang@mediatek.com/

these values are not a good fit for all systems, and people feel the need to
modify them regularly out of tree.

Equally, the recent discussion in the PELT HALFLIFE thread highlighted the need
for a way to tune the system response time to achieve better perf, power and
thermal characteristics for a given system

	https://lore.kernel.org/lkml/20220829055450.1703092-1-dietmar.eggemann@arm.com/

fits_capacity() and apply_dvfs_headroom() are not suitable tunables. The
attempt to use PELT HALFLIFE highlighted that there's room to do better, which
I hope my proposal helps to achieve.

This series attempts to address these issues by first removing the magic
'margins' from those two areas that have proved to be problematic in practice;
at least in the Android world, they're being modified out of tree on a regular
basis.

I attempted to tackle the problem by trying to answer the question: what would
really go wrong if we didn't have these margins or headrooms?

The simplest answer I found is that for fits_capacity(), if we make a bad
decision, the task will become misfit and will have to wait for the next load
balance to move it to the correct CPU. Hence I thought a reasonable definition
is that fits_capacity() should be a function of the tick: the headroom should
cater for the fact that if a task continues to run without sleeping, then as
long as it still fits by the time we hit the tick (load balance), it should not
be considered a misfit and should be acceptable to run on this CPU.
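
To illustrate the idea, here's a minimal sketch (the signature of
approximate_util_avg() from patch 1 is assumed here, and the actual change to
fits_capacity() may look different):

	/*
	 * Today: fits_capacity(util, max) := util * 1.25 < max
	 *
	 * Instead, ask whether util would still be below the capacity if the
	 * task kept running until the next tick, which is when load balance
	 * gets a chance to correct a misfit placement.
	 */
	static inline bool fits_capacity(unsigned long util, unsigned long max)
	{
		return approximate_util_avg(util, TICK_USEC) < max;
	}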

For the dvfs headroom, the worst that can happen is that util grows above the
capacity@OPP before we get a chance to send a request to the hardware to change
the frequency. This means we should make sure frequency selection provides
enough headroom to cater for the fact that if the task continues to run without
sleeping, the current frequency should still provide a capacity@OPP higher than
util after rate_limit_us of cpufreq transition delay.
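
Similarly, a sketch of the dvfs headroom side (again, the names and how
rate_limit_us is plumbed through are assumptions, not the exact code in the
patches):

	/*
	 * Today: apply_dvfs_headroom(util) := util * 1.25
	 *
	 * Instead, request enough capacity to cover where util could be by
	 * the time we're allowed to issue the next frequency update.
	 */
	static unsigned long apply_dvfs_headroom(unsigned long util, u64 rate_limit_us)
	{
		return approximate_util_avg(util, rate_limit_us);
	}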

To achieve this, we need a new function to help us predict, or approximate,
the util given a delta of runtime. This is what is introduced in patches
1 and 2.
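
For reference, the relation being approximated for a task that keeps running
for delta more time is roughly (assuming the default 32ms halflife and glossing
over PELT's period segmentation; whether patches 1 and 2 compute this as
a closed form or by stepping the existing accumulate_sum()/decay_load()
machinery is an implementation detail):

	util(now + delta) ~= util(now) + (1024 - util(now)) * (1 - 0.5^(delta_ms/32))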

Removing these margins doesn't actually fix the problem of being able to tune
the system response. To do that we introduce a new tunable to schedutil called
response_time_ms which dictates how long it takes the policy to go from minimum
to maximum performance point. This value reflects the time it takes PELT to
grow to the capacity of the CPUs in that policy (which can be different in case
of HMP). It should be a direct representation of the PELT ramp-up time, hence
more meaningful from a tuning perspective as an absolute value of how long it
takes to saturate the policy. It should be much easier for userspace to reason
about an appropriate number given this absolute value. It can be expanded or
shrunk to slow down or speed up the response time, ultimately leading to an
appropriate perf, power and thermal trade-off for the system.
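
For example, to ask the big policy shown below to take roughly twice as long to
go from minimum to maximum performance (illustrative only; the exact semantics
are defined by the schedutil patch):

	# echo 352 > /sys/devices/system/cpu/cpufreq/policy6/schedutil/response_time_ms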

In case of slowing the response time, there's an inherent limitation in that
util_avg saturates at 1024, which means the impact of slowing down beyond
a certain degree would be to lose the top freqs. I think this limitation can be
overcome, but I'm not sure how yet. Suggestions would be appreciated meanwhile.

To further help tune the system, we introduce the PELT HALFLIFE multiplier as
a boot time parameter. This parameter has an impact on how fast we migrate, so
it should compensate for whoever needed to tune fits_capacity(); and it has
a large impact on the default response_time_ms. In particular it gives
a naturally faster rise time when the system gets busy, AND fall time when the
system goes back to idle. It is a coarse grain response control that can be
coupled with the finer grain control of schedutil's response_time_ms.
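
(Usage sketch only - the parameter name below is illustrative, see the PELT
multiplier patch for the actual interface:)

	# append to the kernel command line to halve the PELT halflife to 16ms
	sched_pelt_multiplier=2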

I believe (hope) that making the behavior of fits_capacity() and
apply_dvfs_headroom() more deterministic and scalable across systems - a true
function of their natural limitations - combined with the new, and hopefully
sensible, tunable to manage the reactiveness of the system, should address this
class of problems in a deterministic and sensible/user friendly manner,
allowing the user/sysadmin to achieve what they perceive as the best perf,
power and thermal trade-off.

I'm not a PELT guru, so help in making sure that approximate_util_avg() and
approximate_runtime() are reasonable and correct would be appreciated.

The remainder of the patches should hopefully be straightforward. There are
some pending questions that you'll find in various TODOs/XXX that I'd
appreciate feedback on.

Not comprehensively tested, but booted on a Pixel 6 running a mainline-ish
kernel.

I could see the following as default output for response_time_ms:

	# grep . /sys/devices/system/cpu/cpufreq/policy*/schedutil/*
	/sys/devices/system/cpu/cpufreq/policy0/schedutil/rate_limit_us:2000
	/sys/devices/system/cpu/cpufreq/policy0/schedutil/response_time_ms:13
	/sys/devices/system/cpu/cpufreq/policy4/schedutil/rate_limit_us:2000
	/sys/devices/system/cpu/cpufreq/policy4/schedutil/response_time_ms:42
	/sys/devices/system/cpu/cpufreq/policy6/schedutil/rate_limit_us:2000
	/sys/devices/system/cpu/cpufreq/policy6/schedutil/response_time_ms:176

Note how the little core has a very short saturation time given its small
capacity in practice. fits_capacity() being defined as a function of the tick
means that 1/3rd of its top performance would be ignored (when EAS is active
- !overutilized) - which is desirable since a lot of workloads suffer in terms
of perf by staying for too long on the littles, and given our relatively high
tick values, the earlier move is good.

The biggest policy though has a saturation of 176ms, which I didn't expect. My
measurements in the past were that we need at least 200ms with the 32ms PELT
HF. Maybe I have a bug, or my old measurements are now invalid for some reason.

When I set PELT HALFLIFE to 16ms I get:

	# grep . /sys/devices/system/cpu/cpufreq/policy*/schedutil/*
	/sys/devices/system/cpu/cpufreq/policy0/schedutil/rate_limit_us:2000
	/sys/devices/system/cpu/cpufreq/policy0/schedutil/response_time_ms:7
	/sys/devices/system/cpu/cpufreq/policy4/schedutil/rate_limit_us:2000
	/sys/devices/system/cpu/cpufreq/policy4/schedutil/response_time_ms:21
	/sys/devices/system/cpu/cpufreq/policy6/schedutil/rate_limit_us:2000
	/sys/devices/system/cpu/cpufreq/policy6/schedutil/response_time_ms:79

and for 8ms:

	# grep . /sys/devices/system/cpu/cpufreq/policy*/schedutil/*
	/sys/devices/system/cpu/cpufreq/policy0/schedutil/rate_limit_us:2000
	/sys/devices/system/cpu/cpufreq/policy0/schedutil/response_time_ms:4
	/sys/devices/system/cpu/cpufreq/policy4/schedutil/rate_limit_us:2000
	/sys/devices/system/cpu/cpufreq/policy4/schedutil/response_time_ms:10
	/sys/devices/system/cpu/cpufreq/policy6/schedutil/rate_limit_us:2000
	/sys/devices/system/cpu/cpufreq/policy6/schedutil/response_time_ms:34

policy6 (big core) numbers aren't halving properly. Something to investigate.

I ran Speedometer tests too and could see the score change as I make
response_time_ms faster/slower or modify the PELT HF. I could also see the freq
residency shift according to my changes: top frequencies get higher residencies
as I speed it up, or are never reached / get reduced residency when I slow it
down.

Finally, at the end of the series I modify the default cpufreq transition delay
to be 2ms. I found on several of my Arm based systems that I end up with this
default value, and 10ms is too high nowadays even for a low end system.
I haven't done a full survey to be honest, but I think 10ms is way too high for
the majority of systems out there - even 2ms can be a bit high for a large
class of systems. Suggestions for other values are welcome!

This series is based on tip/sched/core with the below series applied

	https://lore.kernel.org/lkml/20230820210640.585311-1-qyousef@layalina.io/

Many thanks

--
Qais Yousef

Qais Yousef (6):
  sched/pelt: Add a new function to approximate the future util_avg
    value
  sched/pelt: Add a new function to approximate runtime to reach given
    util
  sched/fair: Remove magic margin in fits_capacity()
  sched: cpufreq: Remove magic 1.25 headroom from apply_dvfs_headroom()
  sched/schedutil: Add a new tunable to dictate response time
  cpufreq: Change default transition delay to 2ms

Vincent Donnefort (1):
  sched/pelt: Introduce PELT multiplier

 Documentation/admin-guide/pm/cpufreq.rst | 19 ++++-
 drivers/cpufreq/cpufreq.c                |  4 +-
 kernel/sched/core.c                      |  5 +-
 kernel/sched/cpufreq_schedutil.c         | 80 ++++++++++++++++++++-
 kernel/sched/fair.c                      | 21 +++++-
 kernel/sched/pelt.c                      | 89 ++++++++++++++++++++++++
 kernel/sched/pelt.h                      | 42 +++++++++--
 kernel/sched/sched.h                     | 30 +++++---
 8 files changed, 265 insertions(+), 25 deletions(-)

Comments

Lukasz Luba Sept. 6, 2023, 9:18 a.m. UTC | #1
Hi Qais,

On 8/28/23 00:31, Qais Yousef wrote:
> Since the introduction of EAS and schedutil, we had two magic 0.8 and 1.25
> margins applied in fits_capacity() and apply_dvfs_headroom().
> 
> As reported two years ago in
> 
> 	https://lore.kernel.org/lkml/1623855954-6970-1-git-send-email-yt.chang@mediatek.com/
> 
> these values are not good fit for all systems and people do feel the need to
> modify them regularly out of tree.

That is true, in the Android kernel those are known 'features'. Furthermore,
in my game testing it looks like higher margins do help to shrink the
number of dropped frames, while on other types of workloads (e.g.
those that you have in the link above) the 0% shows better energy.

I remember also the results from MTK regarding the PELT HALF_LIFE

https://lore.kernel.org/all/0f82011994be68502fd9833e499749866539c3df.camel@mediatek.com/

The numbers for the 8ms half_life were showing a really nice improvement
for the 'min fps' metric. I got something similar with a higher margin.

IMO we can derive quite important information from those different
experiments:
More sustainable workloads like "Yahoo browser" don't need margin.
More unpredictable workloads like "Fortnite" (shooter game with 'open
world') need some decent margin.

The problem is that the periodic task can be 'noisy'. The low-pass
filter which is our exponentially weighted moving avg PELT will
'smooth' the measured values. It will block sudden 'spikes' since
they are high-frequency changes. Those sudden 'spikes' are
the task activations where we need to compute a bit longer, e.g.
there was an explosion in the game. The 25% margin helps us to
be ready for this 'noisy' task - the CPU frequency is higher
(and capacity). So if a sudden need for longer computation
is seen, then we have enough 'idle time' (~25% idle) to serve this
properly and not lose the frame.

The margin helps in two ways for 'noisy' workloads:
1. in fits_capacity() to avoid a CPU which couldn't handle it
    and prefer CPUs with higher capacity
2. it asks for longer 'idle time' e.g. 25-40% (depends on margin) to
    serve sudden computation need

IIUC, your proposal is to:
1. extend the low-pass filter to some higher frequency, so we
    could see those 'spikes' - that's the PELT HALF_LIFE boot
    parameter for 8ms
1.1. You are likely to have a 'gift' from the Util_est
      which picks the max util_avg values and maintains them
      for a while. That's why the 8ms PELT information can last longer
      and you can get higher frequency and longer idle time.
2. Plumb in this new idea of dvfs_update_delay as the new
    'margin' - this I don't understand

For 2., I don't see that the DVFS HW characteristics are the best
fit for this problem. We can have really fast DVFS HW,
but we still need some decent spare idle time in some workloads;
these are two independent issues IMO. You might get the higher
idle time thanks to 1.1. but this is a 'side effect'.

Could you explain a bit more why this dvfs_update_delay is
crucial here?

Regards,
Lukasz
Qais Yousef Sept. 6, 2023, 9:18 p.m. UTC | #2
Hi Lukasz

On 09/06/23 10:18, Lukasz Luba wrote:
> Hi Qais,
> 
> On 8/28/23 00:31, Qais Yousef wrote:
> > Since the introduction of EAS and schedutil, we had two magic 0.8 and 1.25
> > margins applied in fits_capacity() and apply_dvfs_headroom().
> > 
> > As reported two years ago in
> > 
> > 	https://lore.kernel.org/lkml/1623855954-6970-1-git-send-email-yt.chang@mediatek.com/
> > 
> > these values are not good fit for all systems and people do feel the need to
> > modify them regularly out of tree.
> 
> That is true, in Android kernel those are known 'features'. Furthermore,
> in my game testing it looks like higher margins do help to shrink
> number of dropped frames, while on other types of workloads (e.g.
> those that you have in the link above) the 0% shows better energy.

Do you keep margins high for all types of CPU? I think the littles are the
problematic ones, where higher margins help as this means you move away from
them quickly.

> 
> I remember also the results from MTK regarding the PELT HALF_LIFE
> 
> https://lore.kernel.org/all/0f82011994be68502fd9833e499749866539c3df.camel@mediatek.com/
> 
> The numbers for 8ms half_life where showing really nice improvement
> for the 'min fps' metric. I got similar with higher margin.
> 
> IMO we can derive quite important information from those different
> experiments:
> More sustainable workloads like "Yahoo browser" don't need margin.
> More unpredictable workloads like "Fortnite" (shooter game with 'open
> world') need some decent margin.

Yeah. So the point is that while we should have a sensible default, there
isn't a one size fits all. But the question is how the user/sysadmin should
control this? This series is what I propose of course :)

I also think the current forced/fixed margin values enforce a policy that is
clearly not a good default on many systems, with no alternative in hand but to
hack their own solutions.

> 
> The problem is that the periodic task can be 'noisy'. The low-pass

Hehe. That's because they're not really periodic ;-)

I think the model of a periodic task is not suitable for most workloads. All
of them are dynamic and how much they need to do at each wake up can vary
significantly over 10s of ms.

> filter which is our exponentially weighted moving avg PELT will
> 'smooth' the measured values. It will block sudden 'spikes' since
> they are high-frequency changes. Those sudden 'spikes' are
> the task activations where we need to compute a bit longer, e.g.
> there was explosion in the game. The 25% margin helps us to
> be ready for this 'noisy' task - the CPU frequency is higher
> (and capacity). So if a sudden need for longer computation
> is seen, then we have enough 'idle time' (~25% idle) to serve this
> properly and not loose the frame.
> 
> The margin helps in two ways for 'noisy' workloads:
> 1. in fits_capacity() to avoid a CPU which couldn't handle it
>    and prefers CPUs with higher capacity
> 2. it asks for longer 'idle time' e.g. 25-40% (depends on margin) to
>    serve sudden computation need
> 
> IIUC, your proposal is to:
> 1. extend the low-pass filter to some higher frequency, so we
>    could see those 'spikes' - that's the PELT HALF_LIFE boot
>    parameter for 8ms

That's another way to look at it, yes. We can control how reactive we'd like
the system to be for changes.

> 1.1. You are likely to have a 'gift' from the Util_est
>      which picks the max util_avg values and maintains them
>      for a while. That's why the 8ms PELT information can last longer
>      and you can get higher frequency and longer idle time.

This is probably a controversial statement, but I am not in favour of util_est.
I need to collect the data, but I think we're better off with a 16ms PELT
HALFLIFE as default instead. But I will need to do a separate investigation on
that.

> 2. Plumb in this new idea of dvfs_update_delay as the new
>    'margin' - this I don't understand
> 
> For the 2. I don't see that the dvfs HW characteristics are best
> for this problem purpose. We can have a really fast DVFS HW,
> but we need some decent spare idle time in some workloads, which
> are two independent issues IMO. You might get the higher
> idle time thanks to 1.1. but this is a 'side effect'.
> 
> Could you explain a bit more why this dvfs_update_delay is
> crucial here?

I'm not sure why you relate this to idle time. And the word margin is a bit
overloaded here, so I suppose you're referring to the one we have in
map_util_perf() or apply_dvfs_headroom(). And I suppose you assume this extra
headroom will result in idle time, but this is not necessarily true IMO.

My rationale is simply that DVFS based on util should follow util_avg as-is.
But as pointed out in different discussions that happened elsewhere, we need to
provide headroom for this util to grow: if we were to be exact and the task
continues to run, then likely the util will go above the current OPP before we
get a chance to change it again. If we had ideal hardware that takes
0 time to change frequency, then this headroom IMO would not be needed, because
the frequency would follow us as util grows. Assuming here that util updates
instantaneously as the task continues to run.

So instead of a constant 25% headroom, I redefine this to be a function of the
hardware delay. If we take a decision now to choose which OPP, then it should
be based on the util_avg value after taking into account how much it'll grow
before we take the next decision (which is the dvfs_update_delay). We don't
need any more than that.

Maybe we need to take into account how often we call update_load_avg(). I'm not
sure about this yet.

If the user wants to have faster response time, then the new knobs are the way
to control that. But the headroom should be small enough to make sure we don't
overrun until the next decision point. Not less, and not more.
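
To put a rough number on it (my arithmetic, assuming the default 32ms halflife,
a 2ms rate_limit_us, and ignoring PELT's period segmentation): a task sitting
at util = 800 that keeps running can only grow by about
(1024 - 800) * (1 - 0.5^(2/32)) ~= 10 before the next frequency update, so
a headroom of roughly 1-2% covers it, whereas the fixed 25% asks for capacity
for util ~1000.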


Thanks!

--
Qais Yousef
Lukasz Luba Sept. 7, 2023, 7:48 a.m. UTC | #3
On 9/6/23 22:18, Qais Yousef wrote:
> Hi Lukasz
> 
> On 09/06/23 10:18, Lukasz Luba wrote:
>> Hi Qais,
>>
>> On 8/28/23 00:31, Qais Yousef wrote:
>>> Since the introduction of EAS and schedutil, we had two magic 0.8 and 1.25
>>> margins applied in fits_capacity() and apply_dvfs_headroom().
>>>
>>> As reported two years ago in
>>>
>>> 	https://lore.kernel.org/lkml/1623855954-6970-1-git-send-email-yt.chang@mediatek.com/
>>>
>>> these values are not good fit for all systems and people do feel the need to
>>> modify them regularly out of tree.
>>
>> That is true, in Android kernel those are known 'features'. Furthermore,
>> in my game testing it looks like higher margins do help to shrink
>> number of dropped frames, while on other types of workloads (e.g.
>> those that you have in the link above) the 0% shows better energy.
> 
> Do you keep margins high for all types of CPU? I think the littles are the
> problematic ones which higher margins helps as this means you move away from
> them quickly.

That's true, for the Littles higher margins help to evacuate tasks
sooner. I have experiments showing good results with a 60% margin on
the Littles, while on Big & Mid 20%, 30%. The Littles also still have
tasks in cgroup cpumasks which are quite random, so those cannot
migrate, but they get a bit higher 'idle time' headroom.


> 
>>
>> I remember also the results from MTK regarding the PELT HALF_LIFE
>>
>> https://lore.kernel.org/all/0f82011994be68502fd9833e499749866539c3df.camel@mediatek.com/
>>
>> The numbers for 8ms half_life where showing really nice improvement
>> for the 'min fps' metric. I got similar with higher margin.
>>
>> IMO we can derive quite important information from those different
>> experiments:
>> More sustainable workloads like "Yahoo browser" don't need margin.
>> More unpredictable workloads like "Fortnite" (shooter game with 'open
>> world') need some decent margin.
> 
> Yeah. So the point is that while we should have a sensible default, but there
> isn't a one size fits all. But the question is how the user/sysadmin should
> control this? This series is what I propose of course :)
> 
> I also think the current forced/fixed margin values enforce a policy that is
> clearly not a good default on many systems. With no alternative in hand but to
> hack their own solutions.

I see.

> 
>>
>> The problem is that the periodic task can be 'noisy'. The low-pass
> 
> Hehe. That's because they're not really periodic ;-)

They are periodic in a sense: they wake up every 16ms, but sometimes
they have more work. It depends on what is currently going on in the game
and/or sometimes on data locality (might not be in the cache).

Although that's for games; other workloads like YouTube playback or this
'Yahoo browser' one (from your example) are more 'predictable' (after
the start up period). And I really like the potential energy saving
there :)

> 
> I think the model of a periodic task is not suitable for most workloads. All
> of them are dynamic and how much they need to do at each wake up can very
> significantly over 10s of ms.

Might be true; the model was built a few years ago when there weren't
such dynamic game scenarios with high FPS on mobiles. This could still
be tuned with your new design IIUC (no need for extra hooks in Android).

> 
>> filter which is our exponentially weighted moving avg PELT will
>> 'smooth' the measured values. It will block sudden 'spikes' since
>> they are high-frequency changes. Those sudden 'spikes' are
>> the task activations where we need to compute a bit longer, e.g.
>> there was explosion in the game. The 25% margin helps us to
>> be ready for this 'noisy' task - the CPU frequency is higher
>> (and capacity). So if a sudden need for longer computation
>> is seen, then we have enough 'idle time' (~25% idle) to serve this
>> properly and not loose the frame.
>>
>> The margin helps in two ways for 'noisy' workloads:
>> 1. in fits_capacity() to avoid a CPU which couldn't handle it
>>     and prefers CPUs with higher capacity
>> 2. it asks for longer 'idle time' e.g. 25-40% (depends on margin) to
>>     serve sudden computation need
>>
>> IIUC, your proposal is to:
>> 1. extend the low-pass filter to some higher frequency, so we
>>     could see those 'spikes' - that's the PELT HALF_LIFE boot
>>     parameter for 8ms
> 
> That's another way to look at it, yes. We can control how reactive we'd like
> the system to be for changes.

Which makes sense in the context of what I said above (newer gaming).

> 
>> 1.1. You are likely to have a 'gift' from the Util_est
>>       which picks the max util_avg values and maintains them
>>       for a while. That's why the 8ms PELT information can last longer
>>       and you can get higher frequency and longer idle time.
> 
> This is probably controversial statement. But I am not in favour of util_est.
> I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as
> default instead. But I will need to do a separate investigation on that.

I like util_est, sometimes it helps ;)

> 
>> 2. Plumb in this new idea of dvfs_update_delay as the new
>>     'margin' - this I don't understand
>>
>> For the 2. I don't see that the dvfs HW characteristics are best
>> for this problem purpose. We can have a really fast DVFS HW,
>> but we need some decent spare idle time in some workloads, which
>> are two independent issues IMO. You might get the higher
>> idle time thanks to 1.1. but this is a 'side effect'.
>>
>> Could you explain a bit more why this dvfs_update_delay is
>> crucial here?
> 
> I'm not sure why you relate this to idle time. And the word margin is a bit
> overloaded here. so I suppose you're referring to the one we have in
> map_util_perf() or apply_dvfs_headroom(). And I suppose you assume this extra
> headroom will result in idle time, but this is not necessarily true IMO.
> 
> My rationale is simply that DVFS based on util should follow util_avg as-is.
> But as pointed out in different discussions happened elsewhere, we need to
> provide a headroom for this util to grow as if we were to be exact and the task
> continues to run, then likely the util will go above the current OPP before we
> get a chance to change it again. If we do have an ideal hardware that takes

Yes, this is another requirement for having a +X% margin. When tasks are
growing, we don't know their final util_avg and we give them a bit more
cycles.
IMO we always have to be ready for such a situation in the scheduler,
don't we?

> 0 time to change frequency, then this headroom IMO is not needed because
> frequency will follow us as util grows. Assuming here that util updates
> instantaneously as the task continues to run.
> 
> So instead of a constant 25% headroom; I redefine this to be a function of the
> hardware delay. If we take a decision now to choose which OPP, then it should
> be based on util_avg value after taking into account how much it'll grow before
> we take the next decision (which the dvfs_update_delay). We don't need any more
> than that.
> 
> Maybe we need to take into account how often we call update_load_avg(). I'm not
> sure about this yet.
> 
> If the user wants to have faster response time, then the new knobs are the way
> to control that. But the headroom should be small enough to make sure we don't
> overrun until the next decision point. Not less, and not more.

For ideal workloads (rt-app) or those 'calm' ones, yes, we could save energy
(as you pointed out with the 0% margin energy values). I do like this 10%
energy saving in some DoU scenarios. I couldn't catch the idea of
feeding the dvfs response information into this equation, though. We might
discuss this offline ;)

Cheers,
Lukasz
Peter Zijlstra Sept. 7, 2023, 11:53 a.m. UTC | #4
On Thu, Sep 07, 2023 at 08:48:08AM +0100, Lukasz Luba wrote:

> > Hehe. That's because they're not really periodic ;-)
> 
> They are periodic in a sense, they wake up every 16ms, but sometimes
> they have more work. It depends what is currently going in the game
> and/or sometimes the data locality (might not be in cache).
> 
> Although, that's for games, other workloads like youtube play or this
> one 'Yahoo browser' (from your example) are more 'predictable' (after
> the start up period). And I really like the potential energy saving
> there :)

So everything media is fundamentally periodic, you're hard tied to the
framerate / audio-buffer size etc..

Also note that the traditional periodic task model from the real-time
community has the notion of WCET, which completely covers this
fluctuation in frame-to-frame work, it only considers the absolute worst
case.

Now, practically, that stinks, esp. when you care about batteries, but
it does not mean these tasks are not periodic.

Many extensions to the periodic task model are possible, including
things like average runtime with bursts etc.. all have their trade-offs.
Lukasz Luba Sept. 7, 2023, 1:06 p.m. UTC | #5
On 9/7/23 12:53, Peter Zijlstra wrote:
> On Thu, Sep 07, 2023 at 08:48:08AM +0100, Lukasz Luba wrote:
> 
>>> Hehe. That's because they're not really periodic ;-)
>>
>> They are periodic in a sense, they wake up every 16ms, but sometimes
>> they have more work. It depends what is currently going in the game
>> and/or sometimes the data locality (might not be in cache).
>>
>> Although, that's for games, other workloads like youtube play or this
>> one 'Yahoo browser' (from your example) are more 'predictable' (after
>> the start up period). And I really like the potential energy saving
>> there :)
> 
> So everything media is fundamentally periodic, you're hard tied to the
> framerate / audio-buffer size etc..

Agree

> 
> Also note that the traditional periodic task model from the real-time
> community has the notion of WCET, which completely covers this
> fluctuation in frame-to-frame work, it only considers the absolute worst
> case.

That's a good point, the WCET here. IMO a shorter PELT, e.g. 8ms, allows us
to 'see' a bit more of that information: the worst case in the fluctuation of
a particular task. Then this 'seen' value is maintained in util_est
for a while. That's why (probably) I see better 95th-, 99th-percentile
numbers for frame rendering times.

> 
> Now, practically, that stinks, esp. when you care about batteries, but
> it does not mean these tasks are not periodic.

Totally agree they are periodic.

> 
> Many extentions to the periodic task model are possible, including
> things like average runtime with bursts etc.. all have their trade-offs.

Was that maybe proposed somewhere on LKML (the other models)?

I can recall one idea - WALT.
IIRC ~2016/2017 there was the WALT proposal and some discussion/conferences,
but it didn't get positive feedback [1].

I don't know if you remember those numbers back then, e.g. 1080p video
playback was using ~10% less energy... Those 10%-15% are still important
for us ;)

Regards,
Lukasz

[1] 
https://lore.kernel.org/all/1477638642-17428-1-git-send-email-markivx@codeaurora.org/
Peter Zijlstra Sept. 7, 2023, 1:08 p.m. UTC | #6
On Mon, Aug 28, 2023 at 12:31:56AM +0100, Qais Yousef wrote:

> Equally recent discussion in PELT HALFLIFE thread highlighted the need for
> a way to tune system response time to achieve better perf, power and thermal
> characteristic for a given system
> 
> 	https://lore.kernel.org/lkml/20220829055450.1703092-1-dietmar.eggemann@arm.com/
> 

> To further help tune the system, we introduce PELT HALFLIFE multiplier as
> a boot time parameter. This parameter has an impact on how fast we migrate, so
> should compensate for whoever needed to tune fits_capacity(); and it has great
> impact on default response_time_ms. Particularly it gives a natural faster rise
> time when the system gets busy, AND fall time when the system goes back to
> idle. It is coarse grain response control that can be coupled with finer grain
> control via schedutil's response_time_ms.

You're misrepresenting things... The outcome of that thread above was
that the PELT halflife was not the primary problem. Specifically:

  https://lore.kernel.org/lkml/424e2c81-987d-f10e-106d-8b4c611768bc@arm.com/

mentions that the only thing that gaming nonsense cares about is DVFS
ramp-up.

None of the other PELT users mattered one bit.

Also, ISTR a fair amount of this was workload dependent. So a solution
that has per-task configurability -- like UTIL_EST_FASTER, seems more
suitable.


I'm *really* hesitant on adding all these mostly random knobs -- esp.
without strong justification -- which you don't present. You mostly seem
to justify things with: people do random hack, we should legitimize them
hacks.

Like the last time around, I want the actual problem explained. The
problem is not that random people on the internet do random things to
their kernel.
Peter Zijlstra Sept. 7, 2023, 1:26 p.m. UTC | #7
On Wed, Sep 06, 2023 at 10:18:50PM +0100, Qais Yousef wrote:

> This is probably controversial statement. But I am not in favour of util_est.
> I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as
> default instead. But I will need to do a separate investigation on that.

I think util_est makes perfect sense: where PELT has to fundamentally
decay non-running / non-runnable tasks in order to provide a temporal
average, DVFS might be best served with a temporal max filter.
Peter Zijlstra Sept. 7, 2023, 1:29 p.m. UTC | #8
On Thu, Sep 07, 2023 at 02:06:15PM +0100, Lukasz Luba wrote:

> > Many extentions to the periodic task model are possible, including
> > things like average runtime with bursts etc.. all have their trade-offs.
> 
> Was that maybe proposed somewhere on LKML (the other models)?

RT literature mostly methinks. Replacing WCET with a statistical model of
sorts is not uncommon; the argument goes that not everybody will have
their worst case at the same time and lows and highs can commonly cancel
out, and this way we can cram a little more onto the system.

Typically this is proposed in the context of soft-realtime systems.
Lukasz Luba Sept. 7, 2023, 1:33 p.m. UTC | #9
On 9/7/23 14:29, Peter Zijlstra wrote:
> On Thu, Sep 07, 2023 at 02:06:15PM +0100, Lukasz Luba wrote:
> 
>>> Many extentions to the periodic task model are possible, including
>>> things like average runtime with bursts etc.. all have their trade-offs.
>>
>> Was that maybe proposed somewhere on LKML (the other models)?
> 
> RT literatur mostly methinks. Replacing WCET with a statistical model of
> sorts is not uncommon, the argument goes that not everybody will have
> their worst case at the same time and lows and highs can commonly cancel
> out and this way we can cram a little more on the system.
> 
> Typically this is proposed in the context of soft-realtime systems.

Thanks Peter, I will dive into some books...
Peter Zijlstra Sept. 7, 2023, 1:38 p.m. UTC | #10
On Thu, Sep 07, 2023 at 02:33:49PM +0100, Lukasz Luba wrote:
> 
> 
> On 9/7/23 14:29, Peter Zijlstra wrote:
> > On Thu, Sep 07, 2023 at 02:06:15PM +0100, Lukasz Luba wrote:
> > 
> > > > Many extentions to the periodic task model are possible, including
> > > > things like average runtime with bursts etc.. all have their trade-offs.
> > > 
> > > Was that maybe proposed somewhere on LKML (the other models)?
> > 
> > RT literatur mostly methinks. Replacing WCET with a statistical model of
> > sorts is not uncommon, the argument goes that not everybody will have
> > their worst case at the same time and lows and highs can commonly cancel
> > out and this way we can cram a little more on the system.
> > 
> > Typically this is proposed in the context of soft-realtime systems.
> 
> Thanks Peter, I will dive into some books...

I would look at academic papers, not sure any of that ever made it to
books, Daniel would know I suppose.
Lukasz Luba Sept. 7, 2023, 1:45 p.m. UTC | #11
On 9/7/23 14:38, Peter Zijlstra wrote:
> On Thu, Sep 07, 2023 at 02:33:49PM +0100, Lukasz Luba wrote:
>>
>>
>> On 9/7/23 14:29, Peter Zijlstra wrote:
>>> On Thu, Sep 07, 2023 at 02:06:15PM +0100, Lukasz Luba wrote:
>>>
>>>>> Many extentions to the periodic task model are possible, including
>>>>> things like average runtime with bursts etc.. all have their trade-offs.
>>>>
>>>> Was that maybe proposed somewhere on LKML (the other models)?
>>>
>>> RT literatur mostly methinks. Replacing WCET with a statistical model of
>>> sorts is not uncommon, the argument goes that not everybody will have
>>> their worst case at the same time and lows and highs can commonly cancel
>>> out and this way we can cram a little more on the system.
>>>
>>> Typically this is proposed in the context of soft-realtime systems.
>>
>> Thanks Peter, I will dive into some books...
> 
> I would look at academic papers, not sure any of that ever made it to
> books, Daniel would know I suppose.

Good hint, thanks!
Lukasz Luba Sept. 7, 2023, 1:57 p.m. UTC | #12
On 9/7/23 14:26, Peter Zijlstra wrote:
> On Wed, Sep 06, 2023 at 10:18:50PM +0100, Qais Yousef wrote:
> 
>> This is probably controversial statement. But I am not in favour of util_est.
>> I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as
>> default instead. But I will need to do a separate investigation on that.
> 
> I think util_est makes perfect sense, where PELT has to fundamentally
> decay non-running / non-runnable tasks in order to provide a temporal
> average, DVFS might be best served with a termporal max filter.
> 
> 

Since we are here...
Would you allow a configuration knob for
the util_est shifter: UTIL_EST_WEIGHT_SHIFT ?

I've found values other than '2' better in some scenarios. That helps
to prevent a big task from 'down' migrating from a Big CPU (1024) to some
Mid CPU (~500-700 capacity) or even a Little (~120-300).
Peter Zijlstra Sept. 7, 2023, 2:29 p.m. UTC | #13
On Thu, Sep 07, 2023 at 02:57:26PM +0100, Lukasz Luba wrote:
> 
> 
> On 9/7/23 14:26, Peter Zijlstra wrote:
> > On Wed, Sep 06, 2023 at 10:18:50PM +0100, Qais Yousef wrote:
> > 
> > > This is probably controversial statement. But I am not in favour of util_est.
> > > I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as
> > > default instead. But I will need to do a separate investigation on that.
> > 
> > I think util_est makes perfect sense, where PELT has to fundamentally
> > decay non-running / non-runnable tasks in order to provide a temporal
> > average, DVFS might be best served with a termporal max filter.
> > 
> > 
> 
> Since we are here...
> Would you allow to have a configuration for
> the util_est shifter: UTIL_EST_WEIGHT_SHIFT ?
> 
> I've found other values than '2' better in some scenarios. That helps
> to prevent a big task to 'down' migrate from a Big CPU (1024) to some
> Mid CPU (~500-700 capacity) or even Little (~120-300).

Larger values, I'm thinking you're after? Those would cause the new
contribution to weight less, making the function more smooth, right?

What task characteristic is tied to this? That is, this seems trivial to
modify per-task.
Lukasz Luba Sept. 7, 2023, 2:42 p.m. UTC | #14
On 9/7/23 15:29, Peter Zijlstra wrote:
> On Thu, Sep 07, 2023 at 02:57:26PM +0100, Lukasz Luba wrote:
>>
>>
>> On 9/7/23 14:26, Peter Zijlstra wrote:
>>> On Wed, Sep 06, 2023 at 10:18:50PM +0100, Qais Yousef wrote:
>>>
>>>> This is probably controversial statement. But I am not in favour of util_est.
>>>> I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as
>>>> default instead. But I will need to do a separate investigation on that.
>>>
>>> I think util_est makes perfect sense, where PELT has to fundamentally
>>> decay non-running / non-runnable tasks in order to provide a temporal
>>> average, DVFS might be best served with a termporal max filter.
>>>
>>>
>>
>> Since we are here...
>> Would you allow to have a configuration for
>> the util_est shifter: UTIL_EST_WEIGHT_SHIFT ?
>>
>> I've found other values than '2' better in some scenarios. That helps
>> to prevent a big task to 'down' migrate from a Big CPU (1024) to some
>> Mid CPU (~500-700 capacity) or even Little (~120-300).
> 
> Larger values, I'm thinking you're after? Those would cause the new
> contribution to weight less, making the function more smooth, right?

Yes, more smooth, because we only use the 'ewma' goodness for the decaying
part (not the rising [1]).

> 
> What task characteristic is tied to this? That is, this seems trivial to
> modify per-task.

In particular the Speedometer test and the main browser task, which reaches
~900 util, but sometimes vanishes and waits for other background tasks
to do something. In the meantime it can decay and wake up on
a Mid/Little (which can cause a penalty to the score of up to 5-10% vs. if
we pin the task to big CPUs). So a longer util_est helps to avoid
at least the very bad down migration to the Littles...
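
For reference, a simplified sketch of the update that the shift controls
(paraphrasing util_est_update() in fair.c; the real code works on unsigned
values and fast-ramps when util is rising):

	/*
	 * With the default UTIL_EST_WEIGHT_SHIFT of 2 the new sample weighs
	 * 1/4 (signed arithmetic assumed here for brevity).
	 */
	ewma += (task_util - ewma) >> UTIL_EST_WEIGHT_SHIFT;

A larger shift gives the latest activation a smaller weight, so the estimate
decays more slowly when the task's util drops between activations.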

[1] https://elixir.bootlin.com/linux/v6.5.1/source/kernel/sched/fair.c#L4442
Peter Zijlstra Sept. 7, 2023, 8:16 p.m. UTC | #15
On Thu, Sep 07, 2023 at 03:42:13PM +0100, Lukasz Luba wrote:

> > What task characteristic is tied to this? That is, this seems trivial to
> > modify per-task.
> 
> In particular Speedometer test and the main browser task, which reaches
> ~900util, but sometimes vanish and waits for other background tasks
> to do something. In the meantime it can decay and wake-up on
> Mid/Little (which can cause a penalty to score up to 5-10% vs. if
> we pin the task to big CPUs). So, a longer util_est helps to avoid
> at least very bad down migration to Littles...

Do they do a few short activations (wakeup/sleeps) while waiting? That
would indeed completely ruin things since the EWMA thing is activation
based.

I wonder if there's anything sane we can do here...
Qais Yousef Sept. 8, 2023, 12:17 a.m. UTC | #16
On 09/07/23 15:08, Peter Zijlstra wrote:
> On Mon, Aug 28, 2023 at 12:31:56AM +0100, Qais Yousef wrote:
> 
> > Equally recent discussion in PELT HALFLIFE thread highlighted the need for
> > a way to tune system response time to achieve better perf, power and thermal
> > characteristic for a given system
> > 
> > 	https://lore.kernel.org/lkml/20220829055450.1703092-1-dietmar.eggemann@arm.com/
> > 
> 
> > To further help tune the system, we introduce PELT HALFLIFE multiplier as
> > a boot time parameter. This parameter has an impact on how fast we migrate, so
> > should compensate for whoever needed to tune fits_capacity(); and it has great
> > impact on default response_time_ms. Particularly it gives a natural faster rise
> > time when the system gets busy, AND fall time when the system goes back to
> > idle. It is coarse grain response control that can be coupled with finer grain
> > control via schedutil's response_time_ms.
> 
> You're misrepresenting things... The outcome of that thread above was

Sorry if I did. My PoV might have gotten skewed. I'm not intending to mislead
for sure. I actually was hesitant about adding the PELT patch initially, but
it did feel that the two topics are connected. Margins are causing problems
because they end up wasting power. So there's a desire to slow current response
down. But this PELT story wanted to speed things up. This polar opposite is
what I think is the distilled problem.

> that PELT halftime was not the primary problem. Specifically:
> 
>   https://lore.kernel.org/lkml/424e2c81-987d-f10e-106d-8b4c611768bc@arm.com/
> 
> mentions that the only thing that gaming nonsense cares about is DVFS
> ramp-up.
> 
> None of the other PELT users mattered one bit.

I actually latched on to Vincent's response that a boot time parameter makes
sense.

Just to be clear, my main issue here is with the current hardcoded values of
the 'margins', and the fact that they make things go too fast is my main
problem.

The way I saw PELT fitting into this story is to help lower end systems that
don't have a lot of oomph. For reasonably powerful systems I don't see
a necessity to change this and DVFS is what matters, I agree.

It was my attempt to draw a full picture and cover the full spectrum. I don't
think the PELT halflife plays a role in powerful systems. But for under-powered
ones, I think it will help; and that's why I was depicting it as coarse grain
control.

I think I did try to present similar arguments on that thread.

> 
> Also, ISTR a fair amount of this was workload dependent. So a solution
> that has per-task configurability -- like UTIL_EST_FASTER, seems more
> suitable.

But for the 0.8 and 1.25 margin problems, actually the problem is that 25% is
too aggressive/fast and wastes power. I'm actually slowing things down as
a result of this series. And I'm expecting some not to be happy about it on
their systems. The response_time_ms was my way to give back control. I didn't
see how I can make things faster and slower at the same time without making
decisions on behalf of the user/sysadmin.

So the connection I see between PELT and the margins or headrooms in
fits_capacity() and map_util_perf()/dvfs_headroom is that they expose the need
to manage the perf/power trade-off of the system.

In particular, the default is not good for modern systems; a Cortex-X is too
powerful, but we still operate within the same power and thermal budgets.

And what was a high end A78 is a mid core today. So if you look at today's
mobile world topology, we really have a tiny+big+huge combination of cores. The
bigs are called mids, but they're very capable. fits_capacity() forces
migration to the 'huge' cores too soon with that 80% margin, while the 80%
might be too small for the tiny ones, as some workloads really struggle there
if they hang on for too long. It doesn't help that these systems ship with
a 4ms tick. Something more to consider changing, I guess.

And the 25% headroom forces near max frequency to be used when the workload is
happily hovering in the 750 region. I did force the frequency to be lower and
the workload is happily running - we don't need the extra 25% headroom
enforced unconditionally.

UTIL_EST_FASTER moves in one direction. And it's a constant response too, no?
I didn't get the per-task configurability part. AFAIU we can't turn off these
sched-features if they end up causing power issues. That's what makes me
hesitant about them. There's a bias towards perf. But some systems prefer to
save power at the expense of perf. There's a lot of grey area in between as to
what is perceived as a suitable trade-off for perf vs power. There are cases
like the above where you can actually lower freqs without a hit on perf. But
most of the time it's a trade-off; and some do decide to drop perf in favour of
power. Keep in mind battery capacity differs even between systems with the same
SoC. Some ship to enable more perf, others are more constrained and opt to be
more efficient.

Sorry I didn't explain things well in the cover letter.

> I'm *really* hesitant on adding all these mostly random knobs -- esp.
> without strong justification -- which you don't present. You mostly seem
> to justify things with: people do random hack, we should legitimize them
> hacks.

I share your sentiment and I am trying to find out what's the right thing to do
really. I am open to exploring other territories. But from what I see there's
a real need to give users the power to tune how responsive their system needs
to be. I can't see how we can have one size that fits all here given the
different capabilities of the systems and the desired outcome (I want more perf
vs more efficiency).

> Like the last time around, I want the actual problem explained. The
> problem is not that random people on the internet do random things to
> their kernel.

The problem is that those 0.8 and 1.25 margins force an unsuitable default. The
case I see the most is that they waste power, and tuning them down regains this
power at no perf cost or a small one. Others actually do tune them for a faster
response, but I can't cover this case in detail. All I know is that lower end
systems do struggle as they don't have enough oomph. I also saw a comparison on
Phoronix where schedutil is still not doing as well - which tells me it seems
server systems prefer to ramp up faster too. I think that PELT thread is
a variation of the same problem.

So one of the things I saw is a workload that spends the majority of its time
in the 600-750 util_avg range and rarely ramps up to max. But the workload
under-uses the medium cores and runs at a lot higher freqs than it really needs
on the bigs. We don't end up utilizing our resources properly.

Happy to go and dig for more data/info if this is not good enough :)

There's a question that I'm struggling with, if I may ask. Why is our constant
response time (practically ~200ms to go from 0 to max) perceived as a good fit
for all use cases? The capability of systems differs widely in terms of what
performance you get at, say, a util of 512. Or in other words, how much work is
done in a unit of time differs between systems, but we still represent that
work in a constant way. A task that ran for 10ms on powerful System A would
have done a lot more work than running on poor System B for the same 10ms, but
util will still rise the same in both cases. If someone wants to allow this
task to be able to do more in those 10ms, it seems natural to be able to
control this response time. It seems this thinking is flawed for some reason
and I'd appreciate help understanding why. I think a lot of us perceive this
problem this way.

Hopefully uclamp will help address these issues in a better way. ADPF gives
apps a way to access it reasonably now. Unity announced support for ADPF, so
hopefully games and other workloads can learn to be smarter over time. But the
spectrum of workloads to cover is still big, and adoption will take time. And
there are still lessons to be learnt and improvements to make. I expect this
effort to take time before it's the norm. And thinking of desktop systems:
distros like Debian for example still don't enable uclamp by default on their
kernels. I sent a request to enable it and it got added to the wishlist.
Actually even schedutil is not enabled by default on my Pine64 running Armbian,
nor on my Mac Mini with an M1 chip running Asahi Linux Ubuntu. You'd think
big.LITTLE systems should have EAS written all over them, but I'm not sure if
this is an accidental omission or if ondemand is actually perceived as better.
I think my Intel systems also still don't run schedutil by default. I'm not
waking them up over LAN now to double check though (yep, saving power :D).

Happy to go and try to dig more info if this is still not clear enough.


Thanks!

--
Qais Yousef
Dietmar Eggemann Sept. 8, 2023, 7:40 a.m. UTC | #17
On 08/09/2023 02:17, Qais Yousef wrote:
> On 09/07/23 15:08, Peter Zijlstra wrote:
>> On Mon, Aug 28, 2023 at 12:31:56AM +0100, Qais Yousef wrote:

[...]

> But for the 0.8 and 1.25 margin problems, actually the problem is that 25% is
> too aggressive/fast and wastes power. I'm actually slowing things down as
> a result of this series. And I'm expecting some not to be happy about it on
> their systems. The response_time_ms was my way to give back control. I didn't
> see how I can make things faster and slower at the same time without making
> decisions on behalf of the user/sysadmin.
> 
> So the connection I see between PELT and the margins or headrooms in
> fits_capacity() and map_util_perf()/dvfs_headroom is that they expose the need
> to manage the perf/power trade-off of the system.
> 
> Particularly the default is not good for the modern systems, Cortex-X is too
> powerful but we still operate within the same power and thermal budgets.
> 
> And what was a high end A78 is a mid core today. So if you look at today's
> mobile world topology we really have a tiy+big+huge combination of cores. The
> bigs are called mids, but they're very capable. Fits capacity forces migration
> to the 'huge' cores too soon with that 80% margin. While the 80% might be too
> small for the tiny ones as some workloads really struggle there if they hang on
> for too long. It doesn't help that these systems ship with 4ms tick. Something
> more to consider changing I guess.

If this is the problem then you could simply make the margin (headroom)
a function of cpu_capacity_orig?

[...]

> There's a question that I'm struggling with if I may ask. Why is it perceived
> our constant response time (practically ~200ms to go from 0 to max) as a good
> fit for all use cases? Capability of systems differs widely in terms of what
> performance you get at say a util of 512. Or in other words how much work is
> done in a unit of time differs between system, but we still represent that work
> in a constant way. A task ran for 10ms on powerful System A would have done

PELT (util_avg) is uarch & frequency invariant.

So e.g. a task with util_avg = 256 could have a runtime/period

on big CPU (capacity = 1024) of 4ms/16ms

on little CPU (capacity = 512) of 8ms/16ms

The amount of work is invariant (so we can compare between asymmetric
capacity CPUs) but the runtime obviously differs according to the capacity.

[...]
Peter Zijlstra Sept. 8, 2023, 10:25 a.m. UTC | #18
On Fri, Sep 08, 2023 at 01:17:25AM +0100, Qais Yousef wrote:

> Just to be clear, my main issue here with the current hardcoded values of the
> 'margins'. And the fact they go too fast is my main problem.

So I stripped the whole margin thing from my reply because I didn't want
to comment on that yet, but yes, I can see how those might be a problem,
and you're changing them into something dynamic, not just removing them.

The tunables are what I worry about most. The moment we expose knobs it
becomes really hard to change things later.

> UTIL_EST_FASTER moves in one direction. And it's a constant response too, no?

The idea of UTIL_EST_FASTER was that we run a PELT sum on the current
activation runtime, i.e. all runtime since wakeup, and take the max of this
extra sum and the regular thing.

On top of that this extra PELT sum can/has a time multiplier and thus
ramps up faster (this multiplier could be per task). Nb.:

	util_est_fast = faster_est_approx(delta * 2);

is a state-less expression -- by making

	util_est_fast = faster_est_approx(delta * curr->se.faster_mult);

only the current task is affected.

> I didn't get the per-task configurability part. AFAIU we can't turn off these
> sched-features if they end up causing power issues. That what makes me hesitant
> about them. 

See above, the extra sum is (fundamentally) per task, and the multiplier
could be per task. If you set the multiplier to <=1, you'll never gain on
the existing sum, and the max filter means that the feature is
effectively disabled for that one task.

It of course gets us the problem of how to set the new multiplier... ;-)

> There's a bias towards perf. But some systems prefer to save power
> at the expense of perf. There's a lot of grey areas in between to what
> perceived as a suitable trade-off for perf vs power. There are cases like above
> where actually you can lower freqs without hit on perf. But most of the time
> it's a trade-off; and some do decide to drop perf in favour of power. Keep in
> mind battery capacity differs between systems with the same SoC even. Some ship
> to enable more perf, others are more constrained and opt to be more efficient.

It always depends on the workload too -- you want different trade-offs
for different tasks.

> > I'm *really* hesitant on adding all these mostly random knobs -- esp.
> > without strong justification -- which you don't present. You mostly seem
> > to justify things with: people do random hack, we should legitimize them
> > hacks.
> 
> I share your sentiment and I am trying to find out what's the right thing to do
> really. I am open to explore other territories. But from what I see there's
> a real need to give users the power to tune how responsive their system needs
> to be. I can't see how we can have one size that fits all here given the
> different capabilities of the systems and the desired outcome (I want more perf
> vs more efficiency).

This is true; but we also cannot keep adding random knobs. Knobs that
are very specific are hard constraints we've got to live with. Take for
instance uclamp, that's not something we can ever get rid of I think
(randomly picking on uclamp, not saying I'm hating on it).

From an actual interface POV, the unit-less generic energy-vs-perf knob
is of course ideal, one global and one per task and then we can fill out
the details as we see fit.  System integrators (you say users, but
really, not a single actual user will use any of this) can muck about
and see what works for them.

(even hardware has these things today, we get 0-255 values that do
'something' uarch specific)

> The problem is that those 0.8 and 1.25 margins forces unsuitable default. The
> case I see the most is it is causing wasting power that tuning it down regains
> this power at no perf cost or small one. Others actually do tune it for faster
> response, but I can't cover this case in detail. All I know is lower end
> systems do struggle as they don't have enough oomph. I also saw comparison on
> phoronix where schedutil is not doing as good still - which tells me it seems
> server systems do prefer to ramp up faster too. I think that PELT thread is
> a variation of the same problem.
> 
> So one of the things I saw is a workload where it spends majority of the time
> in 600-750 util_avg range. Rarely ramps up to max. But the workload under uses
> the medium cores and runs at a lot higher freqs than it really needs on bigs.
> We don't end up utilizing our resources properly.

So that is actually a fairly solid argument for changing things up, if
the margin causes us to neglect mid cores then that needs fixing. But I
don't think that means we need a tunable. After all, the system knows it
has 3 capacities, it just needs to be better at mapping workloads to
them.

It knows how much 'room' there is between a mid and a large. If
1.25*mid > large, we're in trouble, etc..

> There's a question that I'm struggling with if I may ask. Why is it perceived
> our constant response time (practically ~200ms to go from 0 to max) as a good
> fit for all use cases? Capability of systems differs widely in terms of what
> performance you get at say a util of 512. Or in other words how much work is
> done in a unit of time differs between system, but we still represent that work
> in a constant way. A task ran for 10ms on powerful System A would have done
> a lot more work than running on poor System B for the same 10ms. But util will
> still rise the same for both cases. If someone wants to allow this task to be
> able to do more on those 10ms, it seems natural to be able to control this
> response time. It seems this thinking is flawed for some reason and I'd
> appreciate a help to understand why. I think a lot of us perceive this problem
> this way.

I think part of the problem is that today's servers are tomorrow's
smartphones. Back when we started all this PELT nonsense, computers in
general were less powerful than they are now, yet today's servers are no
less busy than they were back then.

Give us compute, we'll fill it.

Now, smartphones in particular are media devices, but a large part of
the server farms are indirectly interactive too, you don't want your
search query to take too long, or your bookface page stuck loading, or
your twatter message about your latest poop not being insta read by your
mates.

That is, much of what we do with the computers, ever more powerful or
not, is eventually measured in human time perception.

So yeah, 200ms.

Remember, all this PELT nonsense was created for cgroups, to distribute
shares between CPUs as load demanded. I think for that purpose it still
sorta makes sense.

Ofc we've added a few more users over time, because if you have this
data, might as well use it etc. I'm not sure we really sat down and
analyzed if the timing all made sense.

And as I argued elsewhere, PELT is a running average, but DVFS might be
better suited with a max filter.

> Happy to go and try to dig more info if this is still not clear enough.

So I'm not generally opposed to changing things -- but I much prefer to
have an actual problem driving that change :-)
Daniel Bristot de Oliveira Sept. 8, 2023, 12:51 p.m. UTC | #19
On 9/7/23 15:45, Lukasz Luba wrote:
>>>> RT literatur mostly methinks. Replacing WCET with a statistical model of
>>>> sorts is not uncommon, the argument goes that not everybody will have
>>>> their worst case at the same time and lows and highs can commonly cancel
>>>> out and this way we can cram a little more on the system.
>>>>
>>>> Typically this is proposed in the context of soft-realtime systems.
>>>
>>> Thanks Peter, I will dive into some books...
>>
>> I would look at academic papers, not sure any of that ever made it to
>> books, Daniel would know I suppose.
> 
> Good hint, thanks!

The key-words that came to my mind are:

	- mk-firm, where you accept that m jobs will make their deadline
	           out of every k executions - like, because you run too long.
	- mixed criticality with pWCET (probabilistic worst-case execution time) or
		  average execution time + a sporadic tail execution time for
		  the low criticality part.

mk-firm smells like 2005's.. mixed criticality as 2015's..present.

You will probably find more papers than books. Read the papers
as a source for inspiration... not necessarily as a definitive
solution. They generally proposed too restrictive task models.

-- Daniel
Qais Yousef Sept. 8, 2023, 1:33 p.m. UTC | #20
On 09/08/23 12:25, Peter Zijlstra wrote:
> On Fri, Sep 08, 2023 at 01:17:25AM +0100, Qais Yousef wrote:
> 
> > Just to be clear, my main issue here with the current hardcoded values of the
> > 'margins'. And the fact they go too fast is my main problem.
> 
> So I stripped the whole margin thing from my reply because I didn't want
> to comment on that yet, but yes, I can see how those might be a problem,
> and you're changing them into something dynamic, not just removing them.

The main difficulty is that if you try to apply those patches on their own, I'm
sure you'll notice a difference. So if we were to take this alone and put them
on linux-next; I expect a few regression reports for those who run with
schedutil. Any ST oriented workload will not be happy. But if we compensate to
reduce the regression, my problem will re-appear, just for a different reason.
So whack-a-mole.

I didn't know how to make both happy without being dynamic, hence the RFC to
hopefully get some help and insights on how to resolve this. I think I'm
hovering around the right solutions, but not sure if I'm there yet. Some
implementation details certainly still need ironing out.

I genuinely think that we should be more conservative in adding those hardcoded
numbers without making them a function of a real limitation.

The TEO util threshold for instance has a similar problem to these margins.
I backported them to 5.10 and 5.15 and not long after I had to introduce knobs
to allow tuning them as power regression reports surfaced. The good news is it
wasn't a full revert; the bad news is those numbers seemed best for a class of
workloads on a particular system, but on another system and with different
workloads, the reality will be different. And of course because Android has out
of tree patches, I need to spend a good amount of time before I can report back
properly to ensure the root cause is identified correctly. I will risk
a potentially incorrect statement, but I do question the validity of
these hardcoded numbers on all systems and all workloads.

I am not sure we can avoid being dynamic; and personally I prefer to delegate
more to userspace and make it their problem to manage this dynamism. But by
providing the right tools of course :) I think they need to earn their best
perf/watt too; not let the kernel do all the dirty work, hehe.

> The tunables is what I worry most about. The moment we expose knobs it
> becomes really hard to change things later.

I'm not particularly attached to them to be honest. But at the same time I am
not sure if we can get away without giving the user some power to decide. What
I proposed seemed the most sensible way to do it. But I'm really open to
exploring alternatives and I do indeed need help to find them.

Generally, I think userspace expects too much automagic; the programming
model is ancient and not portable, and we end up overcompensating for that in
the kernel. So giving them some power is necessary the way I see it, but the
shape and form it should take is debatable for sure. I don't claim to have the
right answer but happy to explore and experiment to get the right ones
identified and done :-)

> 
> > UTIL_EST_FASTER moves in one direction. And it's a constant response too, no?
> 
> The idea of UTIL_EST_FASTER was that we run a PELT sum on the current
> activation runtime, all runtime since wakeup and take the max of this
> extra sum and the regular thing.
> 
> On top of that this extra PELT sum can/has a time multiplier and thus
> ramps up faster (this multiplies could be per task). Nb.:
> 
> 	util_est_fast = faster_est_approx(delta * 2);
> 
> is a state-less expression -- by making
> 
> 	util_est_fast = faster_est_approx(delta * curr->se.faster_mult);
> 
> only the current task is affected.

Okay; maybe I didn't understand this fully and will go back and study it more.

Maybe the word 'faster' is what makes me worried, as I really think faster is not
what people want on a class of systems; or at least on some CPUs if you think of
HMP. Taming the beast is a more difficult problem on this class of systems.

So if I get it correctly; we will slow things down by removing these margins,
but people who suffer from this slow down will need to use util_est_faster to
regain the difference, right?

> 
> > I didn't get the per-task configurability part. AFAIU we can't turn off these
> > sched-features if they end up causing power issues. That what makes me hesitant
> > about them. 
> 
> See above, the extra sum is (fundamentally) per task, the multiplier
> could be per task, if you set the multiplier to <=1, you'll never gain on
> the existing sum and the max filter makes that the feature is
> effectively disabled for the one task.

Gotcha. I think this could work, but it also seems to overlap with what we
can get already with uclamp. If we can tell this task needs a faster
multiplier, we can tell that it needs better uclamp_min and do that instead?
When should we use one over the other if we add both?

The challenging bit in practice is when we need to get some generic auto
response for all these workloads that just expect the system to give them what
they want without collaboration. I really do hope we can provide an alternative
to make these expectations obsolete and just be able to tell userspace your app
is not portable, go fix it; but we're not there yet.

And another selfish reason: analysing workloads is harder with these. We have
a lot of mechanisms on top of each other, and reasoning about the cause of
a power issue in particular becomes a lot harder when something goes wrong in
one of these and gets bubbled up in subtle ways. Perf issues tend to be more
obvious; but if something causes power or thermal problems, then finding out if
there's sub-optimality is hard. And if I find one, fixing it will be hard too.

> It of course gets us the problem of how to set the new multiplier... ;-)

I am actually trying to write a proposal for a generic QoS interface that we
can potentially plumb these things into (the main motivation is wake up latency
control with eevdf - but it seems you might be pushing something out soon). My
perception of reality is that userspace is stuck on an old programming model
and *a lot* of bad habits. But I think it is about time for it to get smarter
and more collaborative. That necessitates that we provide some mechanisms to
enable this smarter approach.

> 
> > There's a bias towards perf. But some systems prefer to save power
> > at the expense of perf. There's a lot of grey areas in between to what
> > perceived as a suitable trade-off for perf vs power. There are cases like above
> > where actually you can lower freqs without hit on perf. But most of the time
> > it's a trade-off; and some do decide to drop perf in favour of power. Keep in
> > mind battery capacity differs between systems with the same SoC even. Some ship
> > to enable more perf, others are more constrained and opt to be more efficient.
> 
> It always depends on the workload too -- you want different trade-offs
> for different tasks.

Indeed. We are trying to push for better classification of workloads so that we
can tell with reasonable confidence that a certain trade-off is better for them.
What this really helps with is enabling better use of resources with the
prior knowledge that the current user experience won't be impacted.

Again, I really would ultimately love to see userspace becoming smarter and
taking on the task of writing portable and scalable software that works across
systems without the need for guess work and hand tuning. I think we're making
good steps in that direction, but we still need a lot more effort.

> 
> > > I'm *really* hesitant on adding all these mostly random knobs -- esp.
> > > without strong justification -- which you don't present. You mostly seem
> > > to justify things with: people do random hack, we should legitimize them
> > > hacks.
> > 
> > I share your sentiment and I am trying to find out what's the right thing to do
> > really. I am open to explore other territories. But from what I see there's
> > a real need to give users the power to tune how responsive their system needs
> > to be. I can't see how we can have one size that fits all here given the
> > different capabilities of the systems and the desired outcome (I want more perf
> > vs more efficiency).
> 
> This is true; but we also cannot keep adding random knobs. Knobs that
> are very specific are hard constraints we've got to live with. Take for
> instance uclamp, that's not something we can ever get rid of I think
> (randomly picking on uclamp, not saying I'm hating on it).

I'm really open to exploring alternatives, but I need help to find them. I'm
also trying to simplify kernel responsibilities by delegating more to userspace.
It could be a personal mental hang-up, but I can't see how we can have one size
fit all. Almost all types of systems are expected to run a lot of varying
workloads, and both hardware and software are moving at a faster pace, but the
programming model is pretty much the same.

The response_time_ms in schedutil seemed a reasonable knob to me as it directly
tells the user how fast the policy ramps up. It can be set once at boot, or if
someone has knowledge about their workloads they can be smart and find the best
values for them on a particular system. The good news for us in the kernel is
that we won't care. uclamp for really smart per-task control, and this knob for
some hand tuning for those who don't have alternatives, is the way I see it.
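
Roughly what I have in mind, as a standalone sketch (this is not the code in
the series; the 200ms constant and the helper name are just for illustration):

#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024UL
#define PELT_RAMP_MS		200UL	/* rough default 0 -> max ramp time */

/* Scale the util seen by frequency selection by the requested response time. */
static unsigned long scale_util_by_response_time(unsigned long util,
						 unsigned long response_time_ms)
{
	if (!response_time_ms)
		return util;

	/* shorter requested response => boost util; longer => shrink it */
	util = util * PELT_RAMP_MS / response_time_ms;

	return util > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE : util;
}

int main(void)
{
	/* e.g. util 512 with a 100ms response time behaves like util 1024 */
	printf("%lu\n", scale_util_by_response_time(512, 100));
	return 0;
}

The point being that the knob maps directly to "how long until the policy
saturates" rather than to an opaque percentage.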

> 
> From an actual interface POV, the unit-less generic energy-vs-perf knob

I can see this working for mobile as SoC vendors/OEMs can get energy data for
their systems and define these curves properly.

But the average Joe will lose out. For example the M1 Mac mini doesn't actually
have an energy model defined. I do have an energy meter so I hope to be able to
do some measurements, but I'm not sure if I can get accurate numbers out.

x86 and other archs don't tend to produce energy-vs-perf curves as good as the
ones we tend to see in the mobile world (maybe they do and I'm just ignorant,
apologies if this ends up being a bad blanket statement).

Don't you think we could end up setting the bar too high for defining this knob?
It is less intuitive too, but this is less of a problem maybe.

> is of course ideal, one global and one per task and then we can fill out
> the details as we see fit.  System integrators (you say users, but

I can certainly look at that and it sounds reasonable to me, bar the issues
above about it requiring more effort; and a good class of Linux users might not
see these definitions on their systems as there's no real system integrator for
a large class of desktop/laptop systems. It'd be nice to make the programming
experience coherent and readily available, if possible. I think these systems
are losing out.

> really, not a single actual user will use any of this) can muck about
> and see what works for them.

Yes, I mean system integrator. I say users maybe because I think of desktops
too, where the integrator is the end user. I do hope to see more vendors
ship tuned Linux desktops/laptops like we see in the Android world. Servers
probably have an army of people managing them anyway.

> 
> (even hardware has these things today, we get 0-255 values that do
> 'something' uarch specific)

Ah, could I get some pointers please?

> 
> > The problem is that those 0.8 and 1.25 margins forces unsuitable default. The
> > case I see the most is it is causing wasting power that tuning it down regains
> > this power at no perf cost or small one. Others actually do tune it for faster
> > response, but I can't cover this case in detail. All I know is lower end
> > systems do struggle as they don't have enough oomph. I also saw comparison on
> > phoronix where schedutil is not doing as good still - which tells me it seems
> > server systems do prefer to ramp up faster too. I think that PELT thread is
> > a variation of the same problem.
> > 
> > So one of the things I saw is a workload where it spends majority of the time
> > in 600-750 util_avg range. Rarely ramps up to max. But the workload under uses
> > the medium cores and runs at a lot higher freqs than it really needs on bigs.
> > We don't end up utilizing our resources properly.
> 
> So that is actually a fairly solid argument for changing things up, if
> the margin causes us to neglect mid cores then that needs fixing. But I
> don't think that means we need a tunable. After all, the system knows it
> has 3 capacities, it just needs to be better at mapping workloads to
> them.

We can fix the misfit capacity without a tunable, I believe. I just know from
past discussions that those low end systems like these margins to be large. And
the PELT boot time parameter is there to help address this potential issue.
Happy to leave it out and leave it to someone who cares to come and complain.
But from a theoretical point of view I can see the problem of slow response on
those systems. And capacities don't tell us much about whether this is a high
end SoC or a lower end SoC. Nor does util or anything else we have in the
system today, to my knowledge at least.

> 
> It knows how much 'room' there is between a mid and a large. If 1.25*mid

Ideally we should end up distributing on mids and bigs for the capacity region
that overlaps.

I do see that the need to have the margin is related to misfit migration and
we can fix it by improving the definition of this relationship. I'm not sure if
I implemented it in the best way, but I think the definition I'm proposing
makes sense and removes guess work. If the task is at 600 and fits in both mids
and bigs, why should we skip the mids as a candidate if no misfit can happen
before the next tick? If the current implementation is expensive I think I can
make it cheaper. But if no misfit can happen within a tick, I think we need to
consider those CPUs as candidates.
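
As a standalone approximation of what I mean (assuming a 32ms PELT halflife and
saturation at 1024; this is not the exact code in the series):

#include <math.h>
#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024.0
#define PELT_HALFLIFE_MS	32.0

/* Project where util would be if the task kept running for runtime_ms. */
static double approx_util_after(double util, double runtime_ms)
{
	return SCHED_CAPACITY_SCALE -
	       (SCHED_CAPACITY_SCALE - util) * pow(0.5, runtime_ms / PELT_HALFLIFE_MS);
}

/* Only flag misfit if the projected util crosses the CPU's capacity. */
static int fits_until_next_tick(double util, double capacity, double tick_ms)
{
	return approx_util_after(util, tick_ms) < capacity;
}

int main(void)
{
	/* a util-600 task on a 768-capacity mid with a 4ms tick still fits */
	printf("%d\n", fits_until_next_tick(600, 768, 4));
	return 0;
}

With a 4ms tick a 600 task only projects to ~635 and still fits the mids; the
headroom falls out of the tick length instead of being a fixed 25%.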

On a slightly related problem that I avoided bringing up, but maybe now is
a good time: I see the definition of overutilized is stale too. It is a wrapper around
fits_capacity(), or misfit detection. It is very easy for a single busy task to
trigger overutilized. And if this task is background and capped by cpuset to
little cores, then we end up overutilized until it decides to go back to sleep.
Not ideal. I think the definition needs revisiting too, but I have no idea how
yet. It should be more of a function of the current system state rather than
tightly coupled with misfit detection. EAS is disabled when we're overutilized
and default spreading behavior can be expensive in terms of power.

> > large we in trouble etc..
> 
> > There's a question that I'm struggling with if I may ask. Why is it perceived
> > our constant response time (practically ~200ms to go from 0 to max) as a good
> > fit for all use cases? Capability of systems differs widely in terms of what
> > performance you get at say a util of 512. Or in other words how much work is
> > done in a unit of time differs between system, but we still represent that work
> > in a constant way. A task ran for 10ms on powerful System A would have done
> > a lot more work than running on poor System B for the same 10ms. But util will
> > still rise the same for both cases. If someone wants to allow this task to be
> > able to do more on those 10ms, it seems natural to be able to control this
> > response time. It seems this thinking is flawed for some reason and I'd
> > appreciate a help to understand why. I think a lot of us perceive this problem
> > this way.
> 
> I think part of the problem is that todays servers are tomorrow's
> smartphones. Back when we started all this PELT nonsense computers in
> general were less powerful than they are now, yet todays servers are no
> less busy than they were back then.
> 
> Give us compute, we'll fill it.

Hehe, yep!

> 
> Now, smartphones in particular are media devices, but a large part of
> the server farms are indirectly interactive too, you don't want your
> search query to take too long, or your bookface page stuck loading, or
> you twatter message about your latest poop not being insta read by your
> mates.
> 
> That is, much of what we do with the computers, ever more powerful or
> not, is eventually measured in human time perception.

Sadly, I do think a lot of workloads make bad assumptions about hardware and
kernel services, and the past trend has been to compensate for this in the
kernel; but the true problem IMHO is that our current programming model is
stale and programs are carrying old bad habits that are no longer valid.

As a simple example, a lot have struggled with HMP systems as they were assuming
'if I have X cores then I can spawn X threads and do my awesome parallel work'.

They of course got caught out badly. They used affinity later to be smart about
which cores, but then as noted earlier the bigs are power hungry and now they
can easily end up in power and thermal issues because the past assumptions are
no longer true.

By the way even the littles can be power hungry at top frequencies. So any form
of pinning done is causing problems. They just can't make assumptions. But what
to do instead then?

ADPF and uclamp is the way to address this and make portable software and
that's what being pushed for. But flushing these old habits out will take time.
Beside I think we still have ironing out work to be done.

Generally, even on desktop/laptop/server, programmers seem to think they're the
only active app and get greedy when creating tasks, ending up tripping over
themselves. We need a smart middleware to manage these; or a new programming
model to abstract these details. I don't know how, but the status quo is
that the programming model is lagging behind.

I think Windows and macOS/iOS do provide some more tightly integrated
interfaces for apps; see Grand Central Dispatch for instance on Apple OSes.

> 
> So yeah, 200ms.
> 
> Remember, all this PELT nonsense was created for cgroups, to distribute
> shared between CPUs as load demanded. I think for that purpose it still
> sorta makes sense.
> 
> Ofc we've added a few more users over time, because if you have this
> data, might as well use it etc. I'm not sure we really sat down and
> analyzed if the timing all made sense.

I think if we want to distill the problem to its basic form, it's a timing
issue. Too fast, we lose power. Too slow, we lose perf. And we don't have a way
to scale perf per system; i.e. we don't know what absolute perf we end up
getting, and I'm not sure if we can provide that at all without hardware
extensions. So that's why we end up scaling time, and end up with those related
knobs.

> 
> And as I argued elsewhere, PELT is a running average, but DVFS might be
> better suited with a max filter.

Sorry, I haven't caught up with all the other replies yet, but I will think
about how to incorporate all of that. I think the major issue is that we do
need to both speed up and slow down. And as long as we are able to achieve that
I'm really fine with exploring options. What I'm presenting here is what truly
seemed to me the best. But I need help and feedback to do better :-)

> 
> > Happy to go and try to dig more info if this is still not clear enough.
> 
> So I'm not generally opposed to changing things -- but I much prefer to
> have an actual problem driving that change :-)

Good to know. If you think the info I shared is still not good enough, I can
look for more examples. I think my main goal here is really to discuss the
problem, and my proposed solution is a way to demonstrate both the problem and
a possible way to fix it so I'm not just complaining, but actively looking for
fixes too :-) I don't claim to have all the right answers, but certainly happy
to follow this through to make sure we fix the problem properly. Hopefully not
just for me, but for all those who've been struggling with similar problems.


Thanks!

--
Qais Yousef
Peter Zijlstra Sept. 8, 2023, 1:58 p.m. UTC | #21
On Fri, Sep 08, 2023 at 02:33:36PM +0100, Qais Yousef wrote:

> > > UTIL_EST_FASTER moves in one direction. And it's a constant response too, no?
> > 
> > The idea of UTIL_EST_FASTER was that we run a PELT sum on the current
> > activation runtime, all runtime since wakeup and take the max of this
> > extra sum and the regular thing.
> > 
> > On top of that this extra PELT sum can/has a time multiplier and thus
> > ramps up faster (this multiplies could be per task). Nb.:
> > 
> > 	util_est_fast = faster_est_approx(delta * 2);
> > 
> > is a state-less expression -- by making
> > 
> > 	util_est_fast = faster_est_approx(delta * curr->se.faster_mult);
> > 
> > only the current task is affected.
> 
> Okay; maybe I didn't understand this fully and will go back and study it more.
> 
> Maybe the word faster is what makes me worried as I really see faster is not
> what people want on a class of systems; or at least CPUs if you think of HMP.
> Taming the beast is a more difficult problem in this class of systems.

The faster refers to the ramp-up, which was the issue identified in that
earlier thread. The game thing wanted to ramp up more aggressively.
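
As a standalone toy (not the kernel code) of why multiplying the runtime delta
ramps things up faster:

#include <stdio.h>
#include <math.h>

/* util (0..1024) after running continuously from idle, 32ms halflife */
static double pelt_ramp_from_zero(double runtime_ms, double mult)
{
	return 1024.0 * (1.0 - pow(0.5, runtime_ms * mult / 32.0));
}

int main(void)
{
	printf("10ms: mult=1 -> %.0f, mult=2 -> %.0f\n",
	       pelt_ramp_from_zero(10, 1), pelt_ramp_from_zero(10, 2));
	return 0;
}

And a multiplier <= 1 never beats the regular sum, so the max filter
effectively disables the extra sum for that task.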
Peter Zijlstra Sept. 8, 2023, 1:59 p.m. UTC | #22
On Fri, Sep 08, 2023 at 02:33:36PM +0100, Qais Yousef wrote:

> > (even hardware has these things today, we get 0-255 values that do
> > 'something' uarch specific)
> 
> Ah, could I get some pointers please?

Intel HWP.EPP and AMD CPPC EPP I think.. both intel-pstate and
amd-pstate have EPP thingies.
Qais Yousef Sept. 8, 2023, 2:07 p.m. UTC | #23
On 09/08/23 09:40, Dietmar Eggemann wrote:
> On 08/09/2023 02:17, Qais Yousef wrote:
> > On 09/07/23 15:08, Peter Zijlstra wrote:
> >> On Mon, Aug 28, 2023 at 12:31:56AM +0100, Qais Yousef wrote:
> 
> [...]
> 
> > But for the 0.8 and 1.25 margin problems, actually the problem is that 25% is
> > too aggressive/fast and wastes power. I'm actually slowing things down as
> > a result of this series. And I'm expecting some not to be happy about it on
> > their systems. The response_time_ms was my way to give back control. I didn't
> > see how I can make things faster and slower at the same time without making
> > decisions on behalf of the user/sysadmin.
> > 
> > So the connection I see between PELT and the margins or headrooms in
> > fits_capacity() and map_util_perf()/dvfs_headroom is that they expose the need
> > to manage the perf/power trade-off of the system.
> > 
> > Particularly the default is not good for the modern systems, Cortex-X is too
> > powerful but we still operate within the same power and thermal budgets.
> > 
> > And what was a high end A78 is a mid core today. So if you look at today's
> > mobile world topology we really have a tiy+big+huge combination of cores. The
> > bigs are called mids, but they're very capable. Fits capacity forces migration
> > to the 'huge' cores too soon with that 80% margin. While the 80% might be too
> > small for the tiny ones as some workloads really struggle there if they hang on
> > for too long. It doesn't help that these systems ship with 4ms tick. Something
> > more to consider changing I guess.
> 
> If this is the problem then you could simply make the margin (headroom)
> a function of cpu_capacity_orig?

I don't see what you mean. Instead of capacity_of(), but keeping the 80%?

Again, I could be delusional and misunderstanding everything, but what I really
see fits_capacity() is about is misfit detection. But a task is not really
misfit until it actually has a util above the capacity of the CPU. Now due to
implementation details there can be a delay between the task crossing this
capacity and being able to move it. Which is what I believe this headroom is
trying to achieve.

I think we can better define this by tying this headroom to the worst case
time it takes to actually move this misfit task to the right CPU. If it can
continue to run without being impacted by this delay and without crossing the
capacity of the CPU it is on, then we should not trigger misfit IMO.

> 
> [...]
> 
> > There's a question that I'm struggling with if I may ask. Why is it perceived
> > our constant response time (practically ~200ms to go from 0 to max) as a good
> > fit for all use cases? Capability of systems differs widely in terms of what
> > performance you get at say a util of 512. Or in other words how much work is
> > done in a unit of time differs between system, but we still represent that work
> > in a constant way. A task ran for 10ms on powerful System A would have done
> 
> PELT (util_avg) is uarch & frequency invariant.
> 
> So e.g. a task with util_avg = 256 could have a runtime/period
> 
> on big CPU (capacity = 1024) of 4ms/16ms
> 
> on little CPU (capacity = 512) of 8ms/16ms
> 
> The amount of work in invariant (so we can compare between asymmetric
> capacity CPUs) but the runtime obviously differs according to the capacity.

Yep!


Cheers

--
Qais Yousef
Qais Yousef Sept. 8, 2023, 2:11 p.m. UTC | #24
On 09/08/23 15:59, Peter Zijlstra wrote:
> On Fri, Sep 08, 2023 at 02:33:36PM +0100, Qais Yousef wrote:
> 
> > > (even hardware has these things today, we get 0-255 values that do
> > > 'something' uarch specific)
> > 
> > Ah, could I get some pointers please?
> 
> Intel HWP.EPP and AMD CPPC EPP I think.. both intel-pstate and
> amd-pstate have EPP thingies.

Okay, thanks!

So do you see tying this to the presence of some hardware mechanisms, and
providing a fallback for the other systems to define it somehow, as the best
way to explore this?


Cheers

--
Qais Yousef
Qais Yousef Sept. 10, 2023, 6:14 p.m. UTC | #25
On 09/07/23 08:48, Lukasz Luba wrote:

> They are periodic in a sense, they wake up every 16ms, but sometimes
> they have more work. It depends what is currently going in the game
> and/or sometimes the data locality (might not be in cache).
> 
> Although, that's for games, other workloads like youtube play or this
> one 'Yahoo browser' (from your example) are more 'predictable' (after
> the start up period). And I really like the potential energy saving
> there :)

It is more complicated than that from what I've seen. Userspace is sadly
bloated and the relationship between the tasks is a lot more complex. They
talk to other framework elements, other hardware, have network elements coming
in, and specifically for gaming, could be preparing multiple frames in
parallel. The task wake up and sleep times are not that periodic. A task can
busy loop for periods of time, or wake up for short periods of time (the
pattern of which might not be on point as it interacts with other elements in
a serial manner, where one prepares something and can take a variable time
every wake up before handing it over to the next task).

Browsers can be tricky as well: when the user scrolls, which elements appear,
what JavaScript executes and how heavy it is all vary, as does how many tabs
have webpages open and how the user moves between them.

It is organized chaos :-)

> 
> > 
> > I think the model of a periodic task is not suitable for most workloads. All
> > of them are dynamic and how much they need to do at each wake up can very
> > significantly over 10s of ms.
> 
> Might be true, the model was built a few years ago when there wasn't
> such dynamic game scenario with high FPS on mobiles. This could still
> be tuned with your new design IIUC (no need extra hooks in Android).

It is my perception of course. But I think generally, not just for gaming,
there are a lot of elements interacting with each other in a complex way.
The wake up time and length are determined by these complex interactions; and
it is a very dynamic interaction where they could get into a steady state for
a very short period of time but could change quickly. As an extreme example,
a player could be standing in an empty room doing nothing, but another player
in another part of the world launches a rocket at this room, and we only get
to know that we have to draw a big explosion when the network packet arrives.

A lot of workloads are interactive and these moments of interactions are the
challenging ones.

> 
> > 
> > > 2. Plumb in this new idea of dvfs_update_delay as the new
> > >     'margin' - this I don't understand
> > > 
> > > For the 2. I don't see that the dvfs HW characteristics are best
> > > for this problem purpose. We can have a really fast DVFS HW,
> > > but we need some decent spare idle time in some workloads, which
> > > are two independent issues IMO. You might get the higher
> > > idle time thanks to 1.1. but this is a 'side effect'.
> > > 
> > > Could you explain a bit more why this dvfs_update_delay is
> > > crucial here?
> > 
> > I'm not sure why you relate this to idle time. And the word margin is a bit
> > overloaded here. so I suppose you're referring to the one we have in
> > map_util_perf() or apply_dvfs_headroom(). And I suppose you assume this extra
> > headroom will result in idle time, but this is not necessarily true IMO.
> > 
> > My rationale is simply that DVFS based on util should follow util_avg as-is.
> > But as pointed out in different discussions happened elsewhere, we need to
> > provide a headroom for this util to grow as if we were to be exact and the task
> > continues to run, then likely the util will go above the current OPP before we
> > get a chance to change it again. If we do have an ideal hardware that takes
> 
> Yes, this is another requirement to have +X% margin. When the tasks are
> growing, we don't know their final util_avg and we give them a bit more
> cycles.
> IMO we have to be ready always for such situation in the scheduler,
> haven't we?

Yes we should. I think I am not ignoring this part. Hope I clarified things
offline.


Cheers

--
Qais Yousef
Qais Yousef Sept. 10, 2023, 6:20 p.m. UTC | #26
On 09/07/23 13:53, Peter Zijlstra wrote:
> On Thu, Sep 07, 2023 at 08:48:08AM +0100, Lukasz Luba wrote:
> 
> > > Hehe. That's because they're not really periodic ;-)
> > 
> > They are periodic in a sense, they wake up every 16ms, but sometimes
> > they have more work. It depends what is currently going in the game
> > and/or sometimes the data locality (might not be in cache).
> > 
> > Although, that's for games, other workloads like youtube play or this
> > one 'Yahoo browser' (from your example) are more 'predictable' (after
> > the start up period). And I really like the potential energy saving
> > there :)
> 
> So everything media is fundamentally periodic, you're hard tied to the
> framerate / audio-buffer size etc..
> 
> Also note that the traditional periodic task model from the real-time
> community has the notion of WCET, which completely covers this
> fluctuation in frame-to-frame work, it only considers the absolute worst
> case.
> 
> Now, practically, that stinks, esp. when you care about batteries, but
> it does not mean these tasks are not periodic.

piecewise periodic?

> Many extentions to the periodic task model are possible, including
> things like average runtime with bursts etc.. all have their trade-offs.

The challenge we have is the endless number of workloads we need to cater for..
Or do you think one of these models can actually scale to that?


Thanks!

--
Qais Yousef
Qais Yousef Sept. 10, 2023, 6:46 p.m. UTC | #27
On 09/07/23 15:26, Peter Zijlstra wrote:
> On Wed, Sep 06, 2023 at 10:18:50PM +0100, Qais Yousef wrote:
> 
> > This is probably controversial statement. But I am not in favour of util_est.
> > I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as
> > default instead. But I will need to do a separate investigation on that.
> 
> I think util_est makes perfect sense, where PELT has to fundamentally

My concern about it is that it has an inherent bias towards perf. In the soup of
tasks running in the system, not all of them care about perf; the key tasks
tend to be the minority, I'd say. Again, I need to do more investigations but
my worry mainly comes from that and what impact it could have on power.

In an ideal world where userspace is fully uclamp aware, we shouldn't need it.
The task can set uclamp_min to make sure it sees the right performance at wake
up.

And depending on the outcome of this discussion, we might need to introduce
something to help speed up/slow down migration more selectively. So it could
become a redundant control.

> decay non-running / non-runnable tasks in order to provide a temporal
> average, DVFS might be best served with a termporal max filter.

Ah, I certainly don't think DVFS needs PELT HALFLIFE to be controlled. I only
see it being useful on HMP systems, under-powered ones specifically, that really
need faster *migration* times. Maybe we can find a better way to control this.
I'll think about it.

Not sure about the temporal max. Isn't this a bias towards perf first too?


Thanks!

--
Qais Yousef
Qais Yousef Sept. 10, 2023, 7:06 p.m. UTC | #28
On 09/07/23 15:42, Lukasz Luba wrote:
> 
> 
> On 9/7/23 15:29, Peter Zijlstra wrote:
> > On Thu, Sep 07, 2023 at 02:57:26PM +0100, Lukasz Luba wrote:
> > > 
> > > 
> > > On 9/7/23 14:26, Peter Zijlstra wrote:
> > > > On Wed, Sep 06, 2023 at 10:18:50PM +0100, Qais Yousef wrote:
> > > > 
> > > > > This is probably controversial statement. But I am not in favour of util_est.
> > > > > I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as
> > > > > default instead. But I will need to do a separate investigation on that.
> > > > 
> > > > I think util_est makes perfect sense, where PELT has to fundamentally
> > > > decay non-running / non-runnable tasks in order to provide a temporal
> > > > average, DVFS might be best served with a termporal max filter.
> > > > 
> > > > 
> > > 
> > > Since we are here...
> > > Would you allow to have a configuration for
> > > the util_est shifter: UTIL_EST_WEIGHT_SHIFT ?
> > > 
> > > I've found other values than '2' better in some scenarios. That helps
> > > to prevent a big task to 'down' migrate from a Big CPU (1024) to some
> > > Mid CPU (~500-700 capacity) or even Little (~120-300).
> > 
> > Larger values, I'm thinking you're after? Those would cause the new
> > contribution to weight less, making the function more smooth, right?
> 
> Yes, more smooth, because we only use the 'ewma' goodness for decaying
> part (not the raising [1]).
> 
> > 
> > What task characteristic is tied to this? That is, this seems trivial to
> > modify per-task.
> 
> In particular Speedometer test and the main browser task, which reaches
> ~900util, but sometimes vanish and waits for other background tasks
> to do something. In the meantime it can decay and wake-up on
> Mid/Little (which can cause a penalty to score up to 5-10% vs. if
> we pin the task to big CPUs). So, a longer util_est helps to avoid
> at least very bad down migration to Littles...

Warning, this is not a global win! We do want tasks in general to downmigrate
when they sleep. Would be great to avoid biasing towards perf first by default
to fix these special cases.

As I mentioned in another reply, there's a perf/power/thermal impact of these
decisions and it's not a global win. Some might want this to improve their
scores, others might not want that and would rather get the worse score but
keep their power budget in check. And it will highly depend on the workload and
the system, and we can't test every one of them :(

We did give the power to userspace via uclamp, which should make this problem
fixable. And this is readily available. We basically can't know in the kernel
when one way is better than the other without being told explicitly IMHO.

If you try to boot with a faster PELT HALFLIFE, would this give you the same
perf/power trade-off?


Thanks

--
Qais Yousef

> 
> [1] https://elixir.bootlin.com/linux/v6.5.1/source/kernel/sched/fair.c#L4442
Qais Yousef Sept. 10, 2023, 9:17 p.m. UTC | #29
On 09/08/23 14:33, Qais Yousef wrote:
> On 09/08/23 12:25, Peter Zijlstra wrote:
> > On Fri, Sep 08, 2023 at 01:17:25AM +0100, Qais Yousef wrote:
> > 
> > > Just to be clear, my main issue here with the current hardcoded values of the
> > > 'margins'. And the fact they go too fast is my main problem.
> > 
> > So I stripped the whole margin thing from my reply because I didn't want
> > to comment on that yet, but yes, I can see how those might be a problem,
> > and you're changing them into something dynamic, not just removing them.
> 
> The main difficulty is that if you try to apply those patches on their own, I'm
> sure you'll notice a difference. So if we were to take this alone and put them
> on linux-next; I expect a few regression reports for those who run with
> schedutil. Any ST oriented workload will not be happy. But if we compensate to
> reduce the regression, my problem will re-appear, just for a different reason.
> So whack-a-mole.

Sorry I just realized that the dynamic thing was about the margin, not the new
knob.

My answer above still holds to some extent. But yes, I meant to write that I'm
removing magic hardcoded numbers from the margins.


Cheers

--
Qais Yousef
Lukasz Luba Sept. 12, 2023, 11:51 a.m. UTC | #30
Hi Peter,

On 9/7/23 21:16, Peter Zijlstra wrote:
> On Thu, Sep 07, 2023 at 03:42:13PM +0100, Lukasz Luba wrote:
> 
>>> What task characteristic is tied to this? That is, this seems trivial to
>>> modify per-task.
>>
>> In particular Speedometer test and the main browser task, which reaches
>> ~900util, but sometimes vanish and waits for other background tasks
>> to do something. In the meantime it can decay and wake-up on
>> Mid/Little (which can cause a penalty to score up to 5-10% vs. if
>> we pin the task to big CPUs). So, a longer util_est helps to avoid
>> at least very bad down migration to Littles...
> 
> Do they do a few short activations (wakeup/sleeps) while waiting? That
> would indeed completely ruin things since the EWMA thing is activation
> based.
> 
> I wonder if there's anything sane we can do here...

My apologies for the delay, I have tried to push the graphs for you.

The experiment is on pixel7*. It's running the browser on the phone
with the test 'Speedometer 2.0'. It's a web test (you can also run it on
your phone), available here, no need to install anything:
https://browserbench.org/Speedometer2.0/

Here is the Jupyter notebook [1], with plots of the signals:
- top 20 tasks' (based on runtime) utilization
- Util EST signals for the top 20 tasks, with the longer decaying ewma
   filter (which is the 'red' plot called 'ewma')
- the main task (comm=CrRendererMain) Util, Util EST and task residency
   (which tries to stick to CPUs 6,7* )
- the test score was 144.6 (while with the fast decay ewma it is ~134), so
   staying on the big cpus helps the score in this case

(the plots are interactive, you can zoom in with the icon 'Box Zoom')
(e.g. you can zoom in the task activation plot which is also linked
with the 'Util EST' on top, for this main task)

You can see the util signal of that 'CrRendererMain' task and those
utilization drops in time, which I was referring to. When the util
drops below some threshold, the task might 'fit' into a smaller CPU,
which could be prevented automatically by maintaining the util est
for longer (but not for all tasks).

I do like your idea that Util EST might be per-task. I'm going to
check this, because that might help to get rid of the overutilized state
which is probably because small tasks are also 'bigger' for longer.

If this util est change has a chance to fly upstream, I could send an RFC if
you don't mind.

Regards,
Lukasz

*CPUs 6,7 - big (1024 capacity), CPUs 4,5 Mid (768 capacity), CPUs 0-3
Littles (~150 capacity)

[1] 
https://nbviewer.org/github/lukaszluba-arm/lisa/blob/public_tests/p7_wa_speedometer2_small_size.ipynb
Lukasz Luba Sept. 12, 2023, 11:57 a.m. UTC | #31
Hi Daniel,

On 9/8/23 13:51, Daniel Bristot de Oliveira wrote:
> On 9/7/23 15:45, Lukasz Luba wrote:
>>>>> RT literatur mostly methinks. Replacing WCET with a statistical model of
>>>>> sorts is not uncommon, the argument goes that not everybody will have
>>>>> their worst case at the same time and lows and highs can commonly cancel
>>>>> out and this way we can cram a little more on the system.
>>>>>
>>>>> Typically this is proposed in the context of soft-realtime systems.
>>>>
>>>> Thanks Peter, I will dive into some books...
>>>
>>> I would look at academic papers, not sure any of that ever made it to
>>> books, Daniel would know I suppose.
>>
>> Good hint, thanks!
> 
> The key-words that came to my mind are:
> 
> 	- mk-firm, where you accept m tasks will make their deadline
> 	           every k execution - like, because you run too long.
> 	- mixed criticality with pWCET (probabilistic execution time) or
> 		  average execution time + an sporadic tail execution time for
> 		  the low criticality part.
> 
> mk-firm smells like 2005's.. mixed criticality as 2015's..present.
> 
> You will probably find more papers than books. Read the papers
> as a source for inspiration... not necessarily as a definitive
> solution. They generally proposed too restrictive task models.
> 
> -- Daniel
> 

Thanks for describing this context! That will save me time and maybe avoid
sinking in these unknown waters. As you said, I might treat that
as inspiration, since I'm not fighting with a life-critical system,
but a phone which needs a 'nice user experience' (hopefully there are
no people who disagree) ;)

Regards,
Lukasz
Vincent Guittot Sept. 12, 2023, 2:01 p.m. UTC | #32
Hi Lukasz,

On Tue, 12 Sept 2023 at 13:51, Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi Peter,
>
> On 9/7/23 21:16, Peter Zijlstra wrote:
> > On Thu, Sep 07, 2023 at 03:42:13PM +0100, Lukasz Luba wrote:
> >
> >>> What task characteristic is tied to this? That is, this seems trivial to
> >>> modify per-task.
> >>
> >> In particular Speedometer test and the main browser task, which reaches
> >> ~900util, but sometimes vanish and waits for other background tasks
> >> to do something. In the meantime it can decay and wake-up on
> >> Mid/Little (which can cause a penalty to score up to 5-10% vs. if
> >> we pin the task to big CPUs). So, a longer util_est helps to avoid
> >> at least very bad down migration to Littles...
> >
> > Do they do a few short activations (wakeup/sleeps) while waiting? That
> > would indeed completely ruin things since the EWMA thing is activation
> > based.
> >
> > I wonder if there's anything sane we can do here...
>
> My apologies for a delay, I have tried to push the graphs for you.
>
> The experiment is on pixel7*. It's running the browser on the phone
> with the test 'Speedometer 2.0'. It's a web test (you can also run on
> your phone) available here, no need to install anything:
> https://browserbench.org/Speedometer2.0/
>
> Here is the Jupiter notebook [1], with plots of the signals:
> - top 20 tasks' (based on runtime) utilization
> - Util EST signals for the top 20 tasks, with the longer decaying ewma
>    filter (which is the 'red' plot called 'ewma')
> - the main task (comm=CrRendererMain) Util, Util EST and task residency
>    (which tires to stick to CPUs 6,7* )
> - the test score was 144.6 (while with fast decay ewma is ~134), so
>    staying at big cpus (helps the score in this case)
>
> (the plots are interactive, you can zoom in with the icon 'Box Zoom')
> (e.g. you can zoom in the task activation plot which is also linked
> with the 'Util EST' on top, for this main task)
>
> You can see the util signal of that 'CrRendererMain' task and those
> utilization drops in time, which I was referring to. When the util
> drops below some threshold, the task might 'fit' into smaller CPU,
> which could be prevented automatically byt maintaining the util est
> for longer (but not for all).

I was looking at your nice chart and I wonder if you could also add
the runnable_avg of the tasks?

My 1st impression is that the decrease happens when your task starts
to share the CPU with some other tasks, and this ends up with a
decrease of its utilization because util_avg doesn't take into account
the waiting time. So typically a task with a utilization of 1024 will
see its utilization decrease because of other tasks running on the
same cpu. This would explain the drop that you can see.

I wonder if we should not take into account the runnable_avg when
applying the ewma on util_est, so the util_est will not decrease
because of time sharing with other tasks.
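
Something along these lines maybe (completely untested sketch; the contention
check and the names are arbitrary):

/*
 * Untested sketch: skip the downward EWMA step of util_est when the task
 * spent much of the window runnable but not running, since the lower
 * util_avg then reflects CPU contention rather than less demand.
 */
static unsigned long util_est_ewma_update(unsigned long ewma,
					  unsigned long dequeued_util,
					  unsigned long runnable_avg,
					  unsigned long util_avg)
{
	/* contended: util_avg under-reports the task, keep the old estimate */
	if (dequeued_util < ewma && runnable_avg > 2 * util_avg)
		return ewma;

	/* otherwise the usual ewma += (new - ewma) / 4 */
	return ewma + ((long)dequeued_util - (long)ewma) / 4;
}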

>
> I do like your idea that Util EST might be per-task. I'm going to
> check this, because that might help to get rid of the overutilized state
> which is probably because small tasks are also 'bigger' for longer.
>
> If this util est have chance to fly upstream, I could send an RFC if
> you don't mind.
>
> Regards,
> Lukasz
>
> *CPUs 6,7 - big (1024 capacity), CPUs 4,5 Mid (768 capacity), CPUs 0-3
> Littles (~150 capacity)
>
> [1]
> https://nbviewer.org/github/lukaszluba-arm/lisa/blob/public_tests/p7_wa_speedometer2_small_size.ipynb
Peter Zijlstra Sept. 12, 2023, 2:28 p.m. UTC | #33
On Tue, Sep 12, 2023 at 12:51:52PM +0100, Lukasz Luba wrote:

> You can see the util signal of that 'CrRendererMain' task and those
> utilization drops in time, which I was referring to. When the util
> drops below some threshold, the task might 'fit' into smaller CPU,
> which could be prevented automatically byt maintaining the util est
> for longer (but not for all).

Right, so right at those util_est dips it has some small activations.
It's like a poll loop or something instead of a full block waiting for
things to happen.

And yeah, that'll destroy util_est in a hurry :/

> I do like your idea that Util EST might be per-task. I'm going to
> check this, because that might help to get rid of the overutilized state
> which is probably because small tasks are also 'bigger' for longer.
> 
> If this util est have chance to fly upstream, I could send an RFC if
> you don't mind.

The biggest stumbling block I see is the user interface; some generic
QoS hints based thing that allows us to do random things -- like tune
the above might do, dunno.
Dietmar Eggemann Sept. 12, 2023, 5:18 p.m. UTC | #34
On 08/09/2023 16:07, Qais Yousef wrote:
> On 09/08/23 09:40, Dietmar Eggemann wrote:
>> On 08/09/2023 02:17, Qais Yousef wrote:
>>> On 09/07/23 15:08, Peter Zijlstra wrote:
>>>> On Mon, Aug 28, 2023 at 12:31:56AM +0100, Qais Yousef wrote:

[...]

>>> And what was a high end A78 is a mid core today. So if you look at today's
>>> mobile world topology we really have a tiy+big+huge combination of cores. The
>>> bigs are called mids, but they're very capable. Fits capacity forces migration
>>> to the 'huge' cores too soon with that 80% margin. While the 80% might be too
>>> small for the tiny ones as some workloads really struggle there if they hang on
>>> for too long. It doesn't help that these systems ship with 4ms tick. Something
>>> more to consider changing I guess.
>>
>> If this is the problem then you could simply make the margin (headroom)
>> a function of cpu_capacity_orig?
> 
> I don't see what you mean. instead of capacity_of() but keep the 80%?
> 
> Again, I could be delusional and misunderstanding everything, but what I really
> see fits_capacity() is about is misfit detection. But a task is not really
> misfit until it actually has a util above the capacity of the CPU. Now due to
> implementation details there can be a delay between the task crossing this
> capacity and being able to move it. Which what I believe this headroom is
> trying to achieve.
> 
> I think we can better define this by tying this headroom to the worst case
> scenario it takes to actually move this misfit task to the right CPU. If it can
> continue to run without being impacted with this delay and crossing the
> capacity of the CPU it is on, then we should not trigger misfit IMO.


Instead of:

  fits_capacity(unsigned long util, unsigned long capacity)

      return approximate_util_avg(util, TICK_USEC) < capacity;

just make 1280 in:

  #define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)

dependent on cpu's capacity_orig or the capacity diff to the next higher
capacity_orig.

Typical example today: {little-medium-big capacity_orig} = {128, 896, 1024}

896÷128 = 7

1024/896 = 1.14

to achieve higher margin on little and lower margin on medium.

[...]
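
I.e. something like this (illustrative only; in practice the ratio would
probably need clamping, the hypothetical next_capacity_orig is the next
higher capacity_orig in the system):

static inline int fits_capacity_scaled(unsigned long util,
				       unsigned long capacity,
				       unsigned long capacity_orig,
				       unsigned long next_capacity_orig)
{
	/* e.g. 1024*1024/896 ~= 1170 for the mediums, 896*1024/128 = 7168 for the littles */
	unsigned long margin = next_capacity_orig * 1024 / capacity_orig;

	return util * margin < capacity * 1024;
}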
Lukasz Luba Sept. 13, 2023, 9:53 a.m. UTC | #35
Hi Vincent,

On 9/12/23 15:01, Vincent Guittot wrote:
> Hi Lukasz,
> 
> On Tue, 12 Sept 2023 at 13:51, Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> Hi Peter,
>>
>> On 9/7/23 21:16, Peter Zijlstra wrote:
>>> On Thu, Sep 07, 2023 at 03:42:13PM +0100, Lukasz Luba wrote:
>>>
>>>>> What task characteristic is tied to this? That is, this seems trivial to
>>>>> modify per-task.
>>>>
>>>> In particular Speedometer test and the main browser task, which reaches
>>>> ~900util, but sometimes vanish and waits for other background tasks
>>>> to do something. In the meantime it can decay and wake-up on
>>>> Mid/Little (which can cause a penalty to score up to 5-10% vs. if
>>>> we pin the task to big CPUs). So, a longer util_est helps to avoid
>>>> at least very bad down migration to Littles...
>>>
>>> Do they do a few short activations (wakeup/sleeps) while waiting? That
>>> would indeed completely ruin things since the EWMA thing is activation
>>> based.
>>>
>>> I wonder if there's anything sane we can do here...
>>
>> My apologies for a delay, I have tried to push the graphs for you.
>>
>> The experiment is on pixel7*. It's running the browser on the phone
>> with the test 'Speedometer 2.0'. It's a web test (you can also run on
>> your phone) available here, no need to install anything:
>> https://browserbench.org/Speedometer2.0/
>>
>> Here is the Jupiter notebook [1], with plots of the signals:
>> - top 20 tasks' (based on runtime) utilization
>> - Util EST signals for the top 20 tasks, with the longer decaying ewma
>>     filter (which is the 'red' plot called 'ewma')
>> - the main task (comm=CrRendererMain) Util, Util EST and task residency
>>     (which tires to stick to CPUs 6,7* )
>> - the test score was 144.6 (while with fast decay ewma is ~134), so
>>     staying at big cpus (helps the score in this case)
>>
>> (the plots are interactive, you can zoom in with the icon 'Box Zoom')
>> (e.g. you can zoom in the task activation plot which is also linked
>> with the 'Util EST' on top, for this main task)
>>
>> You can see the util signal of that 'CrRendererMain' task and those
>> utilization drops in time, which I was referring to. When the util
>> drops below some threshold, the task might 'fit' into smaller CPU,
>> which could be prevented automatically byt maintaining the util est
>> for longer (but not for all).
> 
> I was looking at your nice chart and I wonder if you could also add
> the runnable _avg of the tasks ?

Yes, I will try today or tomorrow to add such plots as well.

> 
> My 1st impression is that the decrease happens when your task starts
> to share the CPU with some other tasks and this ends up with a
> decrease of its utilization because util_avg doesn't take into account
> the waiting time so typically task with an utilization of 1024, will
> see its utilization decrease because of other tasks running on the
> same cpu. This would explain the drop that you can see.
> 
>   I wonder if we should not take into account the runnable_avg when
> applying the ewm on util_est ? so the util_est will not decrease
> because of time sharing with other

Yes, that sounds like a good idea. Let me provide those plots so we can
go further with the analysis. I will try to capture whether that happens
to that particular task on the CPU (and if there are some others as well).


Thanks for jumping in to the discussion!

Lukasz
Qais Yousef Sept. 16, 2023, 7:38 p.m. UTC | #36
On 09/12/23 19:18, Dietmar Eggemann wrote:
> On 08/09/2023 16:07, Qais Yousef wrote:
> > On 09/08/23 09:40, Dietmar Eggemann wrote:
> >> On 08/09/2023 02:17, Qais Yousef wrote:
> >>> On 09/07/23 15:08, Peter Zijlstra wrote:
> >>>> On Mon, Aug 28, 2023 at 12:31:56AM +0100, Qais Yousef wrote:
> 
> [...]
> 
> >>> And what was a high end A78 is a mid core today. So if you look at today's
> >>> mobile world topology we really have a tiy+big+huge combination of cores. The
> >>> bigs are called mids, but they're very capable. Fits capacity forces migration
> >>> to the 'huge' cores too soon with that 80% margin. While the 80% might be too
> >>> small for the tiny ones as some workloads really struggle there if they hang on
> >>> for too long. It doesn't help that these systems ship with 4ms tick. Something
> >>> more to consider changing I guess.
> >>
> >> If this is the problem then you could simply make the margin (headroom)
> >> a function of cpu_capacity_orig?
> > 
> > I don't see what you mean. instead of capacity_of() but keep the 80%?
> > 
> > Again, I could be delusional and misunderstanding everything, but what I really
> > see fits_capacity() is about is misfit detection. But a task is not really
> > misfit until it actually has a util above the capacity of the CPU. Now due to
> > implementation details there can be a delay between the task crossing this
> > capacity and being able to move it. Which what I believe this headroom is
> > trying to achieve.
> > 
> > I think we can better define this by tying this headroom to the worst case
> > scenario it takes to actually move this misfit task to the right CPU. If it can
> > continue to run without being impacted with this delay and crossing the
> > capacity of the CPU it is on, then we should not trigger misfit IMO.
> 
> 
> Instead of:
> 
>   fits_capacity(unsigned long util, unsigned long capacity)
> 
>       return approximate_util_avg(util, TICK_USEC) < capacity;
> 
> just make 1280 in:
> 
>   #define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
> 
> dependent on cpu's capacity_orig or the capacity diff to the next higher
> capacity_orig.
> 
> Typical example today: {little-medium-big capacity_orig} = {128, 896, 1024}
> 
> 896÷128 = 7
> 
> 1024/896 = 1.14
> 
> to achieve higher margin on little and lower margin on medium.

I am not keen on this personally. These numbers seem random to me, and
why they help (or don't help) is not clear to me at least.

I do believe that the only reason why we want to move before a task's util
crosses the capacity of the CPU is tied to the misfit load balance being able
to move the task. Because until the task crosses the capacity, it is
getting its computational demand per our PELT representation. But since load
balance is not an immediate action (especially on our platforms where it is
4ms, something I hope we can change), we need to preemptively exclude the CPU
when we know the task will get 'stuck' on this CPU and not get its
computational demand (as per our representation of course).

I think this removes all guess work and provides a very meaningful decision
making process that I think will scale transparently so we utilize our
resources the best we can.

We can probably optimize the code to avoid the call to approximate_util_avg()
if this is a problem.

Why do you think the ratio of cpu capacities gives a more meaningful method to
judge misfit?


Thanks!

--
Qais Yousef