[4/8] cpufreq/schedutil: sysfs capacity margin tunable

Message ID 1457932932-28444-5-git-send-email-mturquette+renesas@baylibre.com (mailing list archive)
State RFC, archived

Commit Message

Michael Turquette March 14, 2016, 5:22 a.m. UTC
With the addition of the global cfs capacity margin helpers in patch,
"sched/cpufreq: new cfs capacity margin helpers", we can now export
sysfs tunables from the schedutil governor. This allows privileged users
to tune the value more easily.

The margin value is global to cfs, not per-policy. As such schedutil
does not store any state about the margin. Schedutil restores the margin
value to its default value when exiting.

Signed-off-by: Michael Turquette <mturquette+renesas@baylibre.com>
---
 drivers/cpufreq/cpufreq_schedutil.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

Comments

Peter Zijlstra March 15, 2016, 9:20 p.m. UTC | #1
On Sun, Mar 13, 2016 at 10:22:08PM -0700, Michael Turquette wrote:
> With the addition of the global cfs capacity margin helpers in patch,
> "sched/cpufreq: new cfs capacity margin helpers", we can now export
> sysfs tunables from the schedutil governor. This allows privileged users
> to tune the value more easily.
> 
> The margin value is global to cfs, not per-policy. As such schedutil
> does not store any state about the margin. Schedutil restores the margin
> value to its default value when exiting.

Yuck sysfs.. I would really rather we did not expose this per default.
And certainly not in this weird form.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra March 15, 2016, 9:48 p.m. UTC | #2
On Tue, Mar 15, 2016 at 02:40:43PM -0700, Michael Turquette wrote:
> Quoting Peter Zijlstra (2016-03-15 14:20:47)
> > On Sun, Mar 13, 2016 at 10:22:08PM -0700, Michael Turquette wrote:
> > > With the addition of the global cfs capacity margin helpers in patch,
> > > "sched/cpufreq: new cfs capacity margin helpers", we can now export
> > > sysfs tunables from the schedutil governor. This allows privileged users
> > > to tune the value more easily.
> > > 
> > > The margin value is global to cfs, not per-policy. As such schedutil
> > > does not store any state about the margin. Schedutil restores the margin
> > > value to its default value when exiting.
> > 
> > Yuck sysfs.. I would really rather we did not expose this per default.
> > And certainly not in this weird form.
> 
> I'm happy to change capacity_margin to up_threshold and use a
> percentage.
> 
> The sysfs approach has two benefits. First, it is aligned with cpufreq
> user expectations. Second, there has been rough consensus that this
> value should be tunable and sysfs gets us there quickly and painlessly.
> We're already exporting rate_limit_us for schedutil via sysfs. Is there
> a better interface you can recommend?

It really depends on how tunable you want this to be. Do we always want
this to be a tunable, or just now while we're playing about with the
whole thing?

The problem with exposing it in sysfs is that you cannot take it out
again, it becomes ABI.

What we do for all the scheduler tunables (pretty much every time we
have to take a value out of thin air), is we make them const for
!SCHED_DEBUG builds, but have them as sysctl for SCHED_DEBUG builds
(although we should probably move them to /debug/sched/ or somesuch).

That way you get better code generation (compile time constants rule)
for !debug builds, while having the 'joy' of poking at your number on
debug builds.
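
As a rough userspace illustration of the pattern described above (not
kernel code): the value is a compile-time constant unless a debug option
is set, in which case it becomes a writable variable. The names
DEMO_SCHED_DEBUG, demo_const_debug, demo_capacity_margin and the value
1280 are assumptions made up for this sketch; the scheduler's own
const_debug macro follows the same idea.

```c
/*
 * "const unless debugging" sketch. All demo_* names and the value 1280
 * (1.25 in 1/1024 units) are hypothetical, chosen only for illustration.
 */
#ifdef DEMO_SCHED_DEBUG
/* Debug build: plain variable, a debug interface may rewrite it. */
#define demo_const_debug
#else
/* Non-debug build: compile-time constant the compiler can fold. */
#define demo_const_debug const
#endif

static demo_const_debug unsigned long demo_capacity_margin = 1280;

/* Scale a utilization value by margin/1024. */
static unsigned long demo_add_margin(unsigned long util)
{
	return util * demo_capacity_margin / 1024;
}

/* Returns 0 on success, -1 when the build does not allow tuning. */
static int demo_set_margin(unsigned long margin)
{
#ifdef DEMO_SCHED_DEBUG
	demo_capacity_margin = margin;
	return 0;
#else
	(void)margin;
	return -1;
#endif
}
```

Built without -DDEMO_SCHED_DEBUG, demo_set_margin() always fails and the
multiplication in demo_add_margin() uses a constant; built with it, the
same code reads a variable that can be poked at run time.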


Steve Muckle March 16, 2016, 3:36 a.m. UTC | #3
On 03/15/2016 03:37 PM, Michael Turquette wrote:
>>>> Yuck sysfs.. I would really rather we did not expose this per default.
>>>> And certainly not in this weird form.
>>>
>>> I'm happy to change capacity_margin to up_threshold and use a
>>> percentage.
>>>
>>> The sysfs approach has two benefits. First, it is aligned with cpufreq
>>> user expectations. Second, there has been rough consensus that this
>>> value should be tunable and sysfs gets us there quickly and painlessly.
>>> We're already exporting rate_limit_us for schedutil via sysfs. Is there
>>> a better interface you can recommend?
>>
>> It really depends on how tunable you want this to be. Do we always want
>> this to be a tunable, or just now while we're playing about with the
>> whole thing?
>
> I had considered this myself, and I really think that Steve and Juri
> should chime in as they have spent more time tuning and running the
> numbers.
> 
> I'm inclined to think that a debug version would be good enough, as I
> don't imagine this value being changed at run-time by some userspace
> daemon or something.
> 
> Then again, maybe this knob will be part of the mythical
> power-vs-performance slider?

Patrick Bellasi's schedtune series [0] (which I think is the referenced
mythical slider) aims to provide a more sophisticated interface for
tuning scheduler-driven frequency selection. In addition to a global
boost value it includes a cgroup controller as well for per-task tuning.

I would definitely expect the margin/boost value to be modified at
runtime, for example if the battery is running low, or the user wants
100% performance for a while, or the userspace framework wants to
temporarily tailor the performance level for a particular set of tasks, etc.

[0] http://article.gmane.org/gmane.linux.kernel/2022959
Peter Zijlstra March 16, 2016, 8:05 a.m. UTC | #4
On Tue, Mar 15, 2016 at 08:36:57PM -0700, Steve Muckle wrote:
> > Then again, maybe this knob will be part of the mythical
> > power-vs-performance slider?
> 
> Patrick Bellasi's schedtune series [0] (which I think is the referenced
> mythical slider) aims to provide a more sophisticated interface for
> tuning scheduler-driven frequency selection. In addition to a global
> boost value it includes a cgroup controller as well for per-task tuning.
> 
> I would definitely expect the margin/boost value to be modified at
> runtime, for example if the battery is running low, or the user wants
> 100% performance for a while, or the userspace framework wants to
> temporarily tailor the performance level for a particular set of tasks, etc.

OK, so how about we start with it as a debug knob, and once we have
experience and feel like it is indeed a useful runtime knob, we upgrade
it to ABI.

The problem with starting out as ABI is that it's hard to take away
again.
Juri Lelli March 16, 2016, 10:02 a.m. UTC | #5
Hi,

On 16/03/16 09:05, Peter Zijlstra wrote:
> On Tue, Mar 15, 2016 at 08:36:57PM -0700, Steve Muckle wrote:
> > > Then again, maybe this knob will be part of the mythical
> > > power-vs-performance slider?
> > 
> > Patrick Bellasi's schedtune series [0] (which I think is the referenced
> > mythical slider) aims to provide a more sophisticated interface for
> > tuning scheduler-driven frequency selection. In addition to a global
> > boost value it includes a cgroup controller as well for per-task tuning.
> > 
> > I would definitely expect the margin/boost value to be modified at
> > runtime, for example if the battery is running low, or the user wants
> > 100% performance for a while, or the userspace framework wants to
> > temporarily tailor the performance level for a particular set of tasks, etc.
> 
> OK, so how about we start with it as a debug knob, and once we have
> experience and feel like it is indeed a useful runtime knob, we upgrade
> it to ABI.
> 

I tend to agree here. To me the margin is something that we need to make
this thing work and to get acceptable performance out of the box. So we
can play with it while debugging, but I consider the schedtune slider as
the way to tune the system at runtime.

Best,

- Juri
Rafael J. Wysocki March 16, 2016, 12:45 p.m. UTC | #6
On Wed, Mar 16, 2016 at 9:05 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Mar 15, 2016 at 08:36:57PM -0700, Steve Muckle wrote:
>> > Then again, maybe this knob will be part of the mythical
>> > power-vs-performance slider?
>>
>> Patrick Bellasi's schedtune series [0] (which I think is the referenced
>> mythical slider) aims to provide a more sophisticated interface for
>> tuning scheduler-driven frequency selection. In addition to a global
>> boost value it includes a cgroup controller as well for per-task tuning.
>>
>> I would definitely expect the margin/boost value to be modified at
>> runtime, for example if the battery is running low, or the user wants
>> 100% performance for a while, or the userspace framework wants to
>> temporarily tailor the performance level for a particular set of tasks, etc.
>
> OK, so how about we start with it as a debug knob, and once we have
> experience and feel like it is indeed a useful runtime knob, we upgrade
> it to ABI.
>
> The problem with starting out as ABI is that it's hard to take away
> again.

Agreed, plus it is quite hard to get ABI right from the outset.  Even
if we decide on a sysfs knob, it still is unclear what exactly should
be represented by it in what units etc.
Steve Muckle March 16, 2016, 5:55 p.m. UTC | #7
On 03/16/2016 03:02 AM, Juri Lelli wrote:
> Hi,
> 
> On 16/03/16 09:05, Peter Zijlstra wrote:
>> On Tue, Mar 15, 2016 at 08:36:57PM -0700, Steve Muckle wrote:
>>>> Then again, maybe this knob will be part of the mythical
>>>> power-vs-performance slider?
>>>
>>> Patrick Bellasi's schedtune series [0] (which I think is the referenced
>>> mythical slider) aims to provide a more sophisticated interface for
>>> tuning scheduler-driven frequency selection. In addition to a global
>>> boost value it includes a cgroup controller as well for per-task tuning.
>>>
>>> I would definitely expect the margin/boost value to be modified at
>>> runtime, for example if the battery is running low, or the user wants
>>> 100% performance for a while, or the userspace framework wants to
>>> temporarily tailor the performance level for a particular set of tasks, etc.
>>
>> OK, so how about we start with it as a debug knob, and once we have
>> experience and feel like it is indeed a useful runtime knob, we upgrade
>> it to ABI.
>>
> 
> I tend to agree here. To me the margin is something that we need to make
> this thing work and to get acceptable performance out of the box. So we
> can play with it while debugging, but I consider the schedtune slider as
> the way to tune the system at runtime.

Could the default schedtune value not serve as the out of the box margin?

Regardless I agree that a debug interface is the way to go for now while
we figure things out.

Michael Turquette March 16, 2016, 10:03 p.m. UTC | #8
Quoting Steve Muckle (2016-03-15 20:36:57)
> On 03/15/2016 03:37 PM, Michael Turquette wrote:
> >>>> Yuck sysfs.. I would really rather we did not expose this per default.
> >>>> And certainly not in this weird form.
> >>>
> >>> I'm happy to change capacity_margin to up_threshold and use a
> >>> percentage.
> >>>
> >>> The sysfs approach has two benefits. First, it is aligned with cpufreq
> >>> user expectations. Second, there has been rough consensus that this
> >>> value should be tunable and sysfs gets us there quickly and painlessly.
> >>> We're already exporting rate_limit_us for schedutil via sysfs. Is there
> >>> a better interface you can recommend?
> >>
> >> It really depends on how tunable you want this to be. Do we always want
> >> this to be a tunable, or just now while we're playing about with the
> >> whole thing?
> >
> > I had considered this myself, and I really think that Steve and Juri
> > should chime in as they have spent more time tuning and running the
> > numbers.
> > 
> > I'm inclined to think that a debug version would be good enough, as I
> > don't imagine this value being changed at run-time by some userspace
> > daemon or something.
> > 
> > Then again, maybe this knob will be part of the mythical
> > power-vs-performance slider?
> 
> Patrick Bellasi's schedtune series [0] (which I think is the referenced
> mythical slider) aims to provide a more sophisticated interface for
> tuning scheduler-driven frequency selection. In addition to a global
> boost value it includes a cgroup controller as well for per-task tuning.

/me spends 15 seconds looking at schedtune

> 
> I would definitely expect the margin/boost value to be modified at
> runtime, for example if the battery is running low, or the user wants
> 100% performance for a while, or the userspace framework wants to
> temporarily tailor the performance level for a particular set of tasks, etc.

Right, and it looks like schedtune is a kernel solution, not a userspace
solution. The following three interfaces from patch #2 could be used by
schedtune:

unsigned long cpufreq_get_cfs_capacity_margin(void);
void cpufreq_set_cfs_capacity_margin(unsigned long margin);
void cpufreq_reset_cfs_capacity_margin(void);

Then we can let schedtune worry about the userspace ABI.

So I'll keep the basic idea of this patch, but explore making it
debuggy, instead of sysfsy.
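
A minimal userspace sketch of how an in-kernel consumer such as schedtune
might drive the three interfaces listed above. The function bodies are
stand-ins written for this sketch (the real ones live in patch #2 and are
not shown in this thread); DEMO_DEFAULT_MARGIN, its value, and the
demo_schedtune_* hooks are hypothetical.

```c
/* Hypothetical default; the real default is defined in patch #2. */
#define DEMO_DEFAULT_MARGIN 1280UL

static unsigned long demo_margin = DEMO_DEFAULT_MARGIN;

/* Userspace stand-ins with the three signatures quoted in this mail. */
unsigned long cpufreq_get_cfs_capacity_margin(void)
{
	return demo_margin;
}

void cpufreq_set_cfs_capacity_margin(unsigned long margin)
{
	demo_margin = margin;
}

void cpufreq_reset_cfs_capacity_margin(void)
{
	demo_margin = DEMO_DEFAULT_MARGIN;
}

/* Hypothetical schedtune-side hook: apply a margin while boosting. */
void demo_schedtune_boost_start(unsigned long margin)
{
	cpufreq_set_cfs_capacity_margin(margin);
}

/* Hypothetical schedtune-side hook: fall back to the default margin. */
void demo_schedtune_boost_stop(void)
{
	cpufreq_reset_cfs_capacity_margin();
}
```

In this arrangement the governor only ever reads the margin, and whatever
userspace ABI exists belongs to the caller of the set/reset helpers.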

Regards,
Mike

> 
> [0] http://article.gmane.org/gmane.linux.kernel/2022959
Michael Turquette March 16, 2016, 10:05 p.m. UTC | #9
Quoting Steve Muckle (2016-03-16 10:55:49)
> On 03/16/2016 03:02 AM, Juri Lelli wrote:
> > Hi,
> > 
> > On 16/03/16 09:05, Peter Zijlstra wrote:
> >> On Tue, Mar 15, 2016 at 08:36:57PM -0700, Steve Muckle wrote:
> >>>> Then again, maybe this knob will be part of the mythical
> >>>> power-vs-performance slider?
> >>>
> >>> Patrick Bellasi's schedtune series [0] (which I think is the referenced
> >>> mythical slider) aims to provide a more sophisticated interface for
> >>> tuning scheduler-driven frequency selection. In addition to a global
> >>> boost value it includes a cgroup controller as well for per-task tuning.
> >>>
> >>> I would definitely expect the margin/boost value to be modified at
> >>> runtime, for example if the battery is running low, or the user wants
> >>> 100% performance for a while, or the userspace framework wants to
> >>> temporarily tailor the performance level for a particular set of tasks, etc.
> >>
> >> OK, so how about we start with it as a debug knob, and once we have
> >> experience and feel like it is indeed a useful runtime knob, we upgrade
> >> it to ABI.
> >>
> > 
> > I tend to agree here. To me the margin is something that we need to make
> > this thing work and to get acceptable performance out of the box. So we
> > can play with it while debugging, but I consider the schedtune slider as
> > the way to tune the system at runtime.
> 
> Could the default schedtune value not serve as the out of the box margin?

It can. Let's keep the kernel interfaces in patch #2 for changing the
margin/threshold, and schedtune can call these interfaces.

Regards,
Mike

> 
> Regardless I agree that a debug interface is the way to go for now while
> we figure things out.
> 
Juri Lelli March 17, 2016, 9:40 a.m. UTC | #10
On 16/03/16 10:55, Steve Muckle wrote:
> On 03/16/2016 03:02 AM, Juri Lelli wrote:
> > Hi,
> > 
> > On 16/03/16 09:05, Peter Zijlstra wrote:
> >> On Tue, Mar 15, 2016 at 08:36:57PM -0700, Steve Muckle wrote:
> >>>> Then again, maybe this knob will be part of the mythical
> >>>> power-vs-performance slider?
> >>>
> >>> Patrick Bellasi's schedtune series [0] (which I think is the referenced
> >>> mythical slider) aims to provide a more sophisticated interface for
> >>> tuning scheduler-driven frequency selection. In addition to a global
> >>> boost value it includes a cgroup controller as well for per-task tuning.
> >>>
> >>> I would definitely expect the margin/boost value to be modified at
> >>> runtime, for example if the battery is running low, or the user wants
> >>> 100% performance for a while, or the userspace framework wants to
> >>> temporarily tailor the performance level for a particular set of tasks, etc.
> >>
> >> OK, so how about we start with it as a debug knob, and once we have
> >> experience and feel like it is indeed a useful runtime knob, we upgrade
> >> it to ABI.
> >>
> > 
> > I tend to agree here. To me the margin is something that we need to make
> > this thing work and to get acceptable performance out of the box. So we
> > can play with it while debugging, but I consider the schedtune slider as
> > the way to tune the system at runtime.
> 
> Could the default schedtune value not serve as the out of the box margin?
> 

I'm not sure I understand you here. For me schedtune should be disabled
by default, so I'd say that it doesn't introduce any additional margin
by default. But we still need a margin to make the governor work without
schedtune in the mix.

> Regardless I agree that a debug interface is the way to go for now while
> we figure things out.
> 

Looks good to me.

Best,

- Juri
Steve Muckle March 17, 2016, 1:55 p.m. UTC | #11
On 03/17/2016 02:40 AM, Juri Lelli wrote:
>> Could the default schedtune value not serve as the out of the box margin?
>>
> I'm not sure I understand you here. For me schedtune should be disabled
> by default, so I'd say that it doesn't introduce any additional margin
> by default. But we still need a margin to make the governor work without
> schedtune in the mix.

Why not have schedtune be enabled always, and use it to add the margin?
It seems like it'd simplify things.

I haven't looked at the schedtune code at all so I don't know whether
this makes sense given its current implementation. But conceptually I
don't know why we'd need or want one margin in schedutil which will be
tunable, and then another mechanism for tuning as well.
Patrick Bellasi March 17, 2016, 3:53 p.m. UTC | #12
On 17-Mar 06:55, Steve Muckle wrote:
> On 03/17/2016 02:40 AM, Juri Lelli wrote:
> >> Could the default schedtune value not serve as the out of the box margin?
> >>
> > I'm not sure I understand you here. For me schedtune should be disabled
> > by default, so I'd say that it doesn't introduce any additional margin
> > by default. But we still need a margin to make the governor work without
> > schedtune in the mix.
> 
> Why not have schedtune be enabled always, and use it to add the margin?
> It seems like it'd simplify things.

Actually one of the effects we noticed when SchedTune and SchedFreq
are both in use is that we have a sort of "double boosting" effect.

SchedTune boosts the CPU utilization signal, thus already providing a
sort of margin for the selection of the OPP. This margin overlaps with
the SchedFreq margin, which in turn could result in the selection of
an OPP even higher than required (with the boost already accounted for).
 
> I haven't looked at the schedtune code at all so I don't know whether
> this makes sense given its current implementation.

The current implementation requires review, of course ;-)
Last (and only) posting is based on top of SchedFreq code, as it was
at that time.

> But conceptually I don't know why we'd need or want one margin in
> schedutil which will be tunable, and then another mechanism for
> tuning as well.

I agree with Steve on the conceptual standpoint. The main goal of
SchedTune is actually to provide a "single tunable" to bias many
different subsystems in a "consistent" way. Thus, from a conceptual
standpoint, IMO it makes sense to investigate better how the boost value
can be linked with SchedFreq.

A possible option can be to:
1. use a hardcoded margin (M) defined by SchedFreq
   this margin is used to trigger OPP jumps
   when SchedTune _is not_ in use
2. "compose" the M margin with a boost value defined margin (B)
   when SchedTune _is_ in use

This means, e.g.
  schedfreq_margin = max(M, B)
Thus:
a) non boosted tasks (and in general when SchedTune is not in use)
   get OPP jumps based on the hardcoded M margin
b) boosted tasks can get more aggressive OPP jumps based on the B
   margin

While the M margin is hardcoded, the B one is defined via CGroups
depending on how much tasks need to be boosted.
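
The composition rule above could be sketched roughly as follows. This is
only an illustration: the percent units, the DEMO_M_MARGIN_PCT value and
the function name are assumptions, not taken from the SchedTune or
SchedFreq code.

```c
/* Hypothetical hardcoded M margin, expressed in percent. */
#define DEMO_M_MARGIN_PCT 25U

/*
 * schedfreq_margin = max(M, B) when SchedTune is in use,
 * plain M otherwise. boost_margin_pct is the B margin.
 */
static unsigned int demo_schedfreq_margin(int schedtune_in_use,
					  unsigned int boost_margin_pct)
{
	if (!schedtune_in_use)
		return DEMO_M_MARGIN_PCT;

	return boost_margin_pct > DEMO_M_MARGIN_PCT ?
	       boost_margin_pct : DEMO_M_MARGIN_PCT;
}
```

Taking the maximum rather than the sum means a small boost never lowers
the margin below M, and a large boost replaces M instead of stacking on
top of it, which is one way to sidestep the "double boosting" effect.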
Juri Lelli March 17, 2016, 5:54 p.m. UTC | #13
Hi,

On 17/03/16 15:53, Patrick Bellasi wrote:
> On 17-Mar 06:55, Steve Muckle wrote:
> > On 03/17/2016 02:40 AM, Juri Lelli wrote:
> > >> Could the default schedtune value not serve as the out of the box margin?
> > >>
> > > I'm not sure I understand you here. For me schedtune should be disabled
> > > by default, so I'd say that it doesn't introduce any additional margin
> > > by default. But we still need a margin to make the governor work without
> > > schedtune in the mix.
> > 
> > Why not have schedtune be enabled always, and use it to add the margin?
> > It seems like it'd simplify things.
> 
> Actually one of the effects we noticed when SchedTune and SchedFreq
> are both in use is that we have a sort of "double boosting" effect.
> 
> SchedTune boosts the CPU utilization signal, thus already providing a
> sort of margin for the selection of the OPP. This margin overlaps with
> the SchedFreq margin, which in turn could result in the selection of
> an OPP even higher than required (with the boost already accounted for).
>  
> > I haven't looked at the schedtune code at all so I don't know whether
> > this makes sense given its current implementation.
> 
> The current implementation requires review, of course ;-)
> Last (and only) posting is based on top of SchedFreq code, as it was
> at that time.
> 
> > But conceptually I don't know why we'd need or want one margin in
> > schedutil which will be tunable, and then another mechanism for
> > tuning as well.
> 
> I agree with Steve on the conceptual standpoint. The main goal of
> SchedTune is actually to provide a "single tunable" to bias many
> different subsystems in a "consistent" way. Thus, from a conceptual
> standpoint, IMO it makes sense to investigate better how the boost value
> can be linked with SchedFreq.
> 
> A possible option can be to:
> 1. use a hardcoded margin (M) defined by SchedFreq
>    this margin is used to trigger OPP jumps
>    when SchedTune _is not_ in use
> 2. "compose" the M margin with a boost value defined margin (B)
>    when SchedTune _is_ in use
> 
> This means, e.g.
>   schedfreq_margin = max(M, B)
> Thus:
> a) non boosted tasks (and in general when SchedTune is not in use)
>    get OPP jumps based on the hardcoded M margin
> b) boosted tasks can get more aggressive OPP jumps based on the B
>    margin
> 
> While the M margin is hardcoded, the B one is defined via CGroups
> depending on how much tasks need to be boosted.
> 

Makes sense to me. And I think M margin is the one we don't want to make
part of the ABI and only play with it under DEBUG.

Best,

- Juri
Michael Turquette March 17, 2016, 6:56 p.m. UTC | #14
Quoting Juri Lelli (2016-03-17 10:54:07)
> Hi,
> 
> On 17/03/16 15:53, Patrick Bellasi wrote:
> > On 17-Mar 06:55, Steve Muckle wrote:
> > > On 03/17/2016 02:40 AM, Juri Lelli wrote:
> > > >> Could the default schedtune value not serve as the out of the box margin?
> > > >>
> > > > I'm not sure I understand you here. For me schedtune should be disabled
> > > > by default, so I'd say that it doesn't introduce any additional margin
> > > > by default. But we still need a margin to make the governor work without
> > > > schedtune in the mix.
> > > 
> > > Why not have schedtune be enabled always, and use it to add the margin?
> > > It seems like it'd simplify things.
> > 
> > Actually one of the effects we noticed when SchedTune and SchedFreq
> > are both in use is that we have a sort of "double boosting" effect.
> > 
> > SchedTune boosts the CPU utilization signal, thus already providing a
> > sort of margin for the selection of the OPP. This margin overlaps with
> > the SchedFreq margin, which in turn could result in the selection of
> > an OPP even higher than required (with the boost already accounted for).
> >  
> > > I haven't looked at the schedtune code at all so I don't know whether
> > > this makes sense given its current implementation.
> > 
> > The current implementation requires review, of course ;-)
> > Last (and only) posting is based on top of SchedFreq code, as it was
> > at that time.
> > 
> > > But conceptually I don't know why we'd need or want one margin in
> > > schedutil which will be tunable, and then another mechanism for
> > > tuning as well.
> > 
> > I agree with Steve on the conceptual standpoint. The main goal of
> > SchedTune is actually to provide a "single tunable" to bias many
> > different subsystems in a "consistent" way. Thus, from a conceptual
> > standpoint, IMO it makes sense to investigate better how the boost value
> > can be linked with SchedFreq.
> > 
> > A possible option can be to:
> > 1. use a hardcoded margin (M) defined by SchedFreq
> >    this margin is used to trigger OPP jumps
> >    when SchedTune _is not_ in use
> > 2. "compose" the M margin with a boost value defined margin (B)
> >    when SchedTune _is_ in use
> > 
> > This means, e.g.
> >   schedfreq_margin = max(M, B)
> > Thus:
> > a) non boosted tasks (and in general when SchedTune is not in use)
> >    get OPP jumps based on the hardcoded M margin
> > b) boosted tasks can get more aggressive OPP jumps based on the B
> >    margin
> > 
> > While the M margin is hardcoded, the B one is defined via CGroups
> > depending on how much tasks need to be boosted.
> > 
> 
> Makes sense to me. And I think M margin is the one we don't want to make
> part of the ABI and only play with it under DEBUG.

Correct.

Regarding "composing" the margin, schedtune could even overwrite the
margin entirely via cpufreq_set_cfs_capacity_margin (see patch #2 in
this series). This avoids complications around a "double boosting"
effect.

Either way, it sounds like the schedtune angle is something that we can
figure out in due time and change the code as needed later on. For
schedutil to make sense for frequency-invariant platforms we do need a
margin today, and there is desire to tune it easily, so I will move this
sysfs knob to a debug knob in v2.

Regards,
Mike

> 
> Best,
> 
> - Juri
Rafael J. Wysocki March 17, 2016, 10:34 p.m. UTC | #15
On Thu, Mar 17, 2016 at 7:56 PM, Michael Turquette
<mturquette@baylibre.com> wrote:
> Quoting Juri Lelli (2016-03-17 10:54:07)
>> Hi,
>>
>> On 17/03/16 15:53, Patrick Bellasi wrote:
>> > On 17-Mar 06:55, Steve Muckle wrote:
>> > > On 03/17/2016 02:40 AM, Juri Lelli wrote:
>> > > >> Could the default schedtune value not serve as the out of the box margin?
>> > > >>
>> > > > I'm not sure I understand you here. For me schedtune should be disabled
>> > > > by default, so I'd say that it doesn't introduce any additional margin
>> > > > by default. But we still need a margin to make the governor work without
>> > > > schedtune in the mix.
>> > >
>> > > Why not have schedtune be enabled always, and use it to add the margin?
>> > > It seems like it'd simplify things.
>> >
>> > Actually one of the effects we noticed when SchedTune and SchedFreq
>> > are both in use is that we have a sort of "double boosting" effect.
>> >
>> > SchedTune boosts the CPU utilization signal, thus already providing a
>> > sort of margin for the selection of the OPP. This margin overlaps with
>> > the SchedFreq margin, which in turn could result in the selection of
>> > an OPP even higher than required (with the boost already accounted for).
>> >
>> > > I haven't looked at the schedtune code at all so I don't know whether
>> > > this makes sense given its current implementation.
>> >
>> > The current implementation requires review, of course ;-)
>> > Last (and only) posting is based on top of SchedFreq code, as it was
>> > at that time.
>> >
>> > > But conceptually I don't know why we'd need or want one margin in
>> > > schedutil which will be tunable, and then another mechanism for
>> > > tuning as well.
>> >
>> > I agree with Steve on the conceptual standpoint. The main goal of
>> > SchedTune is actually to provide a "single tunable" to bias many
>> > different subsystems in a "consistent" way. Thus, from a conceptual
>> > standpoint, IMO it makes sense to investigate better how the boost value
>> > can be linked with SchedFreq.
>> >
>> > A possible option can be to:
>> > 1. use a hardcoded margin (M) defined by SchedFreq
>> >    this margin is used to trigger OPP jumps
>> >    when SchedTune _is not_ in use
>> > 2. "compose" the M margin with a boost value defined margin (B)
>> >    when SchedTune _is_ in use
>> >
>> > This means, e.g.
>> >   schedfreq_margin = max(M, B)
>> > Thus:
>> > a) non boosted tasks (and in general when SchedTune is not in use)
>> >    get OPP jumps based on the hardcoded M margin
>> > b) boosted tasks can get more aggressive OPP jumps based on the B
>> >    margin
>> >
>> > While the M margin is hardcoded, the B one is defined via CGroups
>> > depending on how much tasks need to be boosted.
>> >
>>
>> Makes sense to me. And I think M margin is the one we don't want to make
>> part of the ABI and only play with it under DEBUG.
>
> Correct.
>
> Regarding "composing" the margin, schedtune could even overwrite the
> margin entirely via cpufreq_set_cfs_capacity_margin (see patch #2 in
> this series). This avoids complications around a "double boosting"
> effect.
>
> Either way, it sounds like the schedtune angle is something that we can
> figure out in due time and change the code as needed later on. For
> schedutil to make sense for frequency-invariant platforms we do need a
> margin today, and there is desire to tune it easily, so I will move this
> sysfs knob to a debug knob in v2.

Sounds good!

Also, if you look at the latest iteration of the schedutil patch
(https://patchwork.kernel.org/patch/8612561/), it maps the choice of
the margin to the choice of the frequency tipping point.  That is, the
value of (util / max) for which the frequency will stay the same as it
was before.  [For (util / max) below the tipping point the new
frequency will be less than the old one (unless it already is minimum)
and for (util / max) above it, the new frequency will be greater than
the old one.]

The tipping point seems to be a good candidate for a tunable to me,
because its meaning is well defined and the range of values that make
sense is quite easy to figure out too.
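
To make the tipping point concrete, here is a small sketch that assumes a
formula of the form next = 1.25 * cur * util / max. The exact expression
in the linked patch is not quoted in this thread, so the formula, the 25%
margin and the demo_next_freq name are assumptions. Under them the
tipping point is util / max = 1 / 1.25 = 0.8.

```c
/*
 * Assumed formula: next = (cur + cur / 4) * util / max.
 * With a 25% margin the frequency is unchanged at util / max = 0.8,
 * drops below that ratio and rises above it.
 */
static unsigned long demo_next_freq(unsigned long cur_khz,
				    unsigned long util,
				    unsigned long max)
{
	return (cur_khz + (cur_khz >> 2)) * util / max;
}
```

For cur_khz = 1000000 and max = 1000, util = 800 returns 1000000 (no
change), util = 600 returns 750000, and util = 900 returns 1125000. A
tunable expressed as the tipping ratio would have an obvious valid range
of (0, 1].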

Patch

diff --git a/drivers/cpufreq/cpufreq_schedutil.c b/drivers/cpufreq/cpufreq_schedutil.c
index 5aa26bf..12e49b9 100644
--- a/drivers/cpufreq/cpufreq_schedutil.c
+++ b/drivers/cpufreq/cpufreq_schedutil.c
@@ -246,8 +246,32 @@  static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *bu
 
 static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
 
+static ssize_t capacity_margin_show(struct gov_attr_set *not_used,
+					   char *buf)
+{
+	return sprintf(buf, "%lu\n", cpufreq_get_cfs_capacity_margin());
+}
+
+static ssize_t capacity_margin_store(struct gov_attr_set *attr_set,
+				  const char *buf, size_t count)
+{
+	unsigned long margin;
+	int ret;
+
+	ret = sscanf(buf, "%lu", &margin);
+	if (ret != 1)
+		return -EINVAL;
+
+	cpufreq_set_cfs_capacity_margin(margin);
+
+	return count;
+}
+
+static struct governor_attr capacity_margin = __ATTR_RW(capacity_margin);
+
 static struct attribute *sugov_attributes[] = {
 	&rate_limit_us.attr,
+	&capacity_margin.attr,
 	NULL
 };
 
@@ -381,6 +405,7 @@  static int sugov_exit(struct cpufreq_policy *policy)
 
 	mutex_lock(&global_tunables_lock);
 
+	cpufreq_reset_cfs_capacity_margin();
 	count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook);
 	policy->governor_data = NULL;
 	if (!count)