Message ID | 1457932932-28444-5-git-send-email-mturquette+renesas@baylibre.com (mailing list archive) |
---|---|
State | RFC, archived |
On Sun, Mar 13, 2016 at 10:22:08PM -0700, Michael Turquette wrote:
> With the addition of the global cfs capacity margin helpers in patch,
> "sched/cpufreq: new cfs capacity margin helpers", we can now export
> sysfs tunables from the schedutil governor. This allows privileged users
> to tune the value more easily.
>
> The margin value is global to cfs, not per-policy. As such schedutil
> does not store any state about the margin. Schedutil restores the margin
> value to its default value when exiting.

Yuck sysfs.. I would really rather we did not expose this per default.
And certainly not in this weird form.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

On Tue, Mar 15, 2016 at 02:40:43PM -0700, Michael Turquette wrote:
> Quoting Peter Zijlstra (2016-03-15 14:20:47)
> > On Sun, Mar 13, 2016 at 10:22:08PM -0700, Michael Turquette wrote:
> > > With the addition of the global cfs capacity margin helpers in patch,
> > > "sched/cpufreq: new cfs capacity margin helpers", we can now export
> > > sysfs tunables from the schedutil governor. This allows privileged users
> > > to tune the value more easily.
> > >
> > > The margin value is global to cfs, not per-policy. As such schedutil
> > > does not store any state about the margin. Schedutil restores the margin
> > > value to its default value when exiting.
> >
> > Yuck sysfs.. I would really rather we did not expose this per default.
> > And certainly not in this weird form.
>
> I'm happy to change capacity_margin to up_threshold and use a
> percentage.
>
> The sysfs approach has two benefits. First, it is aligned with cpufreq
> user expectations. Second, there has been rough consensus that this
> value should be tunable and sysfs gets us there quickly and painlessly.
> We're already exporting rate_limit_us for schedutil via sysfs. Is there
> a better interface you can recommend?

It really depends on how tunable you want this to be. Do we always want
this to be a tunable, or just now while we're playing about with the
whole thing?

The problem with exposing it in sysfs is that you cannot take it out
again, it becomes ABI.

What we do for all the scheduler tunables (pretty much every time we
have to take a value out of thin air), is we make them const for
!SCHED_DEBUG builds, but have them as sysctl for SCHED_DEBUG builds
(although we should probably move them to /debug/sched/ or somesuch).

That way you get better code generation (compile time constants rule)
for !debug builds, while having the 'joy' of poking at your number on
debug builds.

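The const-unless-debug pattern described in the message above can be sketched in plain C. This is only a userspace illustration under stated assumptions, not kernel code: the `SCHED_DEBUG` macro stands in for the kernel config option, and the `const_debug` macro, the `capacity_margin` variable, its default of 1280 and the `fits_capacity()` helper are names and values chosen for the example.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's debug config switch. */
#ifdef SCHED_DEBUG
#define const_debug             /* writable, so a debug knob can change it */
#else
#define const_debug const       /* compile-time constant the compiler can fold */
#endif

/* Assumed default: 1280/1024 means utilization must stay below 80% of capacity. */
static const_debug unsigned long capacity_margin = 1280;

/* True if `util`, scaled by the margin, still fits within `capacity`. */
static bool fits_capacity(unsigned long util, unsigned long capacity)
{
	return util * capacity_margin < capacity * 1024;
}

#ifdef SCHED_DEBUG
/* Only debug builds get a setter; other builds have nothing to poke at. */
static void debug_set_capacity_margin(unsigned long margin)
{
	capacity_margin = margin;
}
#endif
```

Built without `-DSCHED_DEBUG`, the margin is a constant and the comparison can be folded at compile time; built with it, the value becomes an ordinary variable that a debug interface could update.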
On 03/15/2016 03:37 PM, Michael Turquette wrote:
>>>> Yuck sysfs.. I would really rather we did not expose this per default.
>>>> And certainly not in this weird form.
>>>
>>> I'm happy to change capacity_margin to up_threshold and use a
>>> percentage.
>>>
>>> The sysfs approach has two benefits. First, it is aligned with cpufreq
>>> user expectations. Second, there has been rough consensus that this
>>> value should be tunable and sysfs gets us there quickly and painlessly.
>>> We're already exporting rate_limit_us for schedutil via sysfs. Is there
>>> a better interface you can recommend?
>>
>> It really depends on how tunable you want this to be. Do we always want
>> this to be a tunable, or just now while we're playing about with the
>> whole thing?
>
> I had considered this myself, and I really think that Steve and Juri
> should chime in as they have spent more time tuning and running the
> numbers.
>
> I'm inclined to think that a debug version would be good enough, as I
> don't imagine this value being changed at run-time by some userspace
> daemon or something.
>
> Then again, maybe this knob will be part of the mythical
> power-vs-performance slider?

Patrick Bellasi's schedtune series [0] (which I think is the referenced
mythical slider) aims to provide a more sophisticated interface for
tuning scheduler-driven frequency selection. In addition to a global
boost value it includes a cgroup controller as well for per-task tuning.

I would definitely expect the margin/boost value to be modified at
runtime, for example if the battery is running low, or the user wants
100% performance for a while, or the userspace framework wants to
temporarily tailor the performance level for a particular set of tasks, etc.

[0] http://article.gmane.org/gmane.linux.kernel/2022959

On Tue, Mar 15, 2016 at 08:36:57PM -0700, Steve Muckle wrote:
> > Then again, maybe this knob will be part of the mythical
> > power-vs-performance slider?
>
> Patrick Bellasi's schedtune series [0] (which I think is the referenced
> mythical slider) aims to provide a more sophisticated interface for
> tuning scheduler-driven frequency selection. In addition to a global
> boost value it includes a cgroup controller as well for per-task tuning.
>
> I would definitely expect the margin/boost value to be modified at
> runtime, for example if the battery is running low, or the user wants
> 100% performance for a while, or the userspace framework wants to
> temporarily tailor the performance level for a particular set of tasks, etc.

OK, so how about we start with it as a debug knob, and once we have
experience and feel like it is indeed a useful runtime knob, we upgrade
it to ABI.

The problem with starting out as ABI is that it's hard to take away
again.

Hi,

On 16/03/16 09:05, Peter Zijlstra wrote:
> On Tue, Mar 15, 2016 at 08:36:57PM -0700, Steve Muckle wrote:
> > > Then again, maybe this knob will be part of the mythical
> > > power-vs-performance slider?
> >
> > Patrick Bellasi's schedtune series [0] (which I think is the referenced
> > mythical slider) aims to provide a more sophisticated interface for
> > tuning scheduler-driven frequency selection. In addition to a global
> > boost value it includes a cgroup controller as well for per-task tuning.
> >
> > I would definitely expect the margin/boost value to be modified at
> > runtime, for example if the battery is running low, or the user wants
> > 100% performance for a while, or the userspace framework wants to
> > temporarily tailor the performance level for a particular set of tasks, etc.
>
> OK, so how about we start with it as a debug knob, and once we have
> experience and feel like it is indeed a useful runtime knob, we upgrade
> it to ABI.
>

I tend to agree here. To me the margin is something that we need to make
this thing work and to get acceptable performance out of the box. So we
can play with it while debugging, but I consider the schedtune slider as
the way to tune the system at runtime.

Best,

- Juri

On Wed, Mar 16, 2016 at 9:05 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Mar 15, 2016 at 08:36:57PM -0700, Steve Muckle wrote:
>> > Then again, maybe this knob will be part of the mythical
>> > power-vs-performance slider?
>>
>> Patrick Bellasi's schedtune series [0] (which I think is the referenced
>> mythical slider) aims to provide a more sophisticated interface for
>> tuning scheduler-driven frequency selection. In addition to a global
>> boost value it includes a cgroup controller as well for per-task tuning.
>>
>> I would definitely expect the margin/boost value to be modified at
>> runtime, for example if the battery is running low, or the user wants
>> 100% performance for a while, or the userspace framework wants to
>> temporarily tailor the performance level for a particular set of tasks, etc.
>
> OK, so how about we start with it as a debug knob, and once we have
> experience and feel like it is indeed a useful runtime knob, we upgrade
> it to ABI.
>
> The problem with starting out as ABI is that it's hard to take away
> again.

Agreed, plus it is quite hard to get ABI right from the outset.

Even if we decide on a sysfs knob, it is still unclear what exactly
should be represented by it, in what units, etc.

On 03/16/2016 03:02 AM, Juri Lelli wrote:
> Hi,
>
> On 16/03/16 09:05, Peter Zijlstra wrote:
>> On Tue, Mar 15, 2016 at 08:36:57PM -0700, Steve Muckle wrote:
>>>> Then again, maybe this knob will be part of the mythical
>>>> power-vs-performance slider?
>>>
>>> Patrick Bellasi's schedtune series [0] (which I think is the referenced
>>> mythical slider) aims to provide a more sophisticated interface for
>>> tuning scheduler-driven frequency selection. In addition to a global
>>> boost value it includes a cgroup controller as well for per-task tuning.
>>>
>>> I would definitely expect the margin/boost value to be modified at
>>> runtime, for example if the battery is running low, or the user wants
>>> 100% performance for a while, or the userspace framework wants to
>>> temporarily tailor the performance level for a particular set of tasks, etc.
>>
>> OK, so how about we start with it as a debug knob, and once we have
>> experience and feel like it is indeed a useful runtime knob, we upgrade
>> it to ABI.
>>
>
> I tend to agree here. To me the margin is something that we need to make
> this thing work and to get acceptable performance out of the box. So we
> can play with it while debugging, but I consider the schedtune slider as
> the way to tune the system at runtime.

Could the default schedtune value not serve as the out of the box margin?

Regardless I agree that a debug interface is the way to go for now while
we figure things out.

Quoting Steve Muckle (2016-03-15 20:36:57)
> On 03/15/2016 03:37 PM, Michael Turquette wrote:
> >>>> Yuck sysfs.. I would really rather we did not expose this per default.
> >>>> And certainly not in this weird form.
> >>>
> >>> I'm happy to change capacity_margin to up_threshold and use a
> >>> percentage.
> >>>
> >>> The sysfs approach has two benefits. First, it is aligned with cpufreq
> >>> user expectations. Second, there has been rough consensus that this
> >>> value should be tunable and sysfs gets us there quickly and painlessly.
> >>> We're already exporting rate_limit_us for schedutil via sysfs. Is there
> >>> a better interface you can recommend?
> >>
> >> It really depends on how tunable you want this to be. Do we always want
> >> this to be a tunable, or just now while we're playing about with the
> >> whole thing?
> >
> > I had considered this myself, and I really think that Steve and Juri
> > should chime in as they have spent more time tuning and running the
> > numbers.
> >
> > I'm inclined to think that a debug version would be good enough, as I
> > don't imagine this value being changed at run-time by some userspace
> > daemon or something.
> >
> > Then again, maybe this knob will be part of the mythical
> > power-vs-performance slider?
>
> Patrick Bellasi's schedtune series [0] (which I think is the referenced
> mythical slider) aims to provide a more sophisticated interface for
> tuning scheduler-driven frequency selection. In addition to a global
> boost value it includes a cgroup controller as well for per-task tuning.

/me spends 15 seconds looking at schedtune

>
> I would definitely expect the margin/boost value to be modified at
> runtime, for example if the battery is running low, or the user wants
> 100% performance for a while, or the userspace framework wants to
> temporarily tailor the performance level for a particular set of tasks, etc.

Right, and it looks like schedtune is a kernel solution, not a userspace
solution. The following three interfaces from patch #2 could be used by
schedtune:

unsigned long cpufreq_get_cfs_capacity_margin(void);
void cpufreq_set_cfs_capacity_margin(unsigned long margin);
void cpufreq_reset_cfs_capacity_margin(void);

Then we can let schedtune worry about the userspace ABI.

So I'll keep the basic idea of this patch, but explore making it
debuggy, instead of sysfsy.

Regards,
Mike

>
> [0] http://article.gmane.org/gmane.linux.kernel/2022959

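As a rough sketch of how an in-kernel client such as schedtune might drive the three helpers named in the message above, the following userspace mock may help. Only the three prototypes come from the thread. The stub bodies, the default value of 1280 and the `schedtune_update_margin()` caller are assumptions made for illustration and do not reflect the code in patch #2.

```c
/* Assumed default margin; the real default lives in patch #2. */
#define CFS_CAPACITY_MARGIN_DEFAULT 1280UL

/* One global value, mirroring "global to cfs, not per-policy". */
static unsigned long cfs_capacity_margin = CFS_CAPACITY_MARGIN_DEFAULT;

/* Illustrative stubs behind the prototypes quoted in the thread. */
unsigned long cpufreq_get_cfs_capacity_margin(void)
{
	return cfs_capacity_margin;
}

void cpufreq_set_cfs_capacity_margin(unsigned long margin)
{
	cfs_capacity_margin = margin;
}

void cpufreq_reset_cfs_capacity_margin(void)
{
	cfs_capacity_margin = CFS_CAPACITY_MARGIN_DEFAULT;
}

/*
 * Hypothetical schedtune-side caller: a non-zero boost margin overrides
 * the global margin, zero falls back to the default.
 */
void schedtune_update_margin(unsigned long boost_margin)
{
	if (boost_margin)
		cpufreq_set_cfs_capacity_margin(boost_margin);
	else
		cpufreq_reset_cfs_capacity_margin();
}
```

With this split, userspace never touches the margin directly; whatever ABI schedtune exposes decides when the setter is called.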
Quoting Steve Muckle (2016-03-16 10:55:49)
> On 03/16/2016 03:02 AM, Juri Lelli wrote:
> > Hi,
> >
> > On 16/03/16 09:05, Peter Zijlstra wrote:
> >> On Tue, Mar 15, 2016 at 08:36:57PM -0700, Steve Muckle wrote:
> >>>> Then again, maybe this knob will be part of the mythical
> >>>> power-vs-performance slider?
> >>>
> >>> Patrick Bellasi's schedtune series [0] (which I think is the referenced
> >>> mythical slider) aims to provide a more sophisticated interface for
> >>> tuning scheduler-driven frequency selection. In addition to a global
> >>> boost value it includes a cgroup controller as well for per-task tuning.
> >>>
> >>> I would definitely expect the margin/boost value to be modified at
> >>> runtime, for example if the battery is running low, or the user wants
> >>> 100% performance for a while, or the userspace framework wants to
> >>> temporarily tailor the performance level for a particular set of tasks, etc.
> >>
> >> OK, so how about we start with it as a debug knob, and once we have
> >> experience and feel like it is indeed a useful runtime knob, we upgrade
> >> it to ABI.
> >>
> >
> > I tend to agree here. To me the margin is something that we need to make
> > this thing work and to get acceptable performance out of the box. So we
> > can play with it while debugging, but I consider the schedtune slider as
> > the way to tune the system at runtime.
>
> Could the default schedtune value not serve as the out of the box margin?

It can. Let's keep the kernel interfaces in patch #2 for changing the
margin/threshold, and schedtune can call these interfaces.

Regards,
Mike

>
> Regardless I agree that a debug interface is the way to go for now while
> we figure things out.
>

On 16/03/16 10:55, Steve Muckle wrote:
> On 03/16/2016 03:02 AM, Juri Lelli wrote:
> > Hi,
> >
> > On 16/03/16 09:05, Peter Zijlstra wrote:
> >> On Tue, Mar 15, 2016 at 08:36:57PM -0700, Steve Muckle wrote:
> >>>> Then again, maybe this knob will be part of the mythical
> >>>> power-vs-performance slider?
> >>>
> >>> Patrick Bellasi's schedtune series [0] (which I think is the referenced
> >>> mythical slider) aims to provide a more sophisticated interface for
> >>> tuning scheduler-driven frequency selection. In addition to a global
> >>> boost value it includes a cgroup controller as well for per-task tuning.
> >>>
> >>> I would definitely expect the margin/boost value to be modified at
> >>> runtime, for example if the battery is running low, or the user wants
> >>> 100% performance for a while, or the userspace framework wants to
> >>> temporarily tailor the performance level for a particular set of tasks, etc.
> >>
> >> OK, so how about we start with it as a debug knob, and once we have
> >> experience and feel like it is indeed a useful runtime knob, we upgrade
> >> it to ABI.
> >>
> >
> > I tend to agree here. To me the margin is something that we need to make
> > this thing work and to get acceptable performance out of the box. So we
> > can play with it while debugging, but I consider the schedtune slider as
> > the way to tune the system at runtime.
>
> Could the default schedtune value not serve as the out of the box margin?
>

I'm not sure I understand you here. For me schedtune should be disabled
by default, so I'd say that it doesn't introduce any additional margin
by default. But we still need a margin to make the governor work without
schedtune in the mix.

> Regardless I agree that a debug interface is the way to go for now while
> we figure things out.
>

Looks good to me.

Best,

- Juri

On 03/17/2016 02:40 AM, Juri Lelli wrote:
>> Could the default schedtune value not serve as the out of the box margin?
>>
> I'm not sure I understand you here. For me schedtune should be disabled
> by default, so I'd say that it doesn't introduce any additional margin
> by default. But we still need a margin to make the governor work without
> schedtune in the mix.

Why not have schedtune be enabled always, and use it to add the margin?
It seems like it'd simplify things.

I haven't looked at the schedtune code at all so I don't know whether
this makes sense given its current implementation. But conceptually I
don't know why we'd need or want one margin in schedutil which will be
tunable, and then another mechanism for tuning as well.

On 17-Mar 06:55, Steve Muckle wrote:
> On 03/17/2016 02:40 AM, Juri Lelli wrote:
> >> Could the default schedtune value not serve as the out of the box margin?
> >>
> > I'm not sure I understand you here. For me schedtune should be disabled
> > by default, so I'd say that it doesn't introduce any additional margin
> > by default. But we still need a margin to make the governor work without
> > schedtune in the mix.
>
> Why not have schedtune be enabled always, and use it to add the margin?
> It seems like it'd simplify things.

Actually one of the effects we noticed when SchedTune and SchedFreq
are both in use is that we have a sort of "double boosting" effect.

SchedTune boosts the CPU utilization signal, thus already providing a
sort of margin for the selection of the OPP. This margin overlaps with
the SchedFreq margin, which in turn could result in the selection of
an OPP even higher than required (with boost already accounted for).

> I haven't looked at the schedtune code at all so I don't know whether
> this makes sense given its current implementation.

The current implementation requires review, of course ;-)
Last (and only) posting is based on top of SchedFreq code, as it was
at that time.

> But conceptually I don't know why we'd need or want one margin in
> schedutil which will be tunable, and then another mechanism for
> tuning as well.

I agree with Steve on the conceptual standpoint. The main goal of
SchedTune is actually to provide a "single tunable" to bias many
different subsystems in a "consistent" way. Thus, from a conceptual
standpoint, IMO it makes sense to investigate better how the boost value
can be linked with SchedFreq.

A possible option can be to:
 1. use a hardcoded margin (M) defined by SchedFreq
    this margin is used to trigger OPP jumps
    when SchedTune _is not_ in use
 2. "compose" the M margin with a boost value defined margin (B)
    when SchedTune _is_ in use

This means, e.g.
   schedfreq_margin = max(M, B)
Thus:
 a) non boosted tasks (and in general when SchedTune is not in use)
    get OPP jumps based on the hardcoded M margin
 b) boosted tasks can get more aggressive OPP jumps based on the B
    margin

While the M margin is hardcoded, the B one is defined via CGroups
depending on how much tasks need to be boosted.

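The max(M, B) composition proposed above could look roughly like the sketch below. It is an illustration under assumptions: the value chosen for M, the linear mapping from a boost percentage to B, and the function names are all hypothetical and are not taken from the SchedTune or SchedFreq code.

```c
#include <stdbool.h>

/* Hypothetical hardcoded margin M; 1024 would mean no headroom at all. */
#define SCHEDFREQ_MARGIN_M 1280UL

/* Assumed mapping from a cgroup boost percentage (0..100) to a margin B. */
static unsigned long boost_to_margin(unsigned int boost_pct)
{
	if (boost_pct > 100)
		boost_pct = 100;
	return 1024UL + 1024UL * boost_pct / 100;
}

/*
 * schedfreq_margin = max(M, B) when SchedTune is in use, M otherwise.
 * Taking the maximum rather than stacking the two avoids double boosting.
 */
unsigned long schedfreq_margin(bool schedtune_in_use, unsigned int boost_pct)
{
	unsigned long b;

	if (!schedtune_in_use)
		return SCHEDFREQ_MARGIN_M;

	b = boost_to_margin(boost_pct);
	return b > SCHEDFREQ_MARGIN_M ? b : SCHEDFREQ_MARGIN_M;
}
```

Under these assumed numbers a 10% boost still yields M, because B stays below it, while a 50% boost raises the margin to B.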
Hi,

On 17/03/16 15:53, Patrick Bellasi wrote:
> On 17-Mar 06:55, Steve Muckle wrote:
> > On 03/17/2016 02:40 AM, Juri Lelli wrote:
> > >> Could the default schedtune value not serve as the out of the box margin?
> > >>
> > > I'm not sure I understand you here. For me schedtune should be disabled
> > > by default, so I'd say that it doesn't introduce any additional margin
> > > by default. But we still need a margin to make the governor work without
> > > schedtune in the mix.
> >
> > Why not have schedtune be enabled always, and use it to add the margin?
> > It seems like it'd simplify things.
>
> Actually one of the effects we noticed when SchedTune and SchedFreq
> are both in use is that we have a sort of "double boosting" effect.
>
> SchedTune boosts the CPU utilization signal, thus already providing a
> sort of margin for the selection of the OPP. This margin overlaps with
> the SchedFreq margin, which in turn could result in the selection of
> an OPP even higher than required (with boost already accounted for).
>
> > I haven't looked at the schedtune code at all so I don't know whether
> > this makes sense given its current implementation.
>
> The current implementation requires review, of course ;-)
> Last (and only) posting is based on top of SchedFreq code, as it was
> at that time.
>
> > But conceptually I don't know why we'd need or want one margin in
> > schedutil which will be tunable, and then another mechanism for
> > tuning as well.
>
> I agree with Steve on the conceptual standpoint. The main goal of
> SchedTune is actually to provide a "single tunable" to bias many
> different subsystems in a "consistent" way. Thus, from a conceptual
> standpoint, IMO it makes sense to investigate better how the boost value
> can be linked with SchedFreq.
>
> A possible option can be to:
>  1. use a hardcoded margin (M) defined by SchedFreq
>     this margin is used to trigger OPP jumps
>     when SchedTune _is not_ in use
>  2. "compose" the M margin with a boost value defined margin (B)
>     when SchedTune _is_ in use
>
> This means, e.g.
>    schedfreq_margin = max(M, B)
> Thus:
>  a) non boosted tasks (and in general when SchedTune is not in use)
>     get OPP jumps based on the hardcoded M margin
>  b) boosted tasks can get more aggressive OPP jumps based on the B
>     margin
>
> While the M margin is hardcoded, the B one is defined via CGroups
> depending on how much tasks need to be boosted.
>

Makes sense to me. And I think M margin is the one we don't want to make
part of the ABI and only play with it under DEBUG.

Best,

- Juri

Quoting Juri Lelli (2016-03-17 10:54:07)
> Hi,
>
> On 17/03/16 15:53, Patrick Bellasi wrote:
> > On 17-Mar 06:55, Steve Muckle wrote:
> > > On 03/17/2016 02:40 AM, Juri Lelli wrote:
> > > >> Could the default schedtune value not serve as the out of the box margin?
> > > >>
> > > > I'm not sure I understand you here. For me schedtune should be disabled
> > > > by default, so I'd say that it doesn't introduce any additional margin
> > > > by default. But we still need a margin to make the governor work without
> > > > schedtune in the mix.
> > >
> > > Why not have schedtune be enabled always, and use it to add the margin?
> > > It seems like it'd simplify things.
> >
> > Actually one of the effects we noticed when SchedTune and SchedFreq
> > are both in use is that we have a sort of "double boosting" effect.
> >
> > SchedTune boosts the CPU utilization signal, thus already providing a
> > sort of margin for the selection of the OPP. This margin overlaps with
> > the SchedFreq margin, which in turn could result in the selection of
> > an OPP even higher than required (with boost already accounted for).
> >
> > > I haven't looked at the schedtune code at all so I don't know whether
> > > this makes sense given its current implementation.
> >
> > The current implementation requires review, of course ;-)
> > Last (and only) posting is based on top of SchedFreq code, as it was
> > at that time.
> >
> > > But conceptually I don't know why we'd need or want one margin in
> > > schedutil which will be tunable, and then another mechanism for
> > > tuning as well.
> >
> > I agree with Steve on the conceptual standpoint. The main goal of
> > SchedTune is actually to provide a "single tunable" to bias many
> > different subsystems in a "consistent" way. Thus, from a conceptual
> > standpoint, IMO it makes sense to investigate better how the boost value
> > can be linked with SchedFreq.
> >
> > A possible option can be to:
> >  1. use a hardcoded margin (M) defined by SchedFreq
> >     this margin is used to trigger OPP jumps
> >     when SchedTune _is not_ in use
> >  2. "compose" the M margin with a boost value defined margin (B)
> >     when SchedTune _is_ in use
> >
> > This means, e.g.
> >    schedfreq_margin = max(M, B)
> > Thus:
> >  a) non boosted tasks (and in general when SchedTune is not in use)
> >     get OPP jumps based on the hardcoded M margin
> >  b) boosted tasks can get more aggressive OPP jumps based on the B
> >     margin
> >
> > While the M margin is hardcoded, the B one is defined via CGroups
> > depending on how much tasks need to be boosted.
> >
>
> Makes sense to me. And I think M margin is the one we don't want to make
> part of the ABI and only play with it under DEBUG.

Correct.

Regarding "composing" the margin, schedtune could even overwrite the
margin entirely via cpufreq_set_cfs_capacity_margin (see patch #2 in
this series). This avoids complications around a "double boosting"
effect.

Either way, it sounds like the schedtune angle is something that we can
figure out in due time and change the code as needed later on. For
schedutil to make sense for frequency-invariant platforms we do need a
margin today, and there is desire to tune it easily, so I will move this
sysfs knob to a debug knob in v2.

Regards,
Mike

>
> Best,
>
> - Juri

On Thu, Mar 17, 2016 at 7:56 PM, Michael Turquette <mturquette@baylibre.com> wrote:
> Quoting Juri Lelli (2016-03-17 10:54:07)
>> Hi,
>>
>> On 17/03/16 15:53, Patrick Bellasi wrote:
>> > On 17-Mar 06:55, Steve Muckle wrote:
>> > > On 03/17/2016 02:40 AM, Juri Lelli wrote:
>> > > >> Could the default schedtune value not serve as the out of the box margin?
>> > > >>
>> > > > I'm not sure I understand you here. For me schedtune should be disabled
>> > > > by default, so I'd say that it doesn't introduce any additional margin
>> > > > by default. But we still need a margin to make the governor work without
>> > > > schedtune in the mix.
>> > >
>> > > Why not have schedtune be enabled always, and use it to add the margin?
>> > > It seems like it'd simplify things.
>> >
>> > Actually one of the effects we noticed when SchedTune and SchedFreq
>> > are both in use is that we have a sort of "double boosting" effect.
>> >
>> > SchedTune boosts the CPU utilization signal, thus already providing a
>> > sort of margin for the selection of the OPP. This margin overlaps with
>> > the SchedFreq margin, which in turn could result in the selection of
>> > an OPP even higher than required (with boost already accounted for).
>> >
>> > > I haven't looked at the schedtune code at all so I don't know whether
>> > > this makes sense given its current implementation.
>> >
>> > The current implementation requires review, of course ;-)
>> > Last (and only) posting is based on top of SchedFreq code, as it was
>> > at that time.
>> >
>> > > But conceptually I don't know why we'd need or want one margin in
>> > > schedutil which will be tunable, and then another mechanism for
>> > > tuning as well.
>> >
>> > I agree with Steve on the conceptual standpoint. The main goal of
>> > SchedTune is actually to provide a "single tunable" to bias many
>> > different subsystems in a "consistent" way. Thus, from a conceptual
>> > standpoint, IMO it makes sense to investigate better how the boost value
>> > can be linked with SchedFreq.
>> >
>> > A possible option can be to:
>> >  1. use a hardcoded margin (M) defined by SchedFreq
>> >     this margin is used to trigger OPP jumps
>> >     when SchedTune _is not_ in use
>> >  2. "compose" the M margin with a boost value defined margin (B)
>> >     when SchedTune _is_ in use
>> >
>> > This means, e.g.
>> >    schedfreq_margin = max(M, B)
>> > Thus:
>> >  a) non boosted tasks (and in general when SchedTune is not in use)
>> >     get OPP jumps based on the hardcoded M margin
>> >  b) boosted tasks can get more aggressive OPP jumps based on the B
>> >     margin
>> >
>> > While the M margin is hardcoded, the B one is defined via CGroups
>> > depending on how much tasks need to be boosted.
>> >
>>
>> Makes sense to me. And I think M margin is the one we don't want to make
>> part of the ABI and only play with it under DEBUG.
>
> Correct.
>
> Regarding "composing" the margin, schedtune could even overwrite the
> margin entirely via cpufreq_set_cfs_capacity_margin (see patch #2 in
> this series). This avoids complications around a "double boosting"
> effect.
>
> Either way, it sounds like the schedtune angle is something that we can
> figure out in due time and change the code as needed later on. For
> schedutil to make sense for frequency-invariant platforms we do need a
> margin today, and there is desire to tune it easily, so I will move this
> sysfs knob to a debug knob in v2.

Sounds good!

Also, if you look at the latest iteration of the schedutil patch
(https://patchwork.kernel.org/patch/8612561/), it maps the choice of
the margin to the choice of the frequency tipping point. That is, the
value of (util / max) for which the frequency will stay the same as it
was before. [For (util / max) below the tipping point the new frequency
will be less than the old one (unless it already is minimum) and for
(util / max) above it, the new frequency will be greater than the old
one.]

The tipping point seems to be a good candidate for a tunable to me,
because its meaning is well defined and the range of values that make
sense is quite easy to figure out too.

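To make the tipping point concrete, the sketch below assumes an update rule of the form next_freq = C * cur_freq * util / max with C = 1.25, which puts the tipping point at util / max = 1 / C = 0.8. The rule, the constant and the function name `next_freq_khz()` are assumptions for illustration; the thread does not spell them out, and the linked patch should be consulted for the actual formula.

```c
/*
 * Assumed update rule: next = 1.25 * cur * util / max.
 * 1.25 * cur is computed as cur + cur / 4 to stay in integer arithmetic.
 */
unsigned long next_freq_khz(unsigned long cur_khz,
			    unsigned long util, unsigned long max)
{
	unsigned long long scaled = cur_khz + (cur_khz >> 2);

	return (unsigned long)(scaled * util / max);
}
```

With these assumptions, util / max = 0.8 leaves the frequency unchanged, lower ratios reduce it and higher ratios raise it, which matches the bracketed description above. A real governor would also clamp the result to the policy's minimum and maximum frequencies, which this sketch omits.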
diff --git a/drivers/cpufreq/cpufreq_schedutil.c b/drivers/cpufreq/cpufreq_schedutil.c
index 5aa26bf..12e49b9 100644
--- a/drivers/cpufreq/cpufreq_schedutil.c
+++ b/drivers/cpufreq/cpufreq_schedutil.c
@@ -246,8 +246,32 @@ static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *bu
 
 static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
 
+static ssize_t capacity_margin_show(struct gov_attr_set *not_used,
+				    char *buf)
+{
+	return sprintf(buf, "%lu\n", cpufreq_get_cfs_capacity_margin());
+}
+
+static ssize_t capacity_margin_store(struct gov_attr_set *attr_set,
+				     const char *buf, size_t count)
+{
+	unsigned long margin;
+	int ret;
+
+	ret = sscanf(buf, "%lu", &margin);
+	if (ret != 1)
+		return -EINVAL;
+
+	cpufreq_set_cfs_capacity_margin(margin);
+
+	return count;
+}
+
+static struct governor_attr capacity_margin = __ATTR_RW(capacity_margin);
+
 static struct attribute *sugov_attributes[] = {
 	&rate_limit_us.attr,
+	&capacity_margin.attr,
 	NULL
 };
 
@@ -381,6 +405,7 @@ static int sugov_exit(struct cpufreq_policy *policy)
 
 	mutex_lock(&global_tunables_lock);
 
+	cpufreq_reset_cfs_capacity_margin();
 	count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook);
 	policy->governor_data = NULL;
 	if (!count)
With the addition of the global cfs capacity margin helpers in patch,
"sched/cpufreq: new cfs capacity margin helpers", we can now export
sysfs tunables from the schedutil governor. This allows privileged users
to tune the value more easily.

The margin value is global to cfs, not per-policy. As such schedutil
does not store any state about the margin. Schedutil restores the margin
value to its default value when exiting.

Signed-off-by: Michael Turquette <mturquette+renesas@baylibre.com>
---
 drivers/cpufreq/cpufreq_schedutil.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)