Message ID | 20190320094918.20234-2-rnayak@codeaurora.org (mailing list archive) |
---|---|
State | RFC, archived |
Headers | show |
Series | DVFS in the OPP core | expand |
On 20-03-19, 15:19, Rajendra Nayak wrote: > From: Stephen Boyd <swboyd@chromium.org> > > Doing this allows us to call this API with any rate requested and have > it not need to match in the OPP table. Instead, we'll round the rate up > to the nearest OPP that we see so that we can get the voltage or level > that's required for that OPP. This supports users of OPP that want to > specify the 'fmax' tables of a device instead of every single frequency > that they need. And for devices that required the exact frequency, we > can rely on the clk framework to round the rate to the nearest supported > frequency instead of the OPP framework to do so. > > Note that this may affect drivers that don't want the clk framework to > do rounding, but instead want the OPP table to do the rounding for them. > Do we have that case? Should we add some flag to the OPP table to > indicate this and then not have that flag set when there isn't an OPP > table for the device and also introduce a property like 'opp-use-clk' to > tell the table that it should use the clk APIs to round rates instead of > OPP? > > Signed-off-by: Stephen Boyd <swboyd@chromium.org> > Signed-off-by: Rajendra Nayak <rnayak@codeaurora.org> > --- > drivers/opp/core.c | 8 +++++--- > 1 file changed, 5 insertions(+), 3 deletions(-) > > diff --git a/drivers/opp/core.c b/drivers/opp/core.c > index 0420f7e8ad5b..bc9a7762dd4c 100644 > --- a/drivers/opp/core.c > +++ b/drivers/opp/core.c > @@ -703,7 +703,7 @@ static int _set_required_opps(struct device *dev, > int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq) > { > struct opp_table *opp_table; > - unsigned long freq, old_freq; > + unsigned long freq, opp_freq, old_freq, old_opp_freq; > struct dev_pm_opp *old_opp, *opp; > struct clk *clk; > int ret; > @@ -742,13 +742,15 @@ int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq) > goto put_opp_table; > } > > - old_opp = _find_freq_ceil(opp_table, &old_freq); > + old_opp_freq = old_freq; > + old_opp = _find_freq_ceil(opp_table, &old_opp_freq); > if (IS_ERR(old_opp)) { > dev_err(dev, "%s: failed to find current OPP for freq %lu (%ld)\n", > __func__, old_freq, PTR_ERR(old_opp)); > } > > - opp = _find_freq_ceil(opp_table, &freq); > + opp_freq = freq; > + opp = _find_freq_ceil(opp_table, &opp_freq); > if (IS_ERR(opp)) { > ret = PTR_ERR(opp); > dev_err(dev, "%s: failed to find OPP for freq %lu (%d)\n", I see a logical problem with this patch. Suppose the clock driver supports following frequencies: 500M, 800M, 1G, 1.2G and the OPP table contains following list: 500M, 1G, 1.2G (i.e. missing 800M). Now 800M should never get programmed as it isn't part of the OPP table. But if you pass 600M to opp-set-rate, then it will end up selecting 800M as clock driver will round up to the closest value. Even if no one is doing this right now, it is a sensible usecase, specially during testing of patches and I don't think we should avoid it. What exactly is the use case for which we need this patch ? What kind of driver ? Some detail can be helpful to find another solution that fixes this problem.
On 6/11/2019 4:24 PM, Viresh Kumar wrote: > On 20-03-19, 15:19, Rajendra Nayak wrote: >> From: Stephen Boyd <swboyd@chromium.org> >> >> Doing this allows us to call this API with any rate requested and have >> it not need to match in the OPP table. Instead, we'll round the rate up >> to the nearest OPP that we see so that we can get the voltage or level >> that's required for that OPP. This supports users of OPP that want to >> specify the 'fmax' tables of a device instead of every single frequency >> that they need. And for devices that required the exact frequency, we >> can rely on the clk framework to round the rate to the nearest supported >> frequency instead of the OPP framework to do so. >> >> Note that this may affect drivers that don't want the clk framework to >> do rounding, but instead want the OPP table to do the rounding for them. >> Do we have that case? Should we add some flag to the OPP table to >> indicate this and then not have that flag set when there isn't an OPP >> table for the device and also introduce a property like 'opp-use-clk' to >> tell the table that it should use the clk APIs to round rates instead of >> OPP? >> >> Signed-off-by: Stephen Boyd <swboyd@chromium.org> >> Signed-off-by: Rajendra Nayak <rnayak@codeaurora.org> >> --- []... > > I see a logical problem with this patch. > > Suppose the clock driver supports following frequencies: 500M, 800M, > 1G, 1.2G and the OPP table contains following list: 500M, 1G, 1.2G > (i.e. missing 800M). > > Now 800M should never get programmed as it isn't part of the OPP > table. But if you pass 600M to opp-set-rate, then it will end up > selecting 800M as clock driver will round up to the closest value. correct > > Even if no one is doing this right now, it is a sensible usecase, > specially during testing of patches and I don't think we should avoid > it. > > What exactly is the use case for which we need this patch ? Like the changelog says 'This supports users of OPP that want to specify the 'fmax' tables of a device instead of every single frequency that they need' so the 'fmax' tables basically say what the max frequency the device can operate at for a given performance state/voltage level. so in your example it would be for instance 500M, Perf state = 2 1G, Perf state = 3 1.2G, Perf state = 4 Now when the device wants to operate at say 800Mhz, you need to set the Perf state to 3, so this patch basically avoids you having to put those additional OPPs in the table which would otherwise look something like this 500M, Perf state = 2 800M, Perf state = 3 <-- redundant OPP 1G, Perf state = 3 1.2G, Perf state = 4 Your example had just 1 missing entry in the 'fmax' tables in reality its a lot more, atleast on all qualcomm platforms.
On 12-06-19, 13:12, Rajendra Nayak wrote: > so the 'fmax' tables basically say what the max frequency the device can > operate at for a given performance state/voltage level. > > so in your example it would be for instance > > 500M, Perf state = 2 > 1G, Perf state = 3 > 1.2G, Perf state = 4 > > Now when the device wants to operate at say 800Mhz, you need to set the > Perf state to 3, so this patch basically avoids you having to put those additional > OPPs in the table which would otherwise look something like this > > 500M, Perf state = 2 > 800M, Perf state = 3 <-- redundant OPP > 1G, Perf state = 3 > 1.2G, Perf state = 4 > > Your example had just 1 missing entry in the 'fmax' tables in reality its a lot more, > atleast on all qualcomm platforms. Okay, I have applied this patch (alone) to the OPP tree with minor modifications in commit log and diff.
On 12-06-19, 13:55, Viresh Kumar wrote: > Okay, I have applied this patch (alone) to the OPP tree with minor > modifications in commit log and diff. And I have removed it now :) I am confused as hell on what we should be doing and what we are doing right now. And if we should do better. Let me explain with an example. - The clock provider supports following frequencies: 500, 600, 700, 800, 900, 1000 MHz. - The OPP table contains/supports only a subset: 500, 700, 1000 MHz. Now, the request to change the frequency starts from cpufreq governors, like schedutil when they calls: __cpufreq_driver_target(policy, 599 MHz, CPUFREQ_RELATION_L); CPUFREQ_RELATION_L means: lowest frequency at or above target. And so I would expect the frequency to get set to 600MHz (if we look at clock driver) or 700MHz (if we look at OPP table). I think we should decide this thing from the OPP table only as that's what the platform guys want us to use. So, we should end up with 700 MHz. Then we land into dev_pm_opp_set_rate(), which does this (which is code copied from earlier version of cpufreq-dt driver): - clk_round_rate(clk, 599 MHz). clk_round_rate() returns the highest frequency lower than target. So it must return 500 MHz (I haven't tested this yet, all theoretical). - _find_freq_ceil(opp_table, 500 MHz). This works like CPUFREQ_RELATION_L, so we find lowest frequency >= target freq. And so we should get: 500 MHz itself as OPP table has it. - clk_set_rate(clk, 500 MHz). This must be doing round-rate again, but I think we will settle with 500 MHz eventually. Now the questionnaire: - Is this whole exercise correct ? - We shouldn't have landed on 500 MHz, right ? - Is there anything wrong with the above theory (I am going to test it soon though). - Why do we need to do the first clock_round_rate() ? Should we remove it ? Now lets move to this patch, which makes it more confusing. The OPP tables for CPUs and GPUs should already be somewhat like fmax tables for particular voltage values and that's why both cpufreq and OPP core try to find a frequency higher than target so we choose the most optimum one power-efficiency wise. For cases where the OPP table is only a subset of the clk-providers table (almost always), if we let the clock provider to find the nearest frequency (which is lower) we will run the CPU/GPU at a not-so-optimal frequency. i.e. if 500, 600, 700 MHz all need voltage to be 1.2 V, we should be running at 700 always, while we may end up running at 500 MHz. This kind of behavior (introduced by this patch) is important for other devices which want to run at the nearest frequency to target one, but not for CPUs/GPUs. So, we need to tag these IO devices separately, maybe from DT ? So we select the closest match instead of most optimal one. But lets fix the existing issues first and then think about this patch.
On 13-06-19, 15:24, Viresh Kumar wrote: > I am confused as hell on what we should be doing and what we are doing > right now. And if we should do better. > > Let me explain with an example. > > - The clock provider supports following frequencies: 500, 600, 700, > 800, 900, 1000 MHz. > > - The OPP table contains/supports only a subset: 500, 700, 1000 MHz. > > Now, the request to change the frequency starts from cpufreq > governors, like schedutil when they calls: > > __cpufreq_driver_target(policy, 599 MHz, CPUFREQ_RELATION_L); > > CPUFREQ_RELATION_L means: lowest frequency at or above target. And so > I would expect the frequency to get set to 600MHz (if we look at clock > driver) or 700MHz (if we look at OPP table). I think we should decide > this thing from the OPP table only as that's what the platform guys > want us to use. So, we should end up with 700 MHz. > > Then we land into dev_pm_opp_set_rate(), which does this (which is > code copied from earlier version of cpufreq-dt driver): > > - clk_round_rate(clk, 599 MHz). > > clk_round_rate() returns the highest frequency lower than target. So > it must return 500 MHz (I haven't tested this yet, all theoretical). > > - _find_freq_ceil(opp_table, 500 MHz). > > This works like CPUFREQ_RELATION_L, so we find lowest frequency >= > target freq. And so we should get: 500 MHz itself as OPP table has > it. > > - clk_set_rate(clk, 500 MHz). > > This must be doing round-rate again, but I think we will settle with > 500 MHz eventually. > > > Now the questionnaire: > > - Is this whole exercise correct ? No, I missed the call to cpufreq_frequency_table_target() in __cpufreq_driver_target() which finds the exact frequency from cpufreq table (which was created using opp table) and so we never screw up here. Sorry for confusing everyone on this :( > Now lets move to this patch, which makes it more confusing. > > The OPP tables for CPUs and GPUs should already be somewhat like fmax > tables for particular voltage values and that's why both cpufreq and > OPP core try to find a frequency higher than target so we choose the > most optimum one power-efficiency wise. > > For cases where the OPP table is only a subset of the clk-providers > table (almost always), if we let the clock provider to find the > nearest frequency (which is lower) we will run the CPU/GPU at a > not-so-optimal frequency. i.e. if 500, 600, 700 MHz all need voltage > to be 1.2 V, we should be running at 700 always, while we may end up > running at 500 MHz. This won't happen for CPUs because of the reason I explained earlier. cpufreq core does the rounding before calling dev_pm_opp_set_rate(). And no other devices use dev_pm_opp_set_rate() right now apart from CPUs, so we are not going to break anything. > This kind of behavior (introduced by this patch) is important for > other devices which want to run at the nearest frequency to target > one, but not for CPUs/GPUs. So, we need to tag these IO devices > separately, maybe from DT ? So we select the closest match instead of > most optimal one. Hmm, so this patch won't break anything and I am inclined to apply it again :) Does anyone see any other issues with it, which I might be missing ?
> Now, the request to change the frequency starts from cpufreq > governors, like schedutil when they calls: > > __cpufreq_driver_target(policy, 599 MHz, CPUFREQ_RELATION_L); > > CPUFREQ_RELATION_L means: lowest frequency at or above target. And so > I would expect the frequency to get set to 600MHz (if we look at clock > driver) or 700MHz (if we look at OPP table). I think we should decide > this thing from the OPP table only as that's what the platform guys > want us to use. So, we should end up with 700 MHz. > > Then we land into dev_pm_opp_set_rate(), which does this (which is > code copied from earlier version of cpufreq-dt driver): so before we land into dev_pm_opp_set_rate() from a __cpufreq_driver_target() I guess we do have a cpufreq driver callback that gets called in between? which is either .target_index or .target In case of .target_index, the cpufreq core looks for a OPP index and we would land up with 700Mhz i guess, so we are good. In case of .target though the 'relation' CPUFREQ_RELATION_L does get passed over to the cpufreq driver which I am guessing is expected to handle it in some way to make sure the target frequency set is not less than whats requested? instead of simply passing the requested frequency over to dev_pm_opp_set_rate()? Looking at all the existing cpufreq drivers upstream, while most support .target_index the 3 which do support .target seem to completely ignore this 'relation' input that's passed to them. drivers/cpufreq/cppc_cpufreq.c: .target = cppc_cpufreq_set_target, drivers/cpufreq/cpufreq-nforce2.c: .target = nforce2_target, drivers/cpufreq/pcc-cpufreq.c: .target = pcc_cpufreq_target, > This kind of behavior (introduced by this patch) is important for > other devices which want to run at the nearest frequency to target > one, but not for CPUs/GPUs. So, we need to tag these IO devices > separately, maybe from DT ? So we select the closest match instead of > most optimal one. yes we do need some way to distinguish between CPU/GPU devices and other IO devices. CPU/GPU can always run at fmax for a given voltage, that's not true for IO devices and I don't see how we can satisfy both cases without clearly knowing if we are serving a processor or an IO device, unless the higher layers (cpufreq/devfreq) are able to handle this somehow without expecting the OPP layer to handle the differences.
On 14-06-19, 10:57, Viresh Kumar wrote: > Hmm, so this patch won't break anything and I am inclined to apply it again :) > > Does anyone see any other issues with it, which I might be missing ? I have updated the commit log a bit more to clarify on things, please let me know if it looks okay. opp: Don't overwrite rounded clk rate The OPP table normally contains 'fmax' values corresponding to the voltage or performance levels of each OPP, but we don't necessarily want all the devices to run at fmax all the time. Running at fmax makes sense for devices like CPU/GPU, which have a finite amount of work to do and since a specific amount of energy is consumed at an OPP, its better to run at the highest possible frequency for that voltage value. On the other hand, we have IO devices which need to run at specific frequencies only for their proper functioning, instead of maximum possible frequency. The OPP core currently roundup to the next possible OPP for a frequency and select the fmax value. To support the IO devices by the OPP core, lets do the roundup to fetch the voltage or performance state values, but not use the OPP frequency value. Rather use the value returned by clk_round_rate(). The current user, cpufreq, of dev_pm_opp_set_rate() already does the rounding to the next OPP before calling this routine and it won't have any side affects because of this change. Signed-off-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Rajendra Nayak <rnayak@codeaurora.org> [ Viresh: Massaged changelog and use temp_opp variable instead ] Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
On 6/17/2019 9:20 AM, Viresh Kumar wrote: > On 14-06-19, 10:57, Viresh Kumar wrote: >> Hmm, so this patch won't break anything and I am inclined to apply it again :) >> >> Does anyone see any other issues with it, which I might be missing ? > > I have updated the commit log a bit more to clarify on things, please let me > know if it looks okay. > > opp: Don't overwrite rounded clk rate > > The OPP table normally contains 'fmax' values corresponding to the > voltage or performance levels of each OPP, but we don't necessarily want > all the devices to run at fmax all the time. Running at fmax makes sense > for devices like CPU/GPU, which have a finite amount of work to do and > since a specific amount of energy is consumed at an OPP, its better to > run at the highest possible frequency for that voltage value. > > On the other hand, we have IO devices which need to run at specific > frequencies only for their proper functioning, instead of maximum > possible frequency. > > The OPP core currently roundup to the next possible OPP for a frequency > and select the fmax value. To support the IO devices by the OPP core, > lets do the roundup to fetch the voltage or performance state values, > but not use the OPP frequency value. Rather use the value returned by > clk_round_rate(). > > The current user, cpufreq, of dev_pm_opp_set_rate() already does the > rounding to the next OPP before calling this routine and it won't > have any side affects because of this change. Looks good to me. Should this also be documented someplace that dev_pm_opp_set_rate() would not be able to distinguish between its users trying to scale CPU/GPU's vs IO devices, so its the callers responsibility to round it accordingly before calling the API? > > Signed-off-by: Stephen Boyd <swboyd@chromium.org> > Signed-off-by: Rajendra Nayak <rnayak@codeaurora.org> > [ Viresh: Massaged changelog and use temp_opp variable instead ] > Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> > >
On 17-06-19, 09:37, Rajendra Nayak wrote: > > > On 6/17/2019 9:20 AM, Viresh Kumar wrote: > > On 14-06-19, 10:57, Viresh Kumar wrote: > > > Hmm, so this patch won't break anything and I am inclined to apply it again :) > > > > > > Does anyone see any other issues with it, which I might be missing ? > > > > I have updated the commit log a bit more to clarify on things, please let me > > know if it looks okay. > > > > opp: Don't overwrite rounded clk rate > > The OPP table normally contains 'fmax' values corresponding to the > > voltage or performance levels of each OPP, but we don't necessarily want > > all the devices to run at fmax all the time. Running at fmax makes sense > > for devices like CPU/GPU, which have a finite amount of work to do and > > since a specific amount of energy is consumed at an OPP, its better to > > run at the highest possible frequency for that voltage value. > > On the other hand, we have IO devices which need to run at specific > > frequencies only for their proper functioning, instead of maximum > > possible frequency. > > The OPP core currently roundup to the next possible OPP for a frequency > > and select the fmax value. To support the IO devices by the OPP core, > > lets do the roundup to fetch the voltage or performance state values, > > but not use the OPP frequency value. Rather use the value returned by > > clk_round_rate(). > > The current user, cpufreq, of dev_pm_opp_set_rate() already does the > > rounding to the next OPP before calling this routine and it won't > > have any side affects because of this change. > > Looks good to me. Should this also be documented someplace that dev_pm_opp_set_rate() > would not be able to distinguish between its users trying to scale CPU/GPU's vs IO > devices, so its the callers responsibility to round it accordingly before calling the > API? diff --git a/drivers/opp/core.c b/drivers/opp/core.c index 0fbc77f05048..bae94bfa1e96 100644 --- a/drivers/opp/core.c +++ b/drivers/opp/core.c @@ -751,8 +751,11 @@ static int _set_required_opps(struct device *dev, * @dev: device for which we do this operation * @target_freq: frequency to achieve * - * This configures the power-supplies and clock source to the levels specified - * by the OPP corresponding to the target_freq. + * This configures the power-supplies to the levels specified by the OPP + * corresponding to the target_freq, and programs the clock to a value <= + * target_freq, as rounded by clk_round_rate(). Device wanting to run at fmax + * provided by the opp, should have already rounded to the target OPP's + * frequency. */
On 6/17/2019 9:47 AM, Viresh Kumar wrote: > On 17-06-19, 09:37, Rajendra Nayak wrote: >> >> >> On 6/17/2019 9:20 AM, Viresh Kumar wrote: >>> On 14-06-19, 10:57, Viresh Kumar wrote: >>>> Hmm, so this patch won't break anything and I am inclined to apply it again :) >>>> >>>> Does anyone see any other issues with it, which I might be missing ? >>> >>> I have updated the commit log a bit more to clarify on things, please let me >>> know if it looks okay. >>> >>> opp: Don't overwrite rounded clk rate >>> The OPP table normally contains 'fmax' values corresponding to the >>> voltage or performance levels of each OPP, but we don't necessarily want >>> all the devices to run at fmax all the time. Running at fmax makes sense >>> for devices like CPU/GPU, which have a finite amount of work to do and >>> since a specific amount of energy is consumed at an OPP, its better to >>> run at the highest possible frequency for that voltage value. >>> On the other hand, we have IO devices which need to run at specific >>> frequencies only for their proper functioning, instead of maximum >>> possible frequency. >>> The OPP core currently roundup to the next possible OPP for a frequency >>> and select the fmax value. To support the IO devices by the OPP core, >>> lets do the roundup to fetch the voltage or performance state values, >>> but not use the OPP frequency value. Rather use the value returned by >>> clk_round_rate(). >>> The current user, cpufreq, of dev_pm_opp_set_rate() already does the >>> rounding to the next OPP before calling this routine and it won't >>> have any side affects because of this change. >> >> Looks good to me. Should this also be documented someplace that dev_pm_opp_set_rate() >> would not be able to distinguish between its users trying to scale CPU/GPU's vs IO >> devices, so its the callers responsibility to round it accordingly before calling the >> API? > > diff --git a/drivers/opp/core.c b/drivers/opp/core.c > index 0fbc77f05048..bae94bfa1e96 100644 > --- a/drivers/opp/core.c > +++ b/drivers/opp/core.c > @@ -751,8 +751,11 @@ static int _set_required_opps(struct device *dev, > * @dev: device for which we do this operation > * @target_freq: frequency to achieve > * > - * This configures the power-supplies and clock source to the levels specified > - * by the OPP corresponding to the target_freq. > + * This configures the power-supplies to the levels specified by the OPP > + * corresponding to the target_freq, and programs the clock to a value <= > + * target_freq, as rounded by clk_round_rate(). Device wanting to run at fmax > + * provided by the opp, should have already rounded to the target OPP's > + * frequency. > */ Perfect, thanks.
diff --git a/drivers/opp/core.c b/drivers/opp/core.c index 0420f7e8ad5b..bc9a7762dd4c 100644 --- a/drivers/opp/core.c +++ b/drivers/opp/core.c @@ -703,7 +703,7 @@ static int _set_required_opps(struct device *dev, int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq) { struct opp_table *opp_table; - unsigned long freq, old_freq; + unsigned long freq, opp_freq, old_freq, old_opp_freq; struct dev_pm_opp *old_opp, *opp; struct clk *clk; int ret; @@ -742,13 +742,15 @@ int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq) goto put_opp_table; } - old_opp = _find_freq_ceil(opp_table, &old_freq); + old_opp_freq = old_freq; + old_opp = _find_freq_ceil(opp_table, &old_opp_freq); if (IS_ERR(old_opp)) { dev_err(dev, "%s: failed to find current OPP for freq %lu (%ld)\n", __func__, old_freq, PTR_ERR(old_opp)); } - opp = _find_freq_ceil(opp_table, &freq); + opp_freq = freq; + opp = _find_freq_ceil(opp_table, &opp_freq); if (IS_ERR(opp)) { ret = PTR_ERR(opp); dev_err(dev, "%s: failed to find OPP for freq %lu (%d)\n",