Message ID | f1773fdc-f6ef-ec28-0c0a-4a09e66ab63b@huawei.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
Headers | show |
Series | [Question] : about 'cpuinfo_cur_freq' shown in sysfs when the CPU is in idle state | expand |
On 2020/6/2 11:34, Xiongfeng Wang wrote: > Hi Viresh, > > Sorry to disturb you about another problem as follows. > > CPPC use the increment of Desired Performance counter and Reference Performance > counter to get the CPU frequency and show it in sysfs through > 'cpuinfo_cur_freq'. But ACPI CPPC doesn't specifically define the behavior of > these two counters when the CPU is in idle state, such as stop incrementing when > the CPU is in idle state. > > ARMv8.4 Extension inctroduced support for the Activity Monitors Unit (AMU). The > processor frequency cycles and constant frequency cycles in AMU can be used as > Delivered Performance counter and Reference Performance counter. These two > counter in AMU does not increase when the PE is in WFI or WFE. So the increment > is zero when the PE is in WFI/WFE. This cause no issue because > 'cppc_get_rate_from_fbctrs()' in cppc_cpufreq driver will check the increment > and return the desired performance if the increment is zero. > > But when the CPU goes into power down idle state, accessing these two counters > in AMU by memory-mapped address will return zero. Such as CPU1 went into power > down idle state and CPU0 try to get the frequency of CPU1. In this situation, > will display a very big value for 'cpuinfo_cur_freq' in sysfs. Do you have some > advice about this problem ? Just a wild guess, how about just return 0 for idle CPUs? which means the frequency is 0 for idle CPUs. > > I was thinking about an idea as follows. We can run 'cppc_cpufreq_get_rate()' on > the CPU to be measured, so that we can make sure the CPU is in C0 state when we > access the two counters. Also we can return the actual frequency rather than > desired performance when the CPU is in WFI/WFE. But this modification will > change the existing logical and I am not sure if this will cause some bad effect. > > > diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c > index 257d726..ded3bcc 100644 > --- a/drivers/cpufreq/cppc_cpufreq.c > +++ b/drivers/cpufreq/cppc_cpufreq.c > @@ -396,9 +396,10 @@ static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu, > return cppc_cpufreq_perf_to_khz(cpu, delivered_perf); > } > > -static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum) > +static int cppc_cpufreq_get_rate_cpu(void *info) > { > struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0}; > + unsigned int cpunum = *(unsigned int *)info; > struct cppc_cpudata *cpu = all_cpu_data[cpunum]; > int ret; > > @@ -418,6 +419,22 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum) > return cppc_get_rate_from_fbctrs(cpu, fb_ctrs_t0, fb_ctrs_t1); > } > > +static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum) > +{ > + unsigned int ret; > + > + ret = smp_call_on_cpu(cpunum, cppc_cpufreq_get_rate_cpu, &cpunum, true); > + > + /* > + * convert negative error code to zero, otherwise we will display > + * an odd value for 'cpuinfo_cur_freq' in sysfs > + */ > + if (ret < 0) > + ret = 0; > + > + return ret; > +} > + > static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state) > { > struct cppc_cpudata *cpudata; > It will bring the CPU back if the CPU is in idle state, not friendly to powersaving :) Thanks Hanjun
On 02-06-20, 11:34, Xiongfeng Wang wrote: > Hi Viresh, > > Sorry to disturb you about another problem as follows. > > CPPC use the increment of Desired Performance counter and Reference Performance > counter to get the CPU frequency and show it in sysfs through > 'cpuinfo_cur_freq'. But ACPI CPPC doesn't specifically define the behavior of > these two counters when the CPU is in idle state, such as stop incrementing when > the CPU is in idle state. > > ARMv8.4 Extension inctroduced support for the Activity Monitors Unit (AMU). The > processor frequency cycles and constant frequency cycles in AMU can be used as > Delivered Performance counter and Reference Performance counter. These two > counter in AMU does not increase when the PE is in WFI or WFE. So the increment > is zero when the PE is in WFI/WFE. This cause no issue because > 'cppc_get_rate_from_fbctrs()' in cppc_cpufreq driver will check the increment > and return the desired performance if the increment is zero. > > But when the CPU goes into power down idle state, accessing these two counters > in AMU by memory-mapped address will return zero. Such as CPU1 went into power > down idle state and CPU0 try to get the frequency of CPU1. In this situation, > will display a very big value for 'cpuinfo_cur_freq' in sysfs. Do you have some > advice about this problem ? > > I was thinking about an idea as follows. We can run 'cppc_cpufreq_get_rate()' on > the CPU to be measured, so that we can make sure the CPU is in C0 state when we > access the two counters. Also we can return the actual frequency rather than > desired performance when the CPU is in WFI/WFE. But this modification will > change the existing logical and I am not sure if this will cause some bad effect. > > > diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c > index 257d726..ded3bcc 100644 > --- a/drivers/cpufreq/cppc_cpufreq.c > +++ b/drivers/cpufreq/cppc_cpufreq.c > @@ -396,9 +396,10 @@ static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu, > return cppc_cpufreq_perf_to_khz(cpu, delivered_perf); > } > > -static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum) > +static int cppc_cpufreq_get_rate_cpu(void *info) > { > struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0}; > + unsigned int cpunum = *(unsigned int *)info; > struct cppc_cpudata *cpu = all_cpu_data[cpunum]; > int ret; > > @@ -418,6 +419,22 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum) > return cppc_get_rate_from_fbctrs(cpu, fb_ctrs_t0, fb_ctrs_t1); > } > > +static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum) > +{ > + unsigned int ret; > + > + ret = smp_call_on_cpu(cpunum, cppc_cpufreq_get_rate_cpu, &cpunum, true); > + > + /* > + * convert negative error code to zero, otherwise we will display > + * an odd value for 'cpuinfo_cur_freq' in sysfs > + */ > + if (ret < 0) > + ret = 0; > + > + return ret; > +} > + > static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state) > { > struct cppc_cpudata *cpudata; I don't see any other sane solution, even if this brings the CPU back to normal state and waste power. We should be able to reliably provide value to userspace. Rafael / Sudeep: What you do say ?
On Wed, Jun 03, 2020 at 01:22:00PM +0530, Viresh Kumar wrote: > On 02-06-20, 11:34, Xiongfeng Wang wrote: > > Hi Viresh, > > > > Sorry to disturb you about another problem as follows. > > > > CPPC use the increment of Desired Performance counter and Reference Performance > > counter to get the CPU frequency and show it in sysfs through > > 'cpuinfo_cur_freq'. But ACPI CPPC doesn't specifically define the behavior of > > these two counters when the CPU is in idle state, such as stop incrementing when > > the CPU is in idle state. > > > > ARMv8.4 Extension inctroduced support for the Activity Monitors Unit (AMU). The > > processor frequency cycles and constant frequency cycles in AMU can be used as > > Delivered Performance counter and Reference Performance counter. These two > > counter in AMU does not increase when the PE is in WFI or WFE. So the increment > > is zero when the PE is in WFI/WFE. This cause no issue because > > 'cppc_get_rate_from_fbctrs()' in cppc_cpufreq driver will check the increment > > and return the desired performance if the increment is zero. > > > > But when the CPU goes into power down idle state, accessing these two counters > > in AMU by memory-mapped address will return zero. Such as CPU1 went into power > > down idle state and CPU0 try to get the frequency of CPU1. In this situation, > > will display a very big value for 'cpuinfo_cur_freq' in sysfs. Do you have some > > advice about this problem ? > > > > I was thinking about an idea as follows. We can run 'cppc_cpufreq_get_rate()' on > > the CPU to be measured, so that we can make sure the CPU is in C0 state when we > > access the two counters. Also we can return the actual frequency rather than > > desired performance when the CPU is in WFI/WFE. But this modification will > > change the existing logical and I am not sure if this will cause some bad effect. > > > > > > diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c > > index 257d726..ded3bcc 100644 > > --- a/drivers/cpufreq/cppc_cpufreq.c > > +++ b/drivers/cpufreq/cppc_cpufreq.c > > @@ -396,9 +396,10 @@ static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu, > > return cppc_cpufreq_perf_to_khz(cpu, delivered_perf); > > } > > > > -static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum) > > +static int cppc_cpufreq_get_rate_cpu(void *info) > > { > > struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0}; > > + unsigned int cpunum = *(unsigned int *)info; > > struct cppc_cpudata *cpu = all_cpu_data[cpunum]; > > int ret; > > > > @@ -418,6 +419,22 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum) > > return cppc_get_rate_from_fbctrs(cpu, fb_ctrs_t0, fb_ctrs_t1); > > } > > > > +static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum) > > +{ > > + unsigned int ret; > > + > > + ret = smp_call_on_cpu(cpunum, cppc_cpufreq_get_rate_cpu, &cpunum, true); > > + > > + /* > > + * convert negative error code to zero, otherwise we will display > > + * an odd value for 'cpuinfo_cur_freq' in sysfs > > + */ > > + if (ret < 0) > > + ret = 0; > > + > > + return ret; > > +} > > + > > static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state) > > { > > struct cppc_cpudata *cpudata; > > I don't see any other sane solution, even if this brings the CPU back > to normal state and waste power. We should be able to reliably provide > value to userspace. > > Rafael / Sudeep: What you do say ? Agreed on returning 0 as it aligns with the semantics followed. We can't return the last set/fetched value as it fails to align with the values returned when CPU is not idle. But I have another question. If we can detect that CPPC on some platforms rely on CPU registers(I assume FFH registers here and not system/io/... type of GAS registers), can we set dvfs_on_any_cpu(can't recall exact flag name) to false if not already done to prevent such issues. Or I am talking non-sense as it may be applicable only for _set operation and not _get.
On 03-06-20, 11:07, Sudeep Holla wrote: > But I have another question. If we can detect that CPPC on some platforms > rely on CPU registers(I assume FFH registers here and not system/io/... > type of GAS registers), can we set dvfs_on_any_cpu(can't recall exact > flag name) to false if not already done to prevent such issues. Or I am > talking non-sense as it may be applicable only for _set operation and Yes, non-sense :) > not _get.
On Wed, Jun 03, 2020 at 03:40:10PM +0530, Viresh Kumar wrote: > On 03-06-20, 11:07, Sudeep Holla wrote: > > But I have another question. If we can detect that CPPC on some platforms > > rely on CPU registers(I assume FFH registers here and not system/io/... > > type of GAS registers), can we set dvfs_on_any_cpu(can't recall exact > > flag name) to false if not already done to prevent such issues. Or I am > > talking non-sense as it may be applicable only for _set operation and > > Yes, non-sense :) > Thanks for confirming
On 03-06-20, 11:17, Sudeep Holla wrote: > On Wed, Jun 03, 2020 at 03:40:10PM +0530, Viresh Kumar wrote: > > On 03-06-20, 11:07, Sudeep Holla wrote: > > > But I have another question. If we can detect that CPPC on some platforms > > > rely on CPU registers(I assume FFH registers here and not system/io/... > > > type of GAS registers), can we set dvfs_on_any_cpu(can't recall exact > > > flag name) to false if not already done to prevent such issues. Or I am > > > talking non-sense as it may be applicable only for _set operation and > > > > Yes, non-sense :) > > > > Thanks for confirming
On Wed, Jun 03, 2020 at 03:51:59PM +0530, Viresh Kumar wrote: > On 03-06-20, 11:17, Sudeep Holla wrote: > > On Wed, Jun 03, 2020 at 03:40:10PM +0530, Viresh Kumar wrote: > > > On 03-06-20, 11:07, Sudeep Holla wrote: > > > > But I have another question. If we can detect that CPPC on some platforms > > > > rely on CPU registers(I assume FFH registers here and not system/io/... > > > > type of GAS registers), can we set dvfs_on_any_cpu(can't recall exact > > > > flag name) to false if not already done to prevent such issues. Or I am > > > > talking non-sense as it may be applicable only for _set operation and > > > > > > Yes, non-sense :) > > > > > > > Thanks for confirming
On Wed, Jun 3, 2020 at 9:52 AM Viresh Kumar <viresh.kumar@linaro.org> wrote: > > On 02-06-20, 11:34, Xiongfeng Wang wrote: > > Hi Viresh, > > > > Sorry to disturb you about another problem as follows. > > > > CPPC use the increment of Desired Performance counter and Reference Performance > > counter to get the CPU frequency and show it in sysfs through > > 'cpuinfo_cur_freq'. But ACPI CPPC doesn't specifically define the behavior of > > these two counters when the CPU is in idle state, such as stop incrementing when > > the CPU is in idle state. > > > > ARMv8.4 Extension inctroduced support for the Activity Monitors Unit (AMU). The > > processor frequency cycles and constant frequency cycles in AMU can be used as > > Delivered Performance counter and Reference Performance counter. These two > > counter in AMU does not increase when the PE is in WFI or WFE. So the increment > > is zero when the PE is in WFI/WFE. This cause no issue because > > 'cppc_get_rate_from_fbctrs()' in cppc_cpufreq driver will check the increment > > and return the desired performance if the increment is zero. > > > > But when the CPU goes into power down idle state, accessing these two counters > > in AMU by memory-mapped address will return zero. Such as CPU1 went into power > > down idle state and CPU0 try to get the frequency of CPU1. In this situation, > > will display a very big value for 'cpuinfo_cur_freq' in sysfs. Do you have some > > advice about this problem ? > > > > I was thinking about an idea as follows. We can run 'cppc_cpufreq_get_rate()' on > > the CPU to be measured, so that we can make sure the CPU is in C0 state when we > > access the two counters. Also we can return the actual frequency rather than > > desired performance when the CPU is in WFI/WFE. But this modification will > > change the existing logical and I am not sure if this will cause some bad effect. > > > > > > diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c > > index 257d726..ded3bcc 100644 > > --- a/drivers/cpufreq/cppc_cpufreq.c > > +++ b/drivers/cpufreq/cppc_cpufreq.c > > @@ -396,9 +396,10 @@ static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu, > > return cppc_cpufreq_perf_to_khz(cpu, delivered_perf); > > } > > > > -static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum) > > +static int cppc_cpufreq_get_rate_cpu(void *info) > > { > > struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0}; > > + unsigned int cpunum = *(unsigned int *)info; > > struct cppc_cpudata *cpu = all_cpu_data[cpunum]; > > int ret; > > > > @@ -418,6 +419,22 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum) > > return cppc_get_rate_from_fbctrs(cpu, fb_ctrs_t0, fb_ctrs_t1); > > } > > > > +static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum) > > +{ > > + unsigned int ret; > > + > > + ret = smp_call_on_cpu(cpunum, cppc_cpufreq_get_rate_cpu, &cpunum, true); > > + > > + /* > > + * convert negative error code to zero, otherwise we will display > > + * an odd value for 'cpuinfo_cur_freq' in sysfs > > + */ > > + if (ret < 0) > > + ret = 0; > > + > > + return ret; > > +} > > + > > static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state) > > { > > struct cppc_cpudata *cpudata; > > I don't see any other sane solution, even if this brings the CPU back > to normal state and waste power. We should be able to reliably provide > value to userspace. > > Rafael / Sudeep: What you do say ? The frequency value obtained by kicking the CPU out of idle artificially is bogus, though. You may as well return a random number instead. The frequency of a CPU in an idle state is in fact unknown in the case at hand, so returning 0 looks like the cleanest option to me. Thanks!
Hi Rafael, Thanks for your reply ! On 2020/6/3 21:39, Rafael J. Wysocki wrote: > On Wed, Jun 3, 2020 at 9:52 AM Viresh Kumar <viresh.kumar@linaro.org> wrote: >> >> On 02-06-20, 11:34, Xiongfeng Wang wrote: >>> Hi Viresh, >>> >>> Sorry to disturb you about another problem as follows. >>> >>> CPPC use the increment of Desired Performance counter and Reference Performance >>> counter to get the CPU frequency and show it in sysfs through >>> 'cpuinfo_cur_freq'. But ACPI CPPC doesn't specifically define the behavior of >>> these two counters when the CPU is in idle state, such as stop incrementing when >>> the CPU is in idle state. >>> >>> ARMv8.4 Extension inctroduced support for the Activity Monitors Unit (AMU). The >>> processor frequency cycles and constant frequency cycles in AMU can be used as >>> Delivered Performance counter and Reference Performance counter. These two >>> counter in AMU does not increase when the PE is in WFI or WFE. So the increment >>> is zero when the PE is in WFI/WFE. This cause no issue because >>> 'cppc_get_rate_from_fbctrs()' in cppc_cpufreq driver will check the increment >>> and return the desired performance if the increment is zero. >>> >>> But when the CPU goes into power down idle state, accessing these two counters >>> in AMU by memory-mapped address will return zero. Such as CPU1 went into power >>> down idle state and CPU0 try to get the frequency of CPU1. In this situation, >>> will display a very big value for 'cpuinfo_cur_freq' in sysfs. Do you have some >>> advice about this problem ? >>> >>> I was thinking about an idea as follows. We can run 'cppc_cpufreq_get_rate()' on >>> the CPU to be measured, so that we can make sure the CPU is in C0 state when we >>> access the two counters. Also we can return the actual frequency rather than >>> desired performance when the CPU is in WFI/WFE. But this modification will >>> change the existing logical and I am not sure if this will cause some bad effect. >>> >>> >>> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c >>> index 257d726..ded3bcc 100644 >>> --- a/drivers/cpufreq/cppc_cpufreq.c >>> +++ b/drivers/cpufreq/cppc_cpufreq.c >>> @@ -396,9 +396,10 @@ static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu, >>> return cppc_cpufreq_perf_to_khz(cpu, delivered_perf); >>> } >>> >>> -static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum) >>> +static int cppc_cpufreq_get_rate_cpu(void *info) >>> { >>> struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0}; >>> + unsigned int cpunum = *(unsigned int *)info; >>> struct cppc_cpudata *cpu = all_cpu_data[cpunum]; >>> int ret; >>> >>> @@ -418,6 +419,22 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum) >>> return cppc_get_rate_from_fbctrs(cpu, fb_ctrs_t0, fb_ctrs_t1); >>> } >>> >>> +static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum) >>> +{ >>> + unsigned int ret; >>> + >>> + ret = smp_call_on_cpu(cpunum, cppc_cpufreq_get_rate_cpu, &cpunum, true); >>> + >>> + /* >>> + * convert negative error code to zero, otherwise we will display >>> + * an odd value for 'cpuinfo_cur_freq' in sysfs >>> + */ >>> + if (ret < 0) >>> + ret = 0; >>> + >>> + return ret; >>> +} >>> + >>> static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state) >>> { >>> struct cppc_cpudata *cpudata; >> >> I don't see any other sane solution, even if this brings the CPU back >> to normal state and waste power. We should be able to reliably provide >> value to userspace. >> >> Rafael / Sudeep: What you do say ? > > The frequency value obtained by kicking the CPU out of idle > artificially is bogus, though. You may as well return a random number > instead. Yes, it may return a randowm number as well. > > The frequency of a CPU in an idle state is in fact unknown in the case > at hand, so returning 0 looks like the cleanest option to me. I am not sure about how the user will use 'cpuinfo_cur_freq' in sysfs. If I return 0 when the CPU is idle, when I run a light load on the CPU, I will get a zero value for 'cpuinfo_cur_freq' when the CPU is idle. When the CPU is not idle, I will get a non-zero value. The user may feel odd about 'cpuinfo_cur_frreq' switching between a zero value and a non-zero value. They may hope it can return the frequency when the CPU execute instructions, namely in C0 state. I am not so sure about the user will look at 'cpuinfo_cur_freq'. Thanks, Xiongfeng > > Thanks! > > . >
On 04-06-20, 09:32, Xiongfeng Wang wrote: > On 2020/6/3 21:39, Rafael J. Wysocki wrote: > > The frequency value obtained by kicking the CPU out of idle > > artificially is bogus, though. You may as well return a random number > > instead. > > Yes, it may return a randowm number as well. > > > > > The frequency of a CPU in an idle state is in fact unknown in the case > > at hand, so returning 0 looks like the cleanest option to me. > > I am not sure about how the user will use 'cpuinfo_cur_freq' in sysfs. If I > return 0 when the CPU is idle, when I run a light load on the CPU, I will get a > zero value for 'cpuinfo_cur_freq' when the CPU is idle. When the CPU is not > idle, I will get a non-zero value. The user may feel odd about > 'cpuinfo_cur_frreq' switching between a zero value and a non-zero value. They > may hope it can return the frequency when the CPU execute instructions, namely > in C0 state. I am not so sure about the user will look at 'cpuinfo_cur_freq'. This is what I was worried about as well. The interface to sysfs needs to be robust. Returning frequency on some readings and 0 on others doesn't look right to me as well. This will break scripts (I am not sure if some scripts are there to look for these values) with the randomness of values returned by it. On reading values locally from the CPU, I thought about the case where userspace can prevent a CPU going into idle just by reading its frequency from sysfs (and so waste power), but the same can be done by userspace to run arbitrary load on the CPUs. Can we do some sort of caching of the last frequency the CPU was running at before going into idle ? Then we can just check if cpu is idle and so return cached value.
On Thu, Jun 4, 2020 at 6:41 AM Viresh Kumar <viresh.kumar@linaro.org> wrote: > > On 04-06-20, 09:32, Xiongfeng Wang wrote: > > On 2020/6/3 21:39, Rafael J. Wysocki wrote: > > > The frequency value obtained by kicking the CPU out of idle > > > artificially is bogus, though. You may as well return a random number > > > instead. > > > > Yes, it may return a randowm number as well. > > > > > > > > The frequency of a CPU in an idle state is in fact unknown in the case > > > at hand, so returning 0 looks like the cleanest option to me. > > > > I am not sure about how the user will use 'cpuinfo_cur_freq' in sysfs. If I > > return 0 when the CPU is idle, when I run a light load on the CPU, I will get a > > zero value for 'cpuinfo_cur_freq' when the CPU is idle. When the CPU is not > > idle, I will get a non-zero value. The user may feel odd about > > 'cpuinfo_cur_frreq' switching between a zero value and a non-zero value. They > > may hope it can return the frequency when the CPU execute instructions, namely > > in C0 state. I am not so sure about the user will look at 'cpuinfo_cur_freq'. > > This is what I was worried about as well. The interface to sysfs needs > to be robust. Returning frequency on some readings and 0 on others > doesn't look right to me as well. This will break scripts (I am not > sure if some scripts are there to look for these values) with the > randomness of values returned by it. The only thing the scripts need to do is to skip zeros (or anything less than the minimum hw frequency for that matter) coming from that attribute. > On reading values locally from the CPU, I thought about the case where > userspace can prevent a CPU going into idle just by reading its > frequency from sysfs (and so waste power), but the same can be done by > userspace to run arbitrary load on the CPUs. > > Can we do some sort of caching of the last frequency the CPU was > running at before going into idle ? Then we can just check if cpu is > idle and so return cached value. That is an option, but it looks like in this case the cpuinfo_cur_freq attribute should not be present at all, as per the documentation. Cheers!
On Thu, Jun 04, 2020 at 12:42:06PM +0200, Rafael J. Wysocki wrote: > On Thu, Jun 4, 2020 at 6:41 AM Viresh Kumar <viresh.kumar@linaro.org> wrote: > > > > On 04-06-20, 09:32, Xiongfeng Wang wrote: > > > On 2020/6/3 21:39, Rafael J. Wysocki wrote: > > > > The frequency value obtained by kicking the CPU out of idle > > > > artificially is bogus, though. You may as well return a random number > > > > instead. > > > > > > Yes, it may return a randowm number as well. > > > > > > > > > > > The frequency of a CPU in an idle state is in fact unknown in the case > > > > at hand, so returning 0 looks like the cleanest option to me. > > > > > > I am not sure about how the user will use 'cpuinfo_cur_freq' in sysfs. If I > > > return 0 when the CPU is idle, when I run a light load on the CPU, I will get a > > > zero value for 'cpuinfo_cur_freq' when the CPU is idle. When the CPU is not > > > idle, I will get a non-zero value. The user may feel odd about > > > 'cpuinfo_cur_frreq' switching between a zero value and a non-zero value. They > > > may hope it can return the frequency when the CPU execute instructions, namely > > > in C0 state. I am not so sure about the user will look at 'cpuinfo_cur_freq'. > > > > This is what I was worried about as well. The interface to sysfs needs > > to be robust. Returning frequency on some readings and 0 on others > > doesn't look right to me as well. This will break scripts (I am not > > sure if some scripts are there to look for these values) with the > > randomness of values returned by it. > > The only thing the scripts need to do is to skip zeros (or anything > less than the minimum hw frequency for that matter) coming from that > attribute. > > > On reading values locally from the CPU, I thought about the case where > > userspace can prevent a CPU going into idle just by reading its > > frequency from sysfs (and so waste power), but the same can be done by > > userspace to run arbitrary load on the CPUs. > > > > Can we do some sort of caching of the last frequency the CPU was > > running at before going into idle ? Then we can just check if cpu is > > idle and so return cached value. > > That is an option, but it looks like in this case the cpuinfo_cur_freq > attribute should not be present at all, as per the documentation. > +1 for dropping the attribute.
Hi guys, Sorry for showing up late to the party, I was on holiday last week. On Thursday 04 Jun 2020 at 13:58:22 (+0100), Sudeep Holla wrote: > On Thu, Jun 04, 2020 at 12:42:06PM +0200, Rafael J. Wysocki wrote: > > On Thu, Jun 4, 2020 at 6:41 AM Viresh Kumar <viresh.kumar@linaro.org> wrote: > > > > > > On 04-06-20, 09:32, Xiongfeng Wang wrote: > > > > On 2020/6/3 21:39, Rafael J. Wysocki wrote: > > > > > The frequency value obtained by kicking the CPU out of idle > > > > > artificially is bogus, though. You may as well return a random number > > > > > instead. > > > > > > > > Yes, it may return a randowm number as well. > > > > > > > > > > > > > > The frequency of a CPU in an idle state is in fact unknown in the case > > > > > at hand, so returning 0 looks like the cleanest option to me. > > > > > > > > I am not sure about how the user will use 'cpuinfo_cur_freq' in sysfs. If I > > > > return 0 when the CPU is idle, when I run a light load on the CPU, I will get a > > > > zero value for 'cpuinfo_cur_freq' when the CPU is idle. When the CPU is not > > > > idle, I will get a non-zero value. The user may feel odd about > > > > 'cpuinfo_cur_frreq' switching between a zero value and a non-zero value. They > > > > may hope it can return the frequency when the CPU execute instructions, namely > > > > in C0 state. I am not so sure about the user will look at 'cpuinfo_cur_freq'. > > > > > > This is what I was worried about as well. The interface to sysfs needs > > > to be robust. Returning frequency on some readings and 0 on others > > > doesn't look right to me as well. This will break scripts (I am not > > > sure if some scripts are there to look for these values) with the > > > randomness of values returned by it. > > > > The only thing the scripts need to do is to skip zeros (or anything > > less than the minimum hw frequency for that matter) coming from that > > attribute. > > > > > On reading values locally from the CPU, I thought about the case where > > > userspace can prevent a CPU going into idle just by reading its > > > frequency from sysfs (and so waste power), but the same can be done by > > > userspace to run arbitrary load on the CPUs. > > > > > > Can we do some sort of caching of the last frequency the CPU was > > > running at before going into idle ? Then we can just check if cpu is > > > idle and so return cached value. > > > > That is an option, but it looks like in this case the cpuinfo_cur_freq > > attribute should not be present at all, as per the documentation. > > > > +1 for dropping the attribute. > I've been experimenting with some code quite recently that uses the scheduler frequency scale factor to compute this hardware current rate for CPPC. On the scheduler tick, the scale factor is computed in arch_scale_freq_tick() to give an indication on delivered performance, using AMUs on arm64 [1] and APERF/MPERF on x86 [2]. Basically, this scale factor has the cached value of the average delivered performance between the last two scheduler ticks, on a capacity scale: 0-1024. All that would be needed is to convert from the scheduler frequency scale to the CPPC expected performance scale. The gist of the code would be: delivered_perf = topology_get_freq_scale(cpu); delivered_perf *= fb_ctrs.reference_perf; delivered_perf = div64_u64(delivered_perf << SCHED_CAPACITY_SHIFT, per_cpu(arch_max_freq_scale, cpu)); While this solution is not perfect, it would provide the best view of the hardware "current" rate without the cost of waking up the CPU when idle, scheduling additional work on the CPU, doing checks on whether the CPU is idle and/or providing other caching mechanisms. Do you think such an implementation could make cpuinfo_cur_freq worth keeping? I'm happy to push the patches for this and discuss the details there. Thanks, Ionela. [1] https://lkml.org/lkml/2020/3/5/183 [2] https://lkml.org/lkml/2020/1/22/1039 > -- > Regards, > Sudeep
Hi Ionela, Thanks for your reply ! On 2020/6/10 17:40, Ionela Voinescu wrote: > Hi guys, > > Sorry for showing up late to the party, I was on holiday last week. > > On Thursday 04 Jun 2020 at 13:58:22 (+0100), Sudeep Holla wrote: >> On Thu, Jun 04, 2020 at 12:42:06PM +0200, Rafael J. Wysocki wrote: >>> On Thu, Jun 4, 2020 at 6:41 AM Viresh Kumar <viresh.kumar@linaro.org> wrote: >>>> >>>> On 04-06-20, 09:32, Xiongfeng Wang wrote: >>>>> On 2020/6/3 21:39, Rafael J. Wysocki wrote: >>>>>> The frequency value obtained by kicking the CPU out of idle >>>>>> artificially is bogus, though. You may as well return a random number >>>>>> instead. >>>>> >>>>> Yes, it may return a randowm number as well. >>>>> >>>>>> >>>>>> The frequency of a CPU in an idle state is in fact unknown in the case >>>>>> at hand, so returning 0 looks like the cleanest option to me. >>>>> >>>>> I am not sure about how the user will use 'cpuinfo_cur_freq' in sysfs. If I >>>>> return 0 when the CPU is idle, when I run a light load on the CPU, I will get a >>>>> zero value for 'cpuinfo_cur_freq' when the CPU is idle. When the CPU is not >>>>> idle, I will get a non-zero value. The user may feel odd about >>>>> 'cpuinfo_cur_frreq' switching between a zero value and a non-zero value. They >>>>> may hope it can return the frequency when the CPU execute instructions, namely >>>>> in C0 state. I am not so sure about the user will look at 'cpuinfo_cur_freq'. >>>> >>>> This is what I was worried about as well. The interface to sysfs needs >>>> to be robust. Returning frequency on some readings and 0 on others >>>> doesn't look right to me as well. This will break scripts (I am not >>>> sure if some scripts are there to look for these values) with the >>>> randomness of values returned by it. >>> >>> The only thing the scripts need to do is to skip zeros (or anything >>> less than the minimum hw frequency for that matter) coming from that >>> attribute. >>> >>>> On reading values locally from the CPU, I thought about the case where >>>> userspace can prevent a CPU going into idle just by reading its >>>> frequency from sysfs (and so waste power), but the same can be done by >>>> userspace to run arbitrary load on the CPUs. >>>> >>>> Can we do some sort of caching of the last frequency the CPU was >>>> running at before going into idle ? Then we can just check if cpu is >>>> idle and so return cached value. >>> >>> That is an option, but it looks like in this case the cpuinfo_cur_freq >>> attribute should not be present at all, as per the documentation. >>> >> >> +1 for dropping the attribute. >> > > I've been experimenting with some code quite recently that uses the > scheduler frequency scale factor to compute this hardware current rate > for CPPC. > > On the scheduler tick, the scale factor is computed in > arch_scale_freq_tick() to give an indication on delivered performance, > using AMUs on arm64 [1] and APERF/MPERF on x86 [2]. Basically, this scale > factor has the cached value of the average delivered performance between > the last two scheduler ticks, on a capacity scale: 0-1024. All that would > be needed is to convert from the scheduler frequency scale to the CPPC > expected performance scale. > > The gist of the code would be: > > delivered_perf = topology_get_freq_scale(cpu); > delivered_perf *= fb_ctrs.reference_perf; > delivered_perf = div64_u64(delivered_perf << SCHED_CAPACITY_SHIFT, > per_cpu(arch_max_freq_scale, cpu)); > > While this solution is not perfect, it would provide the best view of > the hardware "current" rate without the cost of waking up the CPU when > idle, scheduling additional work on the CPU, doing checks on whether > the CPU is idle and/or providing other caching mechanisms. I think it's a good idea. It's just that the value is a average frequency in the last two scheduler ticks, not an instantaneous frequency. What 'cppc_cpufreq_get_rate()' get is also not an instantaneous frequency. It's a average frequency in 2us. I check the interval between two frequency updates on my machine. It's about 10ms. So the frequency will change at least one time in two scheduler ticks if HZ is 1000. I am not sure how people will use 'cpuinfo_cur_freq'. They may not expect a very accurate frequency. How about we return this value when CPU is idle? Or just return 0 in idle is better ? > > Do you think such an implementation could make cpuinfo_cur_freq worth > keeping? > > I'm happy to push the patches for this and discuss the details there. Thanks. I'm happy to see the patches and give some help. Thanks, Xiongfeng > > Thanks, > Ionela. > > > [1] https://lkml.org/lkml/2020/3/5/183 > [2] https://lkml.org/lkml/2020/1/22/1039 > >> -- >> Regards, >> Sudeep > > . >
On Thu, Jun 11, 2020 at 3:52 AM Xiongfeng Wang <wangxiongfeng2@huawei.com> wrote: > > Hi Ionela, > > Thanks for your reply ! > > On 2020/6/10 17:40, Ionela Voinescu wrote: > > Hi guys, > > > > Sorry for showing up late to the party, I was on holiday last week. > > > > On Thursday 04 Jun 2020 at 13:58:22 (+0100), Sudeep Holla wrote: > >> On Thu, Jun 04, 2020 at 12:42:06PM +0200, Rafael J. Wysocki wrote: > >>> On Thu, Jun 4, 2020 at 6:41 AM Viresh Kumar <viresh.kumar@linaro.org> wrote: > >>>> > >>>> On 04-06-20, 09:32, Xiongfeng Wang wrote: > >>>>> On 2020/6/3 21:39, Rafael J. Wysocki wrote: > >>>>>> The frequency value obtained by kicking the CPU out of idle > >>>>>> artificially is bogus, though. You may as well return a random number > >>>>>> instead. > >>>>> > >>>>> Yes, it may return a randowm number as well. > >>>>> > >>>>>> > >>>>>> The frequency of a CPU in an idle state is in fact unknown in the case > >>>>>> at hand, so returning 0 looks like the cleanest option to me. > >>>>> > >>>>> I am not sure about how the user will use 'cpuinfo_cur_freq' in sysfs. If I > >>>>> return 0 when the CPU is idle, when I run a light load on the CPU, I will get a > >>>>> zero value for 'cpuinfo_cur_freq' when the CPU is idle. When the CPU is not > >>>>> idle, I will get a non-zero value. The user may feel odd about > >>>>> 'cpuinfo_cur_frreq' switching between a zero value and a non-zero value. They > >>>>> may hope it can return the frequency when the CPU execute instructions, namely > >>>>> in C0 state. I am not so sure about the user will look at 'cpuinfo_cur_freq'. > >>>> > >>>> This is what I was worried about as well. The interface to sysfs needs > >>>> to be robust. Returning frequency on some readings and 0 on others > >>>> doesn't look right to me as well. This will break scripts (I am not > >>>> sure if some scripts are there to look for these values) with the > >>>> randomness of values returned by it. > >>> > >>> The only thing the scripts need to do is to skip zeros (or anything > >>> less than the minimum hw frequency for that matter) coming from that > >>> attribute. > >>> > >>>> On reading values locally from the CPU, I thought about the case where > >>>> userspace can prevent a CPU going into idle just by reading its > >>>> frequency from sysfs (and so waste power), but the same can be done by > >>>> userspace to run arbitrary load on the CPUs. > >>>> > >>>> Can we do some sort of caching of the last frequency the CPU was > >>>> running at before going into idle ? Then we can just check if cpu is > >>>> idle and so return cached value. > >>> > >>> That is an option, but it looks like in this case the cpuinfo_cur_freq > >>> attribute should not be present at all, as per the documentation. > >>> > >> > >> +1 for dropping the attribute. > >> > > > > I've been experimenting with some code quite recently that uses the > > scheduler frequency scale factor to compute this hardware current rate > > for CPPC. > > > > On the scheduler tick, the scale factor is computed in > > arch_scale_freq_tick() to give an indication on delivered performance, > > using AMUs on arm64 [1] and APERF/MPERF on x86 [2]. Basically, this scale > > factor has the cached value of the average delivered performance between > > the last two scheduler ticks, on a capacity scale: 0-1024. All that would > > be needed is to convert from the scheduler frequency scale to the CPPC > > expected performance scale. > > > > The gist of the code would be: > > > > delivered_perf = topology_get_freq_scale(cpu); > > delivered_perf *= fb_ctrs.reference_perf; > > delivered_perf = div64_u64(delivered_perf << SCHED_CAPACITY_SHIFT, > > per_cpu(arch_max_freq_scale, cpu)); > > > > While this solution is not perfect, it would provide the best view of > > the hardware "current" rate without the cost of waking up the CPU when > > idle, scheduling additional work on the CPU, doing checks on whether > > the CPU is idle and/or providing other caching mechanisms. > > I think it's a good idea. It's just that the value is a average frequency in the > last two scheduler ticks, not an instantaneous frequency. What > 'cppc_cpufreq_get_rate()' get is also not an instantaneous frequency. It's a > average frequency in 2us. I check the interval between two frequency updates on > my machine. It's about 10ms. So the frequency will change at least one time in > two scheduler ticks if HZ is 1000. > > I am not sure how people will use 'cpuinfo_cur_freq'. They may not expect a very > accurate frequency. How about we return this value when CPU is idle? Or just > return 0 in idle is better ? According to the documentation, if this attribute is present, it should return the exact frequency of the CPU (with a caveat that it may change already when user space is able to consume the value), and which is what the users may expect. As I said before, IMO the CPPC driver should not cause the core to expose cpuinfo_cur_freq at all. Thanks!
diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c index 257d726..ded3bcc 100644 --- a/drivers/cpufreq/cppc_cpufreq.c +++ b/drivers/cpufreq/cppc_cpufreq.c @@ -396,9 +396,10 @@ static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu, return cppc_cpufreq_perf_to_khz(cpu, delivered_perf); } -static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum) +static int cppc_cpufreq_get_rate_cpu(void *info) { struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0}; + unsigned int cpunum = *(unsigned int *)info; struct cppc_cpudata *cpu = all_cpu_data[cpunum]; int ret; @@ -418,6 +419,22 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum) return cppc_get_rate_from_fbctrs(cpu, fb_ctrs_t0, fb_ctrs_t1); } +static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum) +{ + unsigned int ret; + + ret = smp_call_on_cpu(cpunum, cppc_cpufreq_get_rate_cpu, &cpunum, true); + + /* + * convert negative error code to zero, otherwise we will display + * an odd value for 'cpuinfo_cur_freq' in sysfs + */ + if (ret < 0) + ret = 0; + + return ret; +} + static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state) { struct cppc_cpudata *cpudata;