diff mbox series

[Question] : about 'cpuinfo_cur_freq' shown in sysfs when the CPU is in idle state

Message ID f1773fdc-f6ef-ec28-0c0a-4a09e66ab63b@huawei.com (mailing list archive)
State Not Applicable, archived
Headers show
Series [Question] : about 'cpuinfo_cur_freq' shown in sysfs when the CPU is in idle state | expand

Commit Message

Xiongfeng Wang June 2, 2020, 3:34 a.m. UTC
Hi Viresh,

Sorry to disturb you about another problem as follows.

CPPC use the increment of Desired Performance counter and Reference Performance
counter to get the CPU frequency and show it in sysfs through
'cpuinfo_cur_freq'. But ACPI CPPC doesn't specifically define the behavior of
these two counters when the CPU is in idle state, such as stop incrementing when
the CPU is in idle state.

ARMv8.4 Extension inctroduced support for the Activity Monitors Unit (AMU). The
processor frequency cycles and constant frequency cycles in AMU can be used as
Delivered Performance counter and Reference Performance counter. These two
counter in AMU does not increase when the PE is in WFI or WFE. So the increment
is zero when the PE is in WFI/WFE. This cause no issue because
'cppc_get_rate_from_fbctrs()' in cppc_cpufreq driver will check the increment
and return the desired performance if the increment is zero.

But when the CPU goes into power down idle state, accessing these two counters
in AMU by memory-mapped address will return zero. Such as CPU1 went into power
down idle state and CPU0 try to get the frequency of CPU1. In this situation,
will display a very big value for 'cpuinfo_cur_freq' in sysfs. Do you have some
advice about this problem ?

I was thinking about an idea as follows. We can run 'cppc_cpufreq_get_rate()' on
the CPU to be measured, so that we can make sure the CPU is in C0 state when we
access the two counters. Also we can return the actual frequency rather than
desired performance when the CPU is in WFI/WFE. But this modification will
change the existing logical and I am not sure if this will cause some bad effect.




Thanks,
Xiongfeng

Comments

Hanjun Guo June 3, 2020, 2:05 a.m. UTC | #1
On 2020/6/2 11:34, Xiongfeng Wang wrote:
> Hi Viresh,
> 
> Sorry to disturb you about another problem as follows.
> 
> CPPC use the increment of Desired Performance counter and Reference Performance
> counter to get the CPU frequency and show it in sysfs through
> 'cpuinfo_cur_freq'. But ACPI CPPC doesn't specifically define the behavior of
> these two counters when the CPU is in idle state, such as stop incrementing when
> the CPU is in idle state.
> 
> ARMv8.4 Extension inctroduced support for the Activity Monitors Unit (AMU). The
> processor frequency cycles and constant frequency cycles in AMU can be used as
> Delivered Performance counter and Reference Performance counter. These two
> counter in AMU does not increase when the PE is in WFI or WFE. So the increment
> is zero when the PE is in WFI/WFE. This cause no issue because
> 'cppc_get_rate_from_fbctrs()' in cppc_cpufreq driver will check the increment
> and return the desired performance if the increment is zero.
> 
> But when the CPU goes into power down idle state, accessing these two counters
> in AMU by memory-mapped address will return zero. Such as CPU1 went into power
> down idle state and CPU0 try to get the frequency of CPU1. In this situation,
> will display a very big value for 'cpuinfo_cur_freq' in sysfs. Do you have some
> advice about this problem ?

Just a wild guess, how about just return 0 for idle CPUs? which means
the frequency is 0 for idle CPUs.

> 
> I was thinking about an idea as follows. We can run 'cppc_cpufreq_get_rate()' on
> the CPU to be measured, so that we can make sure the CPU is in C0 state when we
> access the two counters. Also we can return the actual frequency rather than
> desired performance when the CPU is in WFI/WFE. But this modification will
> change the existing logical and I am not sure if this will cause some bad effect.
> 
> 
> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
> index 257d726..ded3bcc 100644
> --- a/drivers/cpufreq/cppc_cpufreq.c
> +++ b/drivers/cpufreq/cppc_cpufreq.c
> @@ -396,9 +396,10 @@ static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu,
>          return cppc_cpufreq_perf_to_khz(cpu, delivered_perf);
>   }
> 
> -static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
> +static int cppc_cpufreq_get_rate_cpu(void *info)
>   {
>          struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0};
> + unsigned int cpunum = *(unsigned int *)info;
>          struct cppc_cpudata *cpu = all_cpu_data[cpunum];
>          int ret;
> 
> @@ -418,6 +419,22 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
>          return cppc_get_rate_from_fbctrs(cpu, fb_ctrs_t0, fb_ctrs_t1);
>   }
> 
> +static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
> +{
> + unsigned int ret;
> +
> + ret = smp_call_on_cpu(cpunum, cppc_cpufreq_get_rate_cpu, &cpunum, true);
> +
> + /*
> +  * convert negative error code to zero, otherwise we will display
> +  * an odd value for 'cpuinfo_cur_freq' in sysfs
> +  */
> + if (ret < 0)
> +         ret = 0;
> +
> + return ret;
> +}
> +
>   static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state)
>   {
>          struct cppc_cpudata *cpudata;
> 

It will bring the CPU back if the CPU is in idle state, not friendly to
powersaving :)

Thanks
Hanjun
Viresh Kumar June 3, 2020, 7:52 a.m. UTC | #2
On 02-06-20, 11:34, Xiongfeng Wang wrote:
> Hi Viresh,
> 
> Sorry to disturb you about another problem as follows.
> 
> CPPC use the increment of Desired Performance counter and Reference Performance
> counter to get the CPU frequency and show it in sysfs through
> 'cpuinfo_cur_freq'. But ACPI CPPC doesn't specifically define the behavior of
> these two counters when the CPU is in idle state, such as stop incrementing when
> the CPU is in idle state.
> 
> ARMv8.4 Extension inctroduced support for the Activity Monitors Unit (AMU). The
> processor frequency cycles and constant frequency cycles in AMU can be used as
> Delivered Performance counter and Reference Performance counter. These two
> counter in AMU does not increase when the PE is in WFI or WFE. So the increment
> is zero when the PE is in WFI/WFE. This cause no issue because
> 'cppc_get_rate_from_fbctrs()' in cppc_cpufreq driver will check the increment
> and return the desired performance if the increment is zero.
> 
> But when the CPU goes into power down idle state, accessing these two counters
> in AMU by memory-mapped address will return zero. Such as CPU1 went into power
> down idle state and CPU0 try to get the frequency of CPU1. In this situation,
> will display a very big value for 'cpuinfo_cur_freq' in sysfs. Do you have some
> advice about this problem ?
> 
> I was thinking about an idea as follows. We can run 'cppc_cpufreq_get_rate()' on
> the CPU to be measured, so that we can make sure the CPU is in C0 state when we
> access the two counters. Also we can return the actual frequency rather than
> desired performance when the CPU is in WFI/WFE. But this modification will
> change the existing logical and I am not sure if this will cause some bad effect.
> 
> 
> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
> index 257d726..ded3bcc 100644
> --- a/drivers/cpufreq/cppc_cpufreq.c
> +++ b/drivers/cpufreq/cppc_cpufreq.c
> @@ -396,9 +396,10 @@ static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu,
>         return cppc_cpufreq_perf_to_khz(cpu, delivered_perf);
>  }
> 
> -static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
> +static int cppc_cpufreq_get_rate_cpu(void *info)
>  {
>         struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0};
> + unsigned int cpunum = *(unsigned int *)info;
>         struct cppc_cpudata *cpu = all_cpu_data[cpunum];
>         int ret;
> 
> @@ -418,6 +419,22 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
>         return cppc_get_rate_from_fbctrs(cpu, fb_ctrs_t0, fb_ctrs_t1);
>  }
> 
> +static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
> +{
> + unsigned int ret;
> +
> + ret = smp_call_on_cpu(cpunum, cppc_cpufreq_get_rate_cpu, &cpunum, true);
> +
> + /*
> +  * convert negative error code to zero, otherwise we will display
> +  * an odd value for 'cpuinfo_cur_freq' in sysfs
> +  */
> + if (ret < 0)
> +         ret = 0;
> +
> + return ret;
> +}
> +
>  static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state)
>  {
>         struct cppc_cpudata *cpudata;

I don't see any other sane solution, even if this brings the CPU back
to normal state and waste power. We should be able to reliably provide
value to userspace.

Rafael / Sudeep: What you do say ?
Sudeep Holla June 3, 2020, 10:07 a.m. UTC | #3
On Wed, Jun 03, 2020 at 01:22:00PM +0530, Viresh Kumar wrote:
> On 02-06-20, 11:34, Xiongfeng Wang wrote:
> > Hi Viresh,
> > 
> > Sorry to disturb you about another problem as follows.
> > 
> > CPPC use the increment of Desired Performance counter and Reference Performance
> > counter to get the CPU frequency and show it in sysfs through
> > 'cpuinfo_cur_freq'. But ACPI CPPC doesn't specifically define the behavior of
> > these two counters when the CPU is in idle state, such as stop incrementing when
> > the CPU is in idle state.
> > 
> > ARMv8.4 Extension inctroduced support for the Activity Monitors Unit (AMU). The
> > processor frequency cycles and constant frequency cycles in AMU can be used as
> > Delivered Performance counter and Reference Performance counter. These two
> > counter in AMU does not increase when the PE is in WFI or WFE. So the increment
> > is zero when the PE is in WFI/WFE. This cause no issue because
> > 'cppc_get_rate_from_fbctrs()' in cppc_cpufreq driver will check the increment
> > and return the desired performance if the increment is zero.
> > 
> > But when the CPU goes into power down idle state, accessing these two counters
> > in AMU by memory-mapped address will return zero. Such as CPU1 went into power
> > down idle state and CPU0 try to get the frequency of CPU1. In this situation,
> > will display a very big value for 'cpuinfo_cur_freq' in sysfs. Do you have some
> > advice about this problem ?
> > 
> > I was thinking about an idea as follows. We can run 'cppc_cpufreq_get_rate()' on
> > the CPU to be measured, so that we can make sure the CPU is in C0 state when we
> > access the two counters. Also we can return the actual frequency rather than
> > desired performance when the CPU is in WFI/WFE. But this modification will
> > change the existing logical and I am not sure if this will cause some bad effect.
> > 
> > 
> > diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
> > index 257d726..ded3bcc 100644
> > --- a/drivers/cpufreq/cppc_cpufreq.c
> > +++ b/drivers/cpufreq/cppc_cpufreq.c
> > @@ -396,9 +396,10 @@ static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu,
> >         return cppc_cpufreq_perf_to_khz(cpu, delivered_perf);
> >  }
> > 
> > -static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
> > +static int cppc_cpufreq_get_rate_cpu(void *info)
> >  {
> >         struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0};
> > + unsigned int cpunum = *(unsigned int *)info;
> >         struct cppc_cpudata *cpu = all_cpu_data[cpunum];
> >         int ret;
> > 
> > @@ -418,6 +419,22 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
> >         return cppc_get_rate_from_fbctrs(cpu, fb_ctrs_t0, fb_ctrs_t1);
> >  }
> > 
> > +static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
> > +{
> > + unsigned int ret;
> > +
> > + ret = smp_call_on_cpu(cpunum, cppc_cpufreq_get_rate_cpu, &cpunum, true);
> > +
> > + /*
> > +  * convert negative error code to zero, otherwise we will display
> > +  * an odd value for 'cpuinfo_cur_freq' in sysfs
> > +  */
> > + if (ret < 0)
> > +         ret = 0;
> > +
> > + return ret;
> > +}
> > +
> >  static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state)
> >  {
> >         struct cppc_cpudata *cpudata;
> 
> I don't see any other sane solution, even if this brings the CPU back
> to normal state and waste power. We should be able to reliably provide
> value to userspace.
> 
> Rafael / Sudeep: What you do say ?

Agreed on returning 0 as it aligns with the semantics followed. We can't
return the last set/fetched value as it fails to align with the values
returned when CPU is not idle.

But I have another question. If we can detect that CPPC on some platforms
rely on CPU registers(I assume FFH registers here and not system/io/...
type of GAS registers), can we set dvfs_on_any_cpu(can't recall exact
flag name) to false if not already done to prevent such issues. Or I am
talking non-sense as it may be applicable only for _set operation and
not _get.
Viresh Kumar June 3, 2020, 10:10 a.m. UTC | #4
On 03-06-20, 11:07, Sudeep Holla wrote:
> But I have another question. If we can detect that CPPC on some platforms
> rely on CPU registers(I assume FFH registers here and not system/io/...
> type of GAS registers), can we set dvfs_on_any_cpu(can't recall exact
> flag name) to false if not already done to prevent such issues. Or I am
> talking non-sense as it may be applicable only for _set operation and

          Yes, non-sense :)

> not _get.
Sudeep Holla June 3, 2020, 10:17 a.m. UTC | #5
On Wed, Jun 03, 2020 at 03:40:10PM +0530, Viresh Kumar wrote:
> On 03-06-20, 11:07, Sudeep Holla wrote:
> > But I have another question. If we can detect that CPPC on some platforms
> > rely on CPU registers(I assume FFH registers here and not system/io/...
> > type of GAS registers), can we set dvfs_on_any_cpu(can't recall exact
> > flag name) to false if not already done to prevent such issues. Or I am
> > talking non-sense as it may be applicable only for _set operation and
>
>           Yes, non-sense :)
>

Thanks for confirming 
Viresh Kumar June 3, 2020, 10:21 a.m. UTC | #6
On 03-06-20, 11:17, Sudeep Holla wrote:
> On Wed, Jun 03, 2020 at 03:40:10PM +0530, Viresh Kumar wrote:
> > On 03-06-20, 11:07, Sudeep Holla wrote:
> > > But I have another question. If we can detect that CPPC on some platforms
> > > rely on CPU registers(I assume FFH registers here and not system/io/...
> > > type of GAS registers), can we set dvfs_on_any_cpu(can't recall exact
> > > flag name) to false if not already done to prevent such issues. Or I am
> > > talking non-sense as it may be applicable only for _set operation and
> >
> >           Yes, non-sense :)
> >
> 
> Thanks for confirming 
Sudeep Holla June 3, 2020, 10:40 a.m. UTC | #7
On Wed, Jun 03, 2020 at 03:51:59PM +0530, Viresh Kumar wrote:
> On 03-06-20, 11:17, Sudeep Holla wrote:
> > On Wed, Jun 03, 2020 at 03:40:10PM +0530, Viresh Kumar wrote:
> > > On 03-06-20, 11:07, Sudeep Holla wrote:
> > > > But I have another question. If we can detect that CPPC on some platforms
> > > > rely on CPU registers(I assume FFH registers here and not system/io/...
> > > > type of GAS registers), can we set dvfs_on_any_cpu(can't recall exact
> > > > flag name) to false if not already done to prevent such issues. Or I am
> > > > talking non-sense as it may be applicable only for _set operation and
> > >
> > >           Yes, non-sense :)
> > >
> >
> > Thanks for confirming 
Rafael J. Wysocki June 3, 2020, 1:39 p.m. UTC | #8
On Wed, Jun 3, 2020 at 9:52 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 02-06-20, 11:34, Xiongfeng Wang wrote:
> > Hi Viresh,
> >
> > Sorry to disturb you about another problem as follows.
> >
> > CPPC use the increment of Desired Performance counter and Reference Performance
> > counter to get the CPU frequency and show it in sysfs through
> > 'cpuinfo_cur_freq'. But ACPI CPPC doesn't specifically define the behavior of
> > these two counters when the CPU is in idle state, such as stop incrementing when
> > the CPU is in idle state.
> >
> > ARMv8.4 Extension inctroduced support for the Activity Monitors Unit (AMU). The
> > processor frequency cycles and constant frequency cycles in AMU can be used as
> > Delivered Performance counter and Reference Performance counter. These two
> > counter in AMU does not increase when the PE is in WFI or WFE. So the increment
> > is zero when the PE is in WFI/WFE. This cause no issue because
> > 'cppc_get_rate_from_fbctrs()' in cppc_cpufreq driver will check the increment
> > and return the desired performance if the increment is zero.
> >
> > But when the CPU goes into power down idle state, accessing these two counters
> > in AMU by memory-mapped address will return zero. Such as CPU1 went into power
> > down idle state and CPU0 try to get the frequency of CPU1. In this situation,
> > will display a very big value for 'cpuinfo_cur_freq' in sysfs. Do you have some
> > advice about this problem ?
> >
> > I was thinking about an idea as follows. We can run 'cppc_cpufreq_get_rate()' on
> > the CPU to be measured, so that we can make sure the CPU is in C0 state when we
> > access the two counters. Also we can return the actual frequency rather than
> > desired performance when the CPU is in WFI/WFE. But this modification will
> > change the existing logical and I am not sure if this will cause some bad effect.
> >
> >
> > diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
> > index 257d726..ded3bcc 100644
> > --- a/drivers/cpufreq/cppc_cpufreq.c
> > +++ b/drivers/cpufreq/cppc_cpufreq.c
> > @@ -396,9 +396,10 @@ static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu,
> >         return cppc_cpufreq_perf_to_khz(cpu, delivered_perf);
> >  }
> >
> > -static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
> > +static int cppc_cpufreq_get_rate_cpu(void *info)
> >  {
> >         struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0};
> > + unsigned int cpunum = *(unsigned int *)info;
> >         struct cppc_cpudata *cpu = all_cpu_data[cpunum];
> >         int ret;
> >
> > @@ -418,6 +419,22 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
> >         return cppc_get_rate_from_fbctrs(cpu, fb_ctrs_t0, fb_ctrs_t1);
> >  }
> >
> > +static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
> > +{
> > + unsigned int ret;
> > +
> > + ret = smp_call_on_cpu(cpunum, cppc_cpufreq_get_rate_cpu, &cpunum, true);
> > +
> > + /*
> > +  * convert negative error code to zero, otherwise we will display
> > +  * an odd value for 'cpuinfo_cur_freq' in sysfs
> > +  */
> > + if (ret < 0)
> > +         ret = 0;
> > +
> > + return ret;
> > +}
> > +
> >  static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state)
> >  {
> >         struct cppc_cpudata *cpudata;
>
> I don't see any other sane solution, even if this brings the CPU back
> to normal state and waste power. We should be able to reliably provide
> value to userspace.
>
> Rafael / Sudeep: What you do say ?

The frequency value obtained by kicking the CPU out of idle
artificially is bogus, though.  You may as well return a random number
instead.

The frequency of a CPU in an idle state is in fact unknown in the case
at hand, so returning 0 looks like the cleanest option to me.

Thanks!
Xiongfeng Wang June 4, 2020, 1:32 a.m. UTC | #9
Hi Rafael,

Thanks for your reply !

On 2020/6/3 21:39, Rafael J. Wysocki wrote:
> On Wed, Jun 3, 2020 at 9:52 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>>
>> On 02-06-20, 11:34, Xiongfeng Wang wrote:
>>> Hi Viresh,
>>>
>>> Sorry to disturb you about another problem as follows.
>>>
>>> CPPC use the increment of Desired Performance counter and Reference Performance
>>> counter to get the CPU frequency and show it in sysfs through
>>> 'cpuinfo_cur_freq'. But ACPI CPPC doesn't specifically define the behavior of
>>> these two counters when the CPU is in idle state, such as stop incrementing when
>>> the CPU is in idle state.
>>>
>>> ARMv8.4 Extension inctroduced support for the Activity Monitors Unit (AMU). The
>>> processor frequency cycles and constant frequency cycles in AMU can be used as
>>> Delivered Performance counter and Reference Performance counter. These two
>>> counter in AMU does not increase when the PE is in WFI or WFE. So the increment
>>> is zero when the PE is in WFI/WFE. This cause no issue because
>>> 'cppc_get_rate_from_fbctrs()' in cppc_cpufreq driver will check the increment
>>> and return the desired performance if the increment is zero.
>>>
>>> But when the CPU goes into power down idle state, accessing these two counters
>>> in AMU by memory-mapped address will return zero. Such as CPU1 went into power
>>> down idle state and CPU0 try to get the frequency of CPU1. In this situation,
>>> will display a very big value for 'cpuinfo_cur_freq' in sysfs. Do you have some
>>> advice about this problem ?
>>>
>>> I was thinking about an idea as follows. We can run 'cppc_cpufreq_get_rate()' on
>>> the CPU to be measured, so that we can make sure the CPU is in C0 state when we
>>> access the two counters. Also we can return the actual frequency rather than
>>> desired performance when the CPU is in WFI/WFE. But this modification will
>>> change the existing logical and I am not sure if this will cause some bad effect.
>>>
>>>
>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
>>> index 257d726..ded3bcc 100644
>>> --- a/drivers/cpufreq/cppc_cpufreq.c
>>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>>> @@ -396,9 +396,10 @@ static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu,
>>>         return cppc_cpufreq_perf_to_khz(cpu, delivered_perf);
>>>  }
>>>
>>> -static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
>>> +static int cppc_cpufreq_get_rate_cpu(void *info)
>>>  {
>>>         struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0};
>>> + unsigned int cpunum = *(unsigned int *)info;
>>>         struct cppc_cpudata *cpu = all_cpu_data[cpunum];
>>>         int ret;
>>>
>>> @@ -418,6 +419,22 @@ static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
>>>         return cppc_get_rate_from_fbctrs(cpu, fb_ctrs_t0, fb_ctrs_t1);
>>>  }
>>>
>>> +static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
>>> +{
>>> + unsigned int ret;
>>> +
>>> + ret = smp_call_on_cpu(cpunum, cppc_cpufreq_get_rate_cpu, &cpunum, true);
>>> +
>>> + /*
>>> +  * convert negative error code to zero, otherwise we will display
>>> +  * an odd value for 'cpuinfo_cur_freq' in sysfs
>>> +  */
>>> + if (ret < 0)
>>> +         ret = 0;
>>> +
>>> + return ret;
>>> +}
>>> +
>>>  static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state)
>>>  {
>>>         struct cppc_cpudata *cpudata;
>>
>> I don't see any other sane solution, even if this brings the CPU back
>> to normal state and waste power. We should be able to reliably provide
>> value to userspace.
>>
>> Rafael / Sudeep: What you do say ?
> 
> The frequency value obtained by kicking the CPU out of idle
> artificially is bogus, though.  You may as well return a random number
> instead.

Yes, it may return a randowm number as well.

> 
> The frequency of a CPU in an idle state is in fact unknown in the case
> at hand, so returning 0 looks like the cleanest option to me.

I am not sure about how the user will use 'cpuinfo_cur_freq' in sysfs. If I
return 0 when the CPU is idle, when I run a light load on the CPU, I will get a
zero value for 'cpuinfo_cur_freq' when the CPU is idle. When the CPU is not
idle, I will get a non-zero value. The user may feel odd about
'cpuinfo_cur_frreq' switching between a zero value and a non-zero value. They
may hope it can return the frequency when the CPU execute instructions, namely
in C0 state. I am not so sure about the user will look at 'cpuinfo_cur_freq'.

Thanks,
Xiongfeng

> 
> Thanks!
> 
> .
>
Viresh Kumar June 4, 2020, 4:41 a.m. UTC | #10
On 04-06-20, 09:32, Xiongfeng Wang wrote:
> On 2020/6/3 21:39, Rafael J. Wysocki wrote:
> > The frequency value obtained by kicking the CPU out of idle
> > artificially is bogus, though.  You may as well return a random number
> > instead.
> 
> Yes, it may return a randowm number as well.
> 
> > 
> > The frequency of a CPU in an idle state is in fact unknown in the case
> > at hand, so returning 0 looks like the cleanest option to me.
> 
> I am not sure about how the user will use 'cpuinfo_cur_freq' in sysfs. If I
> return 0 when the CPU is idle, when I run a light load on the CPU, I will get a
> zero value for 'cpuinfo_cur_freq' when the CPU is idle. When the CPU is not
> idle, I will get a non-zero value. The user may feel odd about
> 'cpuinfo_cur_frreq' switching between a zero value and a non-zero value. They
> may hope it can return the frequency when the CPU execute instructions, namely
> in C0 state. I am not so sure about the user will look at 'cpuinfo_cur_freq'.

This is what I was worried about as well. The interface to sysfs needs
to be robust. Returning frequency on some readings and 0 on others
doesn't look right to me as well. This will break scripts (I am not
sure if some scripts are there to look for these values) with the
randomness of values returned by it.

On reading values locally from the CPU, I thought about the case where
userspace can prevent a CPU going into idle just by reading its
frequency from sysfs (and so waste power), but the same can be done by
userspace to run arbitrary load on the CPUs.

Can we do some sort of caching of the last frequency the CPU was
running at before going into idle ? Then we can just check if cpu is
idle and so return cached value.
Rafael J. Wysocki June 4, 2020, 10:42 a.m. UTC | #11
On Thu, Jun 4, 2020 at 6:41 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 04-06-20, 09:32, Xiongfeng Wang wrote:
> > On 2020/6/3 21:39, Rafael J. Wysocki wrote:
> > > The frequency value obtained by kicking the CPU out of idle
> > > artificially is bogus, though.  You may as well return a random number
> > > instead.
> >
> > Yes, it may return a randowm number as well.
> >
> > >
> > > The frequency of a CPU in an idle state is in fact unknown in the case
> > > at hand, so returning 0 looks like the cleanest option to me.
> >
> > I am not sure about how the user will use 'cpuinfo_cur_freq' in sysfs. If I
> > return 0 when the CPU is idle, when I run a light load on the CPU, I will get a
> > zero value for 'cpuinfo_cur_freq' when the CPU is idle. When the CPU is not
> > idle, I will get a non-zero value. The user may feel odd about
> > 'cpuinfo_cur_frreq' switching between a zero value and a non-zero value. They
> > may hope it can return the frequency when the CPU execute instructions, namely
> > in C0 state. I am not so sure about the user will look at 'cpuinfo_cur_freq'.
>
> This is what I was worried about as well. The interface to sysfs needs
> to be robust. Returning frequency on some readings and 0 on others
> doesn't look right to me as well. This will break scripts (I am not
> sure if some scripts are there to look for these values) with the
> randomness of values returned by it.

The only thing the scripts need to do is to skip zeros (or anything
less than the minimum hw frequency for that matter) coming from that
attribute.

> On reading values locally from the CPU, I thought about the case where
> userspace can prevent a CPU going into idle just by reading its
> frequency from sysfs (and so waste power), but the same can be done by
> userspace to run arbitrary load on the CPUs.
>
> Can we do some sort of caching of the last frequency the CPU was
> running at before going into idle ? Then we can just check if cpu is
> idle and so return cached value.

That is an option, but it looks like in this case the cpuinfo_cur_freq
attribute should not be present at all, as per the documentation.

Cheers!
Sudeep Holla June 4, 2020, 12:58 p.m. UTC | #12
On Thu, Jun 04, 2020 at 12:42:06PM +0200, Rafael J. Wysocki wrote:
> On Thu, Jun 4, 2020 at 6:41 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> >
> > On 04-06-20, 09:32, Xiongfeng Wang wrote:
> > > On 2020/6/3 21:39, Rafael J. Wysocki wrote:
> > > > The frequency value obtained by kicking the CPU out of idle
> > > > artificially is bogus, though.  You may as well return a random number
> > > > instead.
> > >
> > > Yes, it may return a randowm number as well.
> > >
> > > >
> > > > The frequency of a CPU in an idle state is in fact unknown in the case
> > > > at hand, so returning 0 looks like the cleanest option to me.
> > >
> > > I am not sure about how the user will use 'cpuinfo_cur_freq' in sysfs. If I
> > > return 0 when the CPU is idle, when I run a light load on the CPU, I will get a
> > > zero value for 'cpuinfo_cur_freq' when the CPU is idle. When the CPU is not
> > > idle, I will get a non-zero value. The user may feel odd about
> > > 'cpuinfo_cur_frreq' switching between a zero value and a non-zero value. They
> > > may hope it can return the frequency when the CPU execute instructions, namely
> > > in C0 state. I am not so sure about the user will look at 'cpuinfo_cur_freq'.
> >
> > This is what I was worried about as well. The interface to sysfs needs
> > to be robust. Returning frequency on some readings and 0 on others
> > doesn't look right to me as well. This will break scripts (I am not
> > sure if some scripts are there to look for these values) with the
> > randomness of values returned by it.
> 
> The only thing the scripts need to do is to skip zeros (or anything
> less than the minimum hw frequency for that matter) coming from that
> attribute.
> 
> > On reading values locally from the CPU, I thought about the case where
> > userspace can prevent a CPU going into idle just by reading its
> > frequency from sysfs (and so waste power), but the same can be done by
> > userspace to run arbitrary load on the CPUs.
> >
> > Can we do some sort of caching of the last frequency the CPU was
> > running at before going into idle ? Then we can just check if cpu is
> > idle and so return cached value.
> 
> That is an option, but it looks like in this case the cpuinfo_cur_freq
> attribute should not be present at all, as per the documentation.
> 

+1 for dropping the attribute.
Ionela Voinescu June 10, 2020, 9:40 a.m. UTC | #13
Hi guys,

Sorry for showing up late to the party, I was on holiday last week.

On Thursday 04 Jun 2020 at 13:58:22 (+0100), Sudeep Holla wrote:
> On Thu, Jun 04, 2020 at 12:42:06PM +0200, Rafael J. Wysocki wrote:
> > On Thu, Jun 4, 2020 at 6:41 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > >
> > > On 04-06-20, 09:32, Xiongfeng Wang wrote:
> > > > On 2020/6/3 21:39, Rafael J. Wysocki wrote:
> > > > > The frequency value obtained by kicking the CPU out of idle
> > > > > artificially is bogus, though.  You may as well return a random number
> > > > > instead.
> > > >
> > > > Yes, it may return a randowm number as well.
> > > >
> > > > >
> > > > > The frequency of a CPU in an idle state is in fact unknown in the case
> > > > > at hand, so returning 0 looks like the cleanest option to me.
> > > >
> > > > I am not sure about how the user will use 'cpuinfo_cur_freq' in sysfs. If I
> > > > return 0 when the CPU is idle, when I run a light load on the CPU, I will get a
> > > > zero value for 'cpuinfo_cur_freq' when the CPU is idle. When the CPU is not
> > > > idle, I will get a non-zero value. The user may feel odd about
> > > > 'cpuinfo_cur_frreq' switching between a zero value and a non-zero value. They
> > > > may hope it can return the frequency when the CPU execute instructions, namely
> > > > in C0 state. I am not so sure about the user will look at 'cpuinfo_cur_freq'.
> > >
> > > This is what I was worried about as well. The interface to sysfs needs
> > > to be robust. Returning frequency on some readings and 0 on others
> > > doesn't look right to me as well. This will break scripts (I am not
> > > sure if some scripts are there to look for these values) with the
> > > randomness of values returned by it.
> > 
> > The only thing the scripts need to do is to skip zeros (or anything
> > less than the minimum hw frequency for that matter) coming from that
> > attribute.
> > 
> > > On reading values locally from the CPU, I thought about the case where
> > > userspace can prevent a CPU going into idle just by reading its
> > > frequency from sysfs (and so waste power), but the same can be done by
> > > userspace to run arbitrary load on the CPUs.
> > >
> > > Can we do some sort of caching of the last frequency the CPU was
> > > running at before going into idle ? Then we can just check if cpu is
> > > idle and so return cached value.
> > 
> > That is an option, but it looks like in this case the cpuinfo_cur_freq
> > attribute should not be present at all, as per the documentation.
> > 
> 
> +1 for dropping the attribute.
> 

I've been experimenting with some code quite recently that uses the
scheduler frequency scale factor to compute this hardware current rate
for CPPC.

On the scheduler tick, the scale factor is computed in
arch_scale_freq_tick() to give an indication on delivered performance,
using AMUs on arm64 [1] and APERF/MPERF on x86 [2]. Basically, this scale
factor has the cached value of the average delivered performance between
the last two scheduler ticks, on a capacity scale: 0-1024. All that would
be needed is to convert from the scheduler frequency scale to the CPPC
expected performance scale.

The gist of the code would be:

delivered_perf = topology_get_freq_scale(cpu);
delivered_perf *= fb_ctrs.reference_perf;
delivered_perf = div64_u64(delivered_perf << SCHED_CAPACITY_SHIFT,
			   per_cpu(arch_max_freq_scale, cpu));

While this solution is not perfect, it would provide the best view of
the hardware "current" rate without the cost of waking up the CPU when
idle, scheduling additional work on the CPU, doing checks on whether
the CPU is idle and/or providing other caching mechanisms.

Do you think such an implementation could make cpuinfo_cur_freq worth
keeping?

I'm happy to push the patches for this and discuss the details there.

Thanks,
Ionela.


[1] https://lkml.org/lkml/2020/3/5/183
[2] https://lkml.org/lkml/2020/1/22/1039

> -- 
> Regards,
> Sudeep
Xiongfeng Wang June 11, 2020, 1:52 a.m. UTC | #14
Hi Ionela,

Thanks for your reply !

On 2020/6/10 17:40, Ionela Voinescu wrote:
> Hi guys,
> 
> Sorry for showing up late to the party, I was on holiday last week.
> 
> On Thursday 04 Jun 2020 at 13:58:22 (+0100), Sudeep Holla wrote:
>> On Thu, Jun 04, 2020 at 12:42:06PM +0200, Rafael J. Wysocki wrote:
>>> On Thu, Jun 4, 2020 at 6:41 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>>>>
>>>> On 04-06-20, 09:32, Xiongfeng Wang wrote:
>>>>> On 2020/6/3 21:39, Rafael J. Wysocki wrote:
>>>>>> The frequency value obtained by kicking the CPU out of idle
>>>>>> artificially is bogus, though.  You may as well return a random number
>>>>>> instead.
>>>>>
>>>>> Yes, it may return a randowm number as well.
>>>>>
>>>>>>
>>>>>> The frequency of a CPU in an idle state is in fact unknown in the case
>>>>>> at hand, so returning 0 looks like the cleanest option to me.
>>>>>
>>>>> I am not sure about how the user will use 'cpuinfo_cur_freq' in sysfs. If I
>>>>> return 0 when the CPU is idle, when I run a light load on the CPU, I will get a
>>>>> zero value for 'cpuinfo_cur_freq' when the CPU is idle. When the CPU is not
>>>>> idle, I will get a non-zero value. The user may feel odd about
>>>>> 'cpuinfo_cur_frreq' switching between a zero value and a non-zero value. They
>>>>> may hope it can return the frequency when the CPU execute instructions, namely
>>>>> in C0 state. I am not so sure about the user will look at 'cpuinfo_cur_freq'.
>>>>
>>>> This is what I was worried about as well. The interface to sysfs needs
>>>> to be robust. Returning frequency on some readings and 0 on others
>>>> doesn't look right to me as well. This will break scripts (I am not
>>>> sure if some scripts are there to look for these values) with the
>>>> randomness of values returned by it.
>>>
>>> The only thing the scripts need to do is to skip zeros (or anything
>>> less than the minimum hw frequency for that matter) coming from that
>>> attribute.
>>>
>>>> On reading values locally from the CPU, I thought about the case where
>>>> userspace can prevent a CPU going into idle just by reading its
>>>> frequency from sysfs (and so waste power), but the same can be done by
>>>> userspace to run arbitrary load on the CPUs.
>>>>
>>>> Can we do some sort of caching of the last frequency the CPU was
>>>> running at before going into idle ? Then we can just check if cpu is
>>>> idle and so return cached value.
>>>
>>> That is an option, but it looks like in this case the cpuinfo_cur_freq
>>> attribute should not be present at all, as per the documentation.
>>>
>>
>> +1 for dropping the attribute.
>>
> 
> I've been experimenting with some code quite recently that uses the
> scheduler frequency scale factor to compute this hardware current rate
> for CPPC.
> 
> On the scheduler tick, the scale factor is computed in
> arch_scale_freq_tick() to give an indication on delivered performance,
> using AMUs on arm64 [1] and APERF/MPERF on x86 [2]. Basically, this scale
> factor has the cached value of the average delivered performance between
> the last two scheduler ticks, on a capacity scale: 0-1024. All that would
> be needed is to convert from the scheduler frequency scale to the CPPC
> expected performance scale.
> 
> The gist of the code would be:
> 
> delivered_perf = topology_get_freq_scale(cpu);
> delivered_perf *= fb_ctrs.reference_perf;
> delivered_perf = div64_u64(delivered_perf << SCHED_CAPACITY_SHIFT,
> 			   per_cpu(arch_max_freq_scale, cpu));
> 
> While this solution is not perfect, it would provide the best view of
> the hardware "current" rate without the cost of waking up the CPU when
> idle, scheduling additional work on the CPU, doing checks on whether
> the CPU is idle and/or providing other caching mechanisms.

I think it's a good idea. It's just that the value is a average frequency in the
last two scheduler ticks, not an instantaneous frequency. What
'cppc_cpufreq_get_rate()' get is also not an  instantaneous frequency. It's a
average frequency in 2us. I check the interval between two frequency updates on
my machine. It's about 10ms. So the frequency will change at least one time in
two scheduler ticks if HZ is 1000.

I am not sure how people will use 'cpuinfo_cur_freq'. They may not expect a very
accurate frequency. How about we return this value when CPU is idle? Or just
return 0 in idle is better ?

> 
> Do you think such an implementation could make cpuinfo_cur_freq worth
> keeping?
> 
> I'm happy to push the patches for this and discuss the details there.

Thanks. I'm happy to see the patches and give some help.

Thanks,
Xiongfeng

> 
> Thanks,
> Ionela.
> 
> 
> [1] https://lkml.org/lkml/2020/3/5/183
> [2] https://lkml.org/lkml/2020/1/22/1039
> 
>> -- 
>> Regards,
>> Sudeep
> 
> .
>
Rafael J. Wysocki June 12, 2020, 11:55 a.m. UTC | #15
On Thu, Jun 11, 2020 at 3:52 AM Xiongfeng Wang
<wangxiongfeng2@huawei.com> wrote:
>
> Hi Ionela,
>
> Thanks for your reply !
>
> On 2020/6/10 17:40, Ionela Voinescu wrote:
> > Hi guys,
> >
> > Sorry for showing up late to the party, I was on holiday last week.
> >
> > On Thursday 04 Jun 2020 at 13:58:22 (+0100), Sudeep Holla wrote:
> >> On Thu, Jun 04, 2020 at 12:42:06PM +0200, Rafael J. Wysocki wrote:
> >>> On Thu, Jun 4, 2020 at 6:41 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> >>>>
> >>>> On 04-06-20, 09:32, Xiongfeng Wang wrote:
> >>>>> On 2020/6/3 21:39, Rafael J. Wysocki wrote:
> >>>>>> The frequency value obtained by kicking the CPU out of idle
> >>>>>> artificially is bogus, though.  You may as well return a random number
> >>>>>> instead.
> >>>>>
> >>>>> Yes, it may return a randowm number as well.
> >>>>>
> >>>>>>
> >>>>>> The frequency of a CPU in an idle state is in fact unknown in the case
> >>>>>> at hand, so returning 0 looks like the cleanest option to me.
> >>>>>
> >>>>> I am not sure about how the user will use 'cpuinfo_cur_freq' in sysfs. If I
> >>>>> return 0 when the CPU is idle, when I run a light load on the CPU, I will get a
> >>>>> zero value for 'cpuinfo_cur_freq' when the CPU is idle. When the CPU is not
> >>>>> idle, I will get a non-zero value. The user may feel odd about
> >>>>> 'cpuinfo_cur_frreq' switching between a zero value and a non-zero value. They
> >>>>> may hope it can return the frequency when the CPU execute instructions, namely
> >>>>> in C0 state. I am not so sure about the user will look at 'cpuinfo_cur_freq'.
> >>>>
> >>>> This is what I was worried about as well. The interface to sysfs needs
> >>>> to be robust. Returning frequency on some readings and 0 on others
> >>>> doesn't look right to me as well. This will break scripts (I am not
> >>>> sure if some scripts are there to look for these values) with the
> >>>> randomness of values returned by it.
> >>>
> >>> The only thing the scripts need to do is to skip zeros (or anything
> >>> less than the minimum hw frequency for that matter) coming from that
> >>> attribute.
> >>>
> >>>> On reading values locally from the CPU, I thought about the case where
> >>>> userspace can prevent a CPU going into idle just by reading its
> >>>> frequency from sysfs (and so waste power), but the same can be done by
> >>>> userspace to run arbitrary load on the CPUs.
> >>>>
> >>>> Can we do some sort of caching of the last frequency the CPU was
> >>>> running at before going into idle ? Then we can just check if cpu is
> >>>> idle and so return cached value.
> >>>
> >>> That is an option, but it looks like in this case the cpuinfo_cur_freq
> >>> attribute should not be present at all, as per the documentation.
> >>>
> >>
> >> +1 for dropping the attribute.
> >>
> >
> > I've been experimenting with some code quite recently that uses the
> > scheduler frequency scale factor to compute this hardware current rate
> > for CPPC.
> >
> > On the scheduler tick, the scale factor is computed in
> > arch_scale_freq_tick() to give an indication on delivered performance,
> > using AMUs on arm64 [1] and APERF/MPERF on x86 [2]. Basically, this scale
> > factor has the cached value of the average delivered performance between
> > the last two scheduler ticks, on a capacity scale: 0-1024. All that would
> > be needed is to convert from the scheduler frequency scale to the CPPC
> > expected performance scale.
> >
> > The gist of the code would be:
> >
> > delivered_perf = topology_get_freq_scale(cpu);
> > delivered_perf *= fb_ctrs.reference_perf;
> > delivered_perf = div64_u64(delivered_perf << SCHED_CAPACITY_SHIFT,
> >                          per_cpu(arch_max_freq_scale, cpu));
> >
> > While this solution is not perfect, it would provide the best view of
> > the hardware "current" rate without the cost of waking up the CPU when
> > idle, scheduling additional work on the CPU, doing checks on whether
> > the CPU is idle and/or providing other caching mechanisms.
>
> I think it's a good idea. It's just that the value is a average frequency in the
> last two scheduler ticks, not an instantaneous frequency. What
> 'cppc_cpufreq_get_rate()' get is also not an  instantaneous frequency. It's a
> average frequency in 2us. I check the interval between two frequency updates on
> my machine. It's about 10ms. So the frequency will change at least one time in
> two scheduler ticks if HZ is 1000.
>
> I am not sure how people will use 'cpuinfo_cur_freq'. They may not expect a very
> accurate frequency. How about we return this value when CPU is idle? Or just
> return 0 in idle is better ?

According to the documentation, if this attribute is present, it
should return the exact frequency of the CPU (with a caveat that it
may change already when user space is able to consume the value), and
which is what the users may expect.

As I said before, IMO the CPPC driver should not cause the core to
expose cpuinfo_cur_freq at all.

Thanks!
diff mbox series

Patch

diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
index 257d726..ded3bcc 100644
--- a/drivers/cpufreq/cppc_cpufreq.c
+++ b/drivers/cpufreq/cppc_cpufreq.c
@@ -396,9 +396,10 @@  static int cppc_get_rate_from_fbctrs(struct cppc_cpudata *cpu,
        return cppc_cpufreq_perf_to_khz(cpu, delivered_perf);
 }

-static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
+static int cppc_cpufreq_get_rate_cpu(void *info)
 {
        struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0};
+ unsigned int cpunum = *(unsigned int *)info;
        struct cppc_cpudata *cpu = all_cpu_data[cpunum];
        int ret;

@@ -418,6 +419,22 @@  static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
        return cppc_get_rate_from_fbctrs(cpu, fb_ctrs_t0, fb_ctrs_t1);
 }

+static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
+{
+ unsigned int ret;
+
+ ret = smp_call_on_cpu(cpunum, cppc_cpufreq_get_rate_cpu, &cpunum, true);
+
+ /*
+  * convert negative error code to zero, otherwise we will display
+  * an odd value for 'cpuinfo_cur_freq' in sysfs
+  */
+ if (ret < 0)
+         ret = 0;
+
+ return ret;
+}
+
 static int cppc_cpufreq_set_boost(struct cpufreq_policy *policy, int state)
 {
        struct cppc_cpudata *cpudata;