cpufreq / CPPC: Add cpuinfo_cur_freq support for CPPC

Message ID	1526989324-4183-1-git-send-email-george.cherian@cavium.com (mailing list archive)
State	Changes Requested, archived
Headers	From: George Cherian <george.cherian@cavium.com> To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org Cc: rjw@rjwysocki.net, viresh.kumar@linaro.org, George Cherian <george.cherian@cavium.com> Subject: [PATCH] cpufreq / CPPC: Add cpuinfo_cur_freq support for CPPC Date: Tue, 22 May 2018 04:42:04 -0700

Cherian, George May 22, 2018, 11:42 a.m. UTC

Per Section 8.4.7.1.3 of ACPI 6.2, the platform provides performance
feedback via a set of performance counters. To determine the actual
performance level delivered over time, OSPM may read a set of
performance counters from the Reference Performance Counter Register
and the Delivered Performance Counter Register.

OSPM calculates the delivered performance over a given time period by
taking a beginning and ending snapshot of both the reference and
delivered performance counters, and calculating:

delivered_perf = reference_perf X (delta of delivered_perf counter / delta of reference_perf counter).

Implement the above and hook this to the cpufreq->get method.

Signed-off-by: George Cherian <george.cherian@cavium.com>
---
 drivers/cpufreq/cppc_cpufreq.c | 44 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)
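
As a rough illustration of the calculation described in the commit message, the standalone C sketch below computes delivered performance from two counter snapshots. The struct and function names are hypothetical and are not the ones used in the patch, and the sketch assumes that neither counter wrapped between the two snapshots.

```c
#include <stdint.h>

/* Hypothetical snapshot type; not the kernel's struct cppc_perf_fb_ctrs. */
struct fb_snapshot {
	uint64_t reference;	/* Reference Performance Counter value */
	uint64_t delivered;	/* Delivered Performance Counter value */
};

/*
 * delivered_perf = reference_perf * delta_delivered / delta_reference
 *
 * Assumes t1 was taken after t0 and that no counter wrapped in between.
 * Returns 0 if the reference counter did not advance. The product can
 * overflow 64 bits for very large deltas; this sketch ignores that.
 */
static uint64_t delivered_perf(uint64_t reference_perf,
			       struct fb_snapshot t0, struct fb_snapshot t1)
{
	uint64_t delta_reference = t1.reference - t0.reference;
	uint64_t delta_delivered = t1.delivered - t0.delivered;

	if (delta_reference == 0)
		return 0;

	return (reference_perf * delta_delivered) / delta_reference;
}
```

For example, with reference_perf = 100, a reference delta of 1000 and a delivered delta of 1500, the sketch returns 150.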

Viresh Kumar May 23, 2018, 5:02 a.m. UTC | #1

On 22-05-18, 04:42, George Cherian wrote:
> Per Section 8.4.7.1.3 of ACPI 6.2, The platform provides performance
> feedback via set of performance counters. To determine the actual
> performance level delivered over time, OSPM may read a set of
> performance counters from the Reference Performance Counter Register
> and the Delivered Performance Counter Register.
> 
> OSPM calculates the delivered performance over a given time period by
> taking a beginning and ending snapshot of both the reference and
> delivered performance counters, and calculating:
> 
> delivered_perf = reference_perf X (delta of delivered_perf counter / delta of reference_perf counter).
> 
> Implement the above and hook this to the cpufreq->get method.
> 
> Signed-off-by: George Cherian <george.cherian@cavium.com>
> ---
>  drivers/cpufreq/cppc_cpufreq.c | 44 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>

Prakash, Prashanth May 24, 2018, 7:25 p.m. UTC | #2

Hi George,

On 5/22/2018 5:42 AM, George Cherian wrote:
> Per Section 8.4.7.1.3 of ACPI 6.2, The platform provides performance
> feedback via set of performance counters. To determine the actual
> performance level delivered over time, OSPM may read a set of
> performance counters from the Reference Performance Counter Register
> and the Delivered Performance Counter Register.
>
> OSPM calculates the delivered performance over a given time period by
> taking a beginning and ending snapshot of both the reference and
> delivered performance counters, and calculating:
>
> delivered_perf = reference_perf X (delta of delivered_perf counter / delta of reference_perf counter).
>
> Implement the above and hook this to the cpufreq->get method.
>
> Signed-off-by: George Cherian <george.cherian@cavium.com>
> ---
>  drivers/cpufreq/cppc_cpufreq.c | 44 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)
>
> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
> index b15115a..a046915 100644
> --- a/drivers/cpufreq/cppc_cpufreq.c
> +++ b/drivers/cpufreq/cppc_cpufreq.c
> @@ -240,10 +240,54 @@ static int cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
>  	return ret;
>  }
>  
> +static int cppc_get_rate_from_fbctrs(struct cppc_perf_fb_ctrs fb_ctrs_t0,
> +				     struct cppc_perf_fb_ctrs fb_ctrs_t1)
> +{
> +	u64 delta_reference, delta_delivered;
> +	u64 reference_perf, ratio;
> +
> +	reference_perf = fb_ctrs_t0.reference_perf;
> +	if (fb_ctrs_t1.reference > fb_ctrs_t0.reference)
> +		delta_reference = fb_ctrs_t1.reference - fb_ctrs_t0.reference;
> +	else /* Counters would have wrapped-around */
> +		delta_reference  = ((u64)(~((u64)0)) - fb_ctrs_t0.reference) +
> +					fb_ctrs_t1.reference;
> +
> +	if (fb_ctrs_t1.delivered > fb_ctrs_t0.delivered)
> +		delta_delivered = fb_ctrs_t1.delivered - fb_ctrs_t0.delivered;
> +	else /* Counters would have wrapped-around */
> +		delta_delivered  = ((u64)(~((u64)0)) - fb_ctrs_t0.delivered) +
> +					fb_ctrs_t1.delivered;
We need to check that the wraparound time is long enough to make sure that
the counters cannot wrap around more than once. We can register a get() API
only after checking that the wraparound time value is reasonably high.

I am not aware of any platforms where the wraparound time is so short, but
it wouldn't hurt to check once during init.
> +
> +	if (delta_reference)  /* Check to avoid divide-by zero */
> +		ratio = (delta_delivered * 1000) / delta_reference;
Why not just return the computed value here instead of *1000 and later /1000?
return (ref_per * delta_del) / delta_ref;
> +	else
> +		return -EINVAL;
Instead of EINVAL, I think we should return the current frequency.

The counters can pause if CPUs are in an idle state during our sampling interval, so
if the counters did not progress, it is reasonable to assume the delivered perf was
equal to desired perf.

Even if the platform wanted to limit, since the CPUs were asleep (idle) we could not have
observed lower performance, so we will not throw off any logic that could be driven
using the returned value.
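
The fallback suggested here could look roughly like the sketch below. It assumes that "current frequency" means the performance level last requested by the OS (desired perf); all names are hypothetical, and the result is still on the CPPC abstract scale.

```c
#include <stdint.h>

/*
 * Hypothetical helper: compute delivered performance, but fall back to
 * the last requested (desired) performance when the counters did not
 * advance during the sampling interval, e.g. because the CPU was idle.
 * Treating a stalled delivered counter the same way as a stalled
 * reference counter is an assumption of this sketch.
 */
static uint64_t delivered_or_desired(uint64_t reference_perf,
				     uint64_t delta_reference,
				     uint64_t delta_delivered,
				     uint64_t desired_perf)
{
	if (delta_reference == 0 || delta_delivered == 0)
		return desired_perf;

	return (reference_perf * delta_delivered) / delta_reference;
}
```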
> +
> +	return (reference_perf * ratio) / 1000;
This should be converted to KHz, as cpufreq is not aware of the CPPC abstract scale.
> +}
> +
> +static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
> +{
> +	struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0};
> +	int ret;
> +
> +	ret = cppc_get_perf_ctrs(cpunum, &fb_ctrs_t0);
> +	if (ret)
> +		return ret;
> +
> +	ret = cppc_get_perf_ctrs(cpunum, &fb_ctrs_t1);
> +	if (ret)
> +		return ret;
> +
> +	return cppc_get_rate_from_fbctrs(fb_ctrs_t0, fb_ctrs_t1);
> +}
We need to make sure that we get a reasonable sample so that the reported
performance is accurate.
The counters can capture short transient throttling/limiting, so by sampling over a really
short duration we could return a quite inaccurate measure of performance.

We need to include some reasonable delay between the two calls. What is reasonable
is debatable, so this can be a few (2-10) microseconds defined as the default. If the same value
doesn't work for all platforms, we might need to add a platform-specific value.
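
A minimal sketch of sampling with a delay between the two reads is shown below. So that it can run outside the kernel, the counter read and the delay are passed in as hypothetical callbacks and exercised against a simulated CPU; in the driver itself the delay would presumably be a udelay() call between the two cppc_get_perf_ctrs() calls. The 2 us default and all names here are assumptions.

```c
#include <stdint.h>

/* Hypothetical snapshot of the two feedback counters. */
struct fb_snapshot {
	uint64_t reference;
	uint64_t delivered;
};

/* Hypothetical hooks so the sketch can run against a simulated platform. */
struct sampler_ops {
	int (*read_ctrs)(void *ctx, struct fb_snapshot *out);
	void (*delay_us)(void *ctx, unsigned int usecs);
};

/* Assumed default gap between the two reads, in microseconds. */
#define SAMPLE_DELAY_US 2

/* Returns 0 and stores the result in *perf_out, or non-zero on failure. */
static int sample_delivered_perf(const struct sampler_ops *ops, void *ctx,
				 uint64_t reference_perf, uint64_t *perf_out)
{
	struct fb_snapshot t0, t1;
	uint64_t delta_reference, delta_delivered;
	int ret;

	ret = ops->read_ctrs(ctx, &t0);
	if (ret)
		return ret;

	/* Let the counters accumulate a meaningful number of ticks. */
	ops->delay_us(ctx, SAMPLE_DELAY_US);

	ret = ops->read_ctrs(ctx, &t1);
	if (ret)
		return ret;

	delta_reference = t1.reference - t0.reference;
	delta_delivered = t1.delivered - t0.delivered;
	if (delta_reference == 0)
		return -1;	/* reference counter did not advance */

	*perf_out = (reference_perf * delta_delivered) / delta_reference;
	return 0;
}

/* Simulated CPU: counters advance only while time passes in fake_delay(). */
struct fake_cpu {
	uint64_t reference;
	uint64_t delivered;
	uint64_t ref_ticks_per_us;
	uint64_t del_ticks_per_us;
};

static int fake_read(void *ctx, struct fb_snapshot *out)
{
	const struct fake_cpu *cpu = ctx;

	out->reference = cpu->reference;
	out->delivered = cpu->delivered;
	return 0;
}

static void fake_delay(void *ctx, unsigned int usecs)
{
	struct fake_cpu *cpu = ctx;

	cpu->reference += cpu->ref_ticks_per_us * usecs;
	cpu->delivered += cpu->del_ticks_per_us * usecs;
}

static const struct sampler_ops fake_ops = {
	.read_ctrs = fake_read,
	.delay_us = fake_delay,
};
```

In the simulation, a CPU whose reference counter advances 100 ticks per microsecond and whose delivered counter advances 150 ticks per microsecond yields 1500 for a reference_perf of 1000.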

> +
>  static struct cpufreq_driver cppc_cpufreq_driver = {
>  	.flags = CPUFREQ_CONST_LOOPS,
>  	.verify = cppc_verify_policy,
>  	.target = cppc_cpufreq_set_target,
> +	.get = cppc_cpufreq_get_rate,
>  	.init = cppc_cpufreq_cpu_init,
>  	.stop_cpu = cppc_cpufreq_stop_cpu,
>  	.name = "cppc_cpufreq",

George Cherian May 25, 2018, 6:27 a.m. UTC | #3

Hi Prashanth,

On 05/25/2018 12:55 AM, Prakash, Prashanth wrote:
> Hi George,
> 
> On 5/22/2018 5:42 AM, George Cherian wrote:
>> Per Section 8.4.7.1.3 of ACPI 6.2, The platform provides performance
>> feedback via set of performance counters. To determine the actual
>> performance level delivered over time, OSPM may read a set of
>> performance counters from the Reference Performance Counter Register
>> and the Delivered Performance Counter Register.
>>
>> OSPM calculates the delivered performance over a given time period by
>> taking a beginning and ending snapshot of both the reference and
>> delivered performance counters, and calculating:
>>
>> delivered_perf = reference_perf X (delta of delivered_perf counter / delta of reference_perf counter).
>>
>> Implement the above and hook this to the cpufreq->get method.
>>
>> Signed-off-by: George Cherian <george.cherian@cavium.com>
>> ---
>>   drivers/cpufreq/cppc_cpufreq.c | 44 ++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 44 insertions(+)
>>
>> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
>> index b15115a..a046915 100644
>> --- a/drivers/cpufreq/cppc_cpufreq.c
>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>> @@ -240,10 +240,54 @@ static int cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
>>   	return ret;
>>   }
>>   
>> +static int cppc_get_rate_from_fbctrs(struct cppc_perf_fb_ctrs fb_ctrs_t0,
>> +				     struct cppc_perf_fb_ctrs fb_ctrs_t1)
>> +{
>> +	u64 delta_reference, delta_delivered;
>> +	u64 reference_perf, ratio;
>> +
>> +	reference_perf = fb_ctrs_t0.reference_perf;
>> +	if (fb_ctrs_t1.reference > fb_ctrs_t0.reference)
>> +		delta_reference = fb_ctrs_t1.reference - fb_ctrs_t0.reference;
>> +	else /* Counters would have wrapped-around */
>> +		delta_reference  = ((u64)(~((u64)0)) - fb_ctrs_t0.reference) +
>> +					fb_ctrs_t1.reference;
>> +
>> +	if (fb_ctrs_t1.delivered > fb_ctrs_t0.delivered)
>> +		delta_delivered = fb_ctrs_t1.delivered - fb_ctrs_t0.delivered;
>> +	else /* Counters would have wrapped-around */
>> +		delta_delivered  = ((u64)(~((u64)0)) - fb_ctrs_t0.delivered) +
>> +					fb_ctrs_t1.delivered;
> We need to check that the wraparound time is long enough to make sure that
> the counters cannot wrap around more than once. We can register a  get() api
> only after checking that wraparound time value is reasonably high.
> 
> I am not aware of any platforms where wraparound time is soo short, but
> wouldn't hurt to check once during init.
By design the wraparound time is a 64-bit counter, and for that matter all
the feedback counters are 64-bit counters too. I don't see any way in
which the counters could wrap around twice between back-to-back reads.
The only such situation would be a system running at a really high
frequency, and even in that case today's spec is not sufficient to
support it.

>> +
>> +	if (delta_reference)  /* Check to avoid divide-by zero */
>> +		ratio = (delta_delivered * 1000) / delta_reference;
> Why not just return the computed value here instead of *1000 and later /1000?
> return (ref_per * delta_del) / delta_ref;
Yes.
>> +	else
>> +		return -EINVAL;
> Instead of EINVAL, i think we should return current frequency.
> 
Sorry, I didn't follow. How do you calculate the current frequency?
Did you mean reference performance?

> The counters can pause if CPUs are in idle state during our sampling interval, so
> If the counters did not progress, it is reasonable to assume the delivered perf was
> equal to desired perf.
No, that is wrong. Here the check is for the reference performance delta.
This counter can never pause. In the case of cpuidle, only the delivered
counters could pause. Delivered counters will pause only if the
particular core enters power-down mode; otherwise we would still be
clocking the core and we should be getting a delta across the two
samples. If the reference counter is paused, which by design is
not correct, then there is no point in returning reference performance
numbers. That too is wrong. If the low-level FW is not updating the
counters properly, then that should be made evident to Linux, instead of
returning a bogus frequency.
> 
> Even if platform wanted to limit, since the CPUs were asleep(idle) we could not have
> observed lower performance, so we will not throw off  any logic that could be driven
> using the returned value.
>> +
>> +	return (reference_perf * ratio) / 1000;
> This should be converted to KHz as cpufreq is not aware of CPPC abstract scale
On our platform all performance registers are implemented in KHz,
so we never had an issue with conversion. I am not
aware whether ACPI mandates any particular unit. How is that
implemented on your platform? To avoid any extra conversion, don't
you feel it is better to always report in KHz from firmware?

>> +}
>> +
>> +static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
>> +{
>> +	struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0};
>> +	int ret;
>> +
>> +	ret = cppc_get_perf_ctrs(cpunum, &fb_ctrs_t0);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = cppc_get_perf_ctrs(cpunum, &fb_ctrs_t1);
>> +	if (ret)
>> +		return ret;
>> +
>> +	return cppc_get_rate_from_fbctrs(fb_ctrs_t0, fb_ctrs_t1);
>> +}
> We need to make sure that we get a reasonably sample so make sure the reported
> performance is accurate.
> The counters can capture short transient throttling/limiting, so by sampling a really
> short duration of time we could return quite inaccurate measure of performance.
> 
I would say it is a momentary thing that happens only when the frequency
is being ramped up/down.

> We need to include some reasonable delay between the two calls. What is reasonable
> is debatable - so this can be few(2-10) microseconds defined as default. If the same value
> doesn't work for all the platforms, we might need to add a platform specific value.
> 
cppc_get_perf_ctrs itself is a slow call: we have to format the CPC
packet, ring a doorbell, and then read the response from the
shared registers. My initial implementation had a delay, but in testing
I found that such a delay was unnecessary. Can you please
let me know whether it works without a delay on your platform?

Or else let me know whether udelay(10) is sufficient in between the
calls.
>> +
>>   static struct cpufreq_driver cppc_cpufreq_driver = {
>>   	.flags = CPUFREQ_CONST_LOOPS,
>>   	.verify = cppc_verify_policy,
>>   	.target = cppc_cpufreq_set_target,
>> +	.get = cppc_cpufreq_get_rate,
>>   	.init = cppc_cpufreq_cpu_init,
>>   	.stop_cpu = cppc_cpufreq_stop_cpu,
>>   	.name = "cppc_cpufreq",
>

Rafael J. Wysocki May 25, 2018, 8:46 a.m. UTC | #4

On Fri, May 25, 2018 at 8:27 AM, George Cherian
<gcherian@caviumnetworks.com> wrote:
> Hi Prashanth,
>
>
> On 05/25/2018 12:55 AM, Prakash, Prashanth wrote:
>>
>> Hi George,
>>
>> On 5/22/2018 5:42 AM, George Cherian wrote:
>>>
>>> Per Section 8.4.7.1.3 of ACPI 6.2, The platform provides performance
>>> feedback via set of performance counters. To determine the actual
>>> performance level delivered over time, OSPM may read a set of
>>> performance counters from the Reference Performance Counter Register
>>> and the Delivered Performance Counter Register.
>>>
>>> OSPM calculates the delivered performance over a given time period by
>>> taking a beginning and ending snapshot of both the reference and
>>> delivered performance counters, and calculating:
>>>
>>> delivered_perf = reference_perf X (delta of delivered_perf counter /
>>> delta of reference_perf counter).
>>>
>>> Implement the above and hook this to the cpufreq->get method.
>>>
>>> Signed-off-by: George Cherian <george.cherian@cavium.com>
>>> ---
>>>   drivers/cpufreq/cppc_cpufreq.c | 44
>>> ++++++++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 44 insertions(+)
>>>
>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c
>>> b/drivers/cpufreq/cppc_cpufreq.c
>>> index b15115a..a046915 100644
>>> --- a/drivers/cpufreq/cppc_cpufreq.c
>>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>>> @@ -240,10 +240,54 @@ static int cppc_cpufreq_cpu_init(struct
>>> cpufreq_policy *policy)
>>>         return ret;
>>>   }
>>>   +static int cppc_get_rate_from_fbctrs(struct cppc_perf_fb_ctrs
>>> fb_ctrs_t0,
>>> +                                    struct cppc_perf_fb_ctrs fb_ctrs_t1)
>>> +{
>>> +       u64 delta_reference, delta_delivered;
>>> +       u64 reference_perf, ratio;
>>> +
>>> +       reference_perf = fb_ctrs_t0.reference_perf;
>>> +       if (fb_ctrs_t1.reference > fb_ctrs_t0.reference)
>>> +               delta_reference = fb_ctrs_t1.reference -
>>> fb_ctrs_t0.reference;
>>> +       else /* Counters would have wrapped-around */
>>> +               delta_reference  = ((u64)(~((u64)0)) -
>>> fb_ctrs_t0.reference) +
>>> +                                       fb_ctrs_t1.reference;
>>> +
>>> +       if (fb_ctrs_t1.delivered > fb_ctrs_t0.delivered)
>>> +               delta_delivered = fb_ctrs_t1.delivered -
>>> fb_ctrs_t0.delivered;
>>> +       else /* Counters would have wrapped-around */
>>> +               delta_delivered  = ((u64)(~((u64)0)) -
>>> fb_ctrs_t0.delivered) +
>>> +                                       fb_ctrs_t1.delivered;
>>
>> We need to check that the wraparound time is long enough to make sure that
>> the counters cannot wrap around more than once. We can register a  get()
>> api
>> only after checking that wraparound time value is reasonably high.
>>
>> I am not aware of any platforms where wraparound time is soo short, but
>> wouldn't hurt to check once during init.
>
> By design the wraparound time is a 64 bit counter, for that matter even
> all the feedback counters too are 64 bit counters. I don't see any
> chance in which the counters can wraparound twice in back to back reads.
> The only situation is in which system itself is running at a really high
> frequency. Even in that case today's spec is not sufficient to support the
> same.
>
>>> +
>>> +       if (delta_reference)  /* Check to avoid divide-by zero */
>>> +               ratio = (delta_delivered * 1000) / delta_reference;
>>
>> Why not just return the computed value here instead of *1000 and later
>> /1000?
>> return (ref_per * delta_del) / delta_ref;
>
> Yes.
>>>
>>> +       else
>>> +               return -EINVAL;
>>
>> Instead of EINVAL, i think we should return current frequency.
>>
> Sorry, I didn't get you, How do you calculate the current frequency?
> Did you mean reference performance?
>
>> The counters can pause if CPUs are in idle state during our sampling
>> interval, so
>> If the counters did not progress, it is reasonable to assume the delivered
>> perf was
>> equal to desired perf.
>
> No, that is wrong. Here the check is for reference performance delta.
> This counter can never pause. In case of cpuidle only the delivered counters
> could pause. Delivered counters will pause only if the particular core
> enters power down mode, Otherwise we would be still clocking the core and we
> should be getting a delta across 2 sampling periods. In case if the
> reference counter is paused which by design is not correct then there is no
> point in returning reference performance numbers. That too is wrong. In case
> the low level FW is not updating the
> counters properly then it should be evident till Linux, instead of returning
> a bogus frequency.
>>
>>
>> Even if platform wanted to limit, since the CPUs were asleep(idle) we
>> could not have
>> observed lower performance, so we will not throw off  any logic that could
>> be driven
>> using the returned value.
>>>
>>> +
>>> +       return (reference_perf * ratio) / 1000;
>>
>> This should be converted to KHz as cpufreq is not aware of CPPC abstract
>> scale
>
> In our platform all performance registers are implemented in KHz. Because of
> which we never had an issue with conversion. I am  not
> aware whether ACPI mandates to use any particular unit. How is that
> implemented in your platform? Just to avoid any extra conversion don't
> you feel it is better to always report in KHz from firmware.
>
>>> +}
>>> +
>>> +static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
>>> +{
>>> +       struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0};
>>> +       int ret;
>>> +
>>> +       ret = cppc_get_perf_ctrs(cpunum, &fb_ctrs_t0);
>>> +       if (ret)
>>> +               return ret;
>>> +
>>> +       ret = cppc_get_perf_ctrs(cpunum, &fb_ctrs_t1);
>>> +       if (ret)
>>> +               return ret;
>>> +
>>> +       return cppc_get_rate_from_fbctrs(fb_ctrs_t0, fb_ctrs_t1);
>>> +}
>>
>> We need to make sure that we get a reasonably sample so make sure the
>> reported
>> performance is accurate.
>> The counters can capture short transient throttling/limiting, so by
>> sampling a really
>> short duration of time we could return quite inaccurate measure of
>> performance.
>>
> I would say it as a momentary thing only when the frequency is being ramped
> up/down.
>
>> We need to include some reasonable delay between the two calls. What is
>> reasonable
>> is debatable - so this can be few(2-10) microseconds defined as default.
>> If the same value
>> doesn't work for all the platforms, we might need to add a platform
>> specific value.
>>
> cppc_get_perf_ctrs itself is a slow call, we have to format the CPC packet
> and ring a doorbell and then the response to be read from the shared
> registers. My initial implementation had a delay but in testing,
> I found that it was unnecessary to have such a delay. Can you please
> let me know whether it works without delay in your platform?
>
> Or else let me know whether udelay(10) is sufficient in between the
> calls.
>
>>> +
>>>   static struct cpufreq_driver cppc_cpufreq_driver = {
>>>         .flags = CPUFREQ_CONST_LOOPS,
>>>         .verify = cppc_verify_policy,
>>>         .target = cppc_cpufreq_set_target,
>>> +       .get = cppc_cpufreq_get_rate,
>>>         .init = cppc_cpufreq_cpu_init,
>>>         .stop_cpu = cppc_cpufreq_stop_cpu,
>>>         .name = "cppc_cpufreq",
>>

I was about to apply the $subject patch, but now I would like you and
Prashanth to agree on it, so please ask Prashanth for an ACK on the
final version.

Prakash, Prashanth May 25, 2018, 9 p.m. UTC | #5

On 5/25/2018 12:27 AM, George Cherian wrote:
> Hi Prashanth,
>
> On 05/25/2018 12:55 AM, Prakash, Prashanth wrote:
>> Hi George,
>>
>> On 5/22/2018 5:42 AM, George Cherian wrote:
>>> Per Section 8.4.7.1.3 of ACPI 6.2, The platform provides performance
>>> feedback via set of performance counters. To determine the actual
>>> performance level delivered over time, OSPM may read a set of
>>> performance counters from the Reference Performance Counter Register
>>> and the Delivered Performance Counter Register.
>>>
>>> OSPM calculates the delivered performance over a given time period by
>>> taking a beginning and ending snapshot of both the reference and
>>> delivered performance counters, and calculating:
>>>
>>> delivered_perf = reference_perf X (delta of delivered_perf counter / delta of reference_perf counter).
>>>
>>> Implement the above and hook this to the cpufreq->get method.
>>>
>>> Signed-off-by: George Cherian <george.cherian@cavium.com>
>>> ---
>>>   drivers/cpufreq/cppc_cpufreq.c | 44 ++++++++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 44 insertions(+)
>>>
>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
>>> index b15115a..a046915 100644
>>> --- a/drivers/cpufreq/cppc_cpufreq.c
>>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>>> @@ -240,10 +240,54 @@ static int cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
>>>       return ret;
>>>   }
>>>   +static int cppc_get_rate_from_fbctrs(struct cppc_perf_fb_ctrs fb_ctrs_t0,
>>> +                     struct cppc_perf_fb_ctrs fb_ctrs_t1)
>>> +{
>>> +    u64 delta_reference, delta_delivered;
>>> +    u64 reference_perf, ratio;
>>> +
>>> +    reference_perf = fb_ctrs_t0.reference_perf;
>>> +    if (fb_ctrs_t1.reference > fb_ctrs_t0.reference)
>>> +        delta_reference = fb_ctrs_t1.reference - fb_ctrs_t0.reference;
>>> +    else /* Counters would have wrapped-around */
>>> +        delta_reference  = ((u64)(~((u64)0)) - fb_ctrs_t0.reference) +
>>> +                    fb_ctrs_t1.reference;
>>> +
>>> +    if (fb_ctrs_t1.delivered > fb_ctrs_t0.delivered)
>>> +        delta_delivered = fb_ctrs_t1.delivered - fb_ctrs_t0.delivered;
>>> +    else /* Counters would have wrapped-around */
>>> +        delta_delivered  = ((u64)(~((u64)0)) - fb_ctrs_t0.delivered) +
>>> +                    fb_ctrs_t1.delivered;
>> We need to check that the wraparound time is long enough to make sure that
>> the counters cannot wrap around more than once. We can register a  get() api
>> only after checking that wraparound time value is reasonably high.
>>
>> I am not aware of any platforms where wraparound time is soo short, but
>> wouldn't hurt to check once during init.
> By design the wraparound time is a 64 bit counter, for that matter even
> all the feedback counters too are 64 bit counters. I don't see any
> chance in which the counters can wraparound twice in back to back reads.
> The only situation is in which system itself is running at a really high
> frequency. Even in that case today's spec is not sufficient to support the same.

The spec doesn't say these have to be 64-bit registers. The wraparound
counter register is in the spec to communicate the worst-case (shortest)
counter rollover time.

As mentioned before, this is just a defensive check to make sure that
the platform has not set it to some very low number (which is allowed
by the spec).
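
Since the register width is not fixed, a wrap-aware delta has to take the width into account. The sketch below is one way to do that for a free-running counter of a given width; the function name and the `bits` parameter are hypothetical, and the result is only correct if the counter wrapped at most once between the two readings, which is what the wraparound-time check is meant to guarantee.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Wrap-aware difference between two readings of a free-running counter
 * that is `bits` wide (1..64). Both readings are assumed to fit in that
 * width, and the counter must have wrapped at most once between them.
 */
static uint64_t counter_delta(uint64_t t0, uint64_t t1, unsigned int bits)
{
	uint64_t mask;

	assert(bits >= 1 && bits <= 64);

	/* Shifting a 64-bit value by 64 is undefined, so handle it apart. */
	mask = (bits == 64) ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);

	/* Unsigned subtraction is modular, so masking yields the tick count. */
	return (t1 - t0) & mask;
}
```

Unlike the branch-based calculation in the patch as posted, equal readings give a delta of 0 here. In the posted code, equal readings appear to take the wrap-around branch and produce a very large delta, so the divide-by-zero check would not trigger.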

>
>>> +
>>> +    if (delta_reference)  /* Check to avoid divide-by zero */
>>> +        ratio = (delta_delivered * 1000) / delta_reference;
>> Why not just return the computed value here instead of *1000 and later /1000?
>> return (ref_per * delta_del) / delta_ref;
> Yes.
>>> +    else
>>> +        return -EINVAL;
>> Instead of EINVAL, i think we should return current frequency.
>>
> Sorry, I didn't get you, How do you calculate the current frequency?
> Did you mean reference performance?
I mean the performance that OSPM/Linux had requested earlier,
i.e. the desired_perf.
>
>> The counters can pause if CPUs are in idle state during our sampling interval, so
>> If the counters did not progress, it is reasonable to assume the delivered perf was
>> equal to desired perf.
> No, that is wrong. Here the check is for reference performance delta.
> This counter can never pause. In case of cpuidle only the delivered counters could pause. Delivered counters will pause only if the particular core enters power down mode, Otherwise we would be still clocking the core and we should be getting a delta across 2 sampling periods. In case if the reference counter is paused which by design is not correct then there is no point in returning reference performance numbers. That too is wrong. In case the low level FW is not updating the
> counters properly then it should be evident till Linux, instead of returning a bogus frequency.

Again, you are describing how it works on a specific platform and not
how it is described in the spec. Section 8.4.7.1.3.1.1 of ACPI 6.2 states
"The Reference Performance Counter Register counts at a fixed rate
any time the processor is active."

This implies the counters *may* pause in idle states. I can imagine an
implementation where you can keep this counter running and
account for it via the delivered counter, but we cannot make any
assumptions outside of what the spec describes.

>>
>> Even if platform wanted to limit, since the CPUs were asleep(idle) we could not have
>> observed lower performance, so we will not throw off  any logic that could be driven
>> using the returned value.
>>> +
>>> +    return (reference_perf * ratio) / 1000;
>> This should be converted to KHz as cpufreq is not aware of CPPC abstract scale
> In our platform all performance registers are implemented in KHz. Because of which we never had an issue with conversion. I am  not
> aware whether ACPI mandates to use any particular unit. How is that
> implemented in your platform? Just to avoid any extra conversion don't
> you feel it is better to always report in KHz from firmware.
Again, think of the spec, not a specific platform :)
- The CPPC spec works on an abstract scale and cpufreq works in KHz.
- The above computed value is in the abstract scale.
- The abstract scale may be in KHz on your platform, but we cannot assume the
same about all platforms.
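
One way to do the conversion is a linear mapping anchored at a known pair, for example the highest performance level and the corresponding maximum frequency. The sketch below assumes such a pair is available; the function and parameter names are hypothetical, and a real platform may need a different mapping.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical linear conversion from the CPPC abstract performance
 * scale to KHz, assuming highest_perf corresponds to max_khz and that
 * performance scales proportionally with frequency.
 */
static uint64_t perf_to_khz(uint64_t perf, uint64_t highest_perf,
			    uint64_t max_khz)
{
	assert(highest_perf != 0);

	return (perf * max_khz) / highest_perf;
}
```

For instance, if a highest_perf of 300 corresponds to 3000000 KHz, a delivered performance of 150 maps to 1500000 KHz.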
>
>>> +}
>>> +
>>> +static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
>>> +{
>>> +    struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0};
>>> +    int ret;
>>> +
>>> +    ret = cppc_get_perf_ctrs(cpunum, &fb_ctrs_t0);
>>> +    if (ret)
>>> +        return ret;
>>> +
>>> +    ret = cppc_get_perf_ctrs(cpunum, &fb_ctrs_t1);
>>> +    if (ret)
>>> +        return ret;
>>> +
>>> +    return cppc_get_rate_from_fbctrs(fb_ctrs_t0, fb_ctrs_t1);
>>> +}
>> We need to make sure that we get a reasonably sample so make sure the reported
>> performance is accurate.
>> The counters can capture short transient throttling/limiting, so by sampling a really
>> short duration of time we could return quite inaccurate measure of performance.
>>
> I would say it as a momentary thing only when the frequency is being ramped up/down.
This exact behavior would depend on how different limiting functions are implemented.
So this would vary from one platform to another.
>
>> We need to include some reasonable delay between the two calls. What is reasonable
>> is debatable - so this can be few(2-10) microseconds defined as default. If the same value
>> doesn't work for all the platforms, we might need to add a platform specific value.
>>
> cppc_get_perf_ctrs itself is a slow call, we have to format the CPC packet and ring a doorbell and then the response to be read from the shared registers. My initial implementation had a delay but in testing,
> I found that it was unnecessary to have such a delay. Can you please
> let me know whether it works without delay in your platform?
>
> Or else let me know whether udelay(10) is sufficient in between the
> calls.
Feedback counters need not be in PCC, so reading them is not necessarily slow.
2us should be sufficient.
>>> +
>>>   static struct cpufreq_driver cppc_cpufreq_driver = {
>>>       .flags = CPUFREQ_CONST_LOOPS,
>>>       .verify = cppc_verify_policy,
>>>       .target = cppc_cpufreq_set_target,
>>> +    .get = cppc_cpufreq_get_rate,
>>>       .init = cppc_cpufreq_cpu_init,
>>>       .stop_cpu = cppc_cpufreq_stop_cpu,
>>>       .name = "cppc_cpufreq",
>>

George Cherian May 28, 2018, 7:09 a.m. UTC | #6

Hi Prashanth,

On 05/26/2018 02:30 AM, Prakash, Prashanth wrote:
> 
> On 5/25/2018 12:27 AM, George Cherian wrote:
>> Hi Prashanth,
>>
>> On 05/25/2018 12:55 AM, Prakash, Prashanth wrote:
>>> Hi George,
>>>
>>> On 5/22/2018 5:42 AM, George Cherian wrote:
>>>> Per Section 8.4.7.1.3 of ACPI 6.2, The platform provides performance
>>>> feedback via set of performance counters. To determine the actual
>>>> performance level delivered over time, OSPM may read a set of
>>>> performance counters from the Reference Performance Counter Register
>>>> and the Delivered Performance Counter Register.
>>>>
>>>> OSPM calculates the delivered performance over a given time period by
>>>> taking a beginning and ending snapshot of both the reference and
>>>> delivered performance counters, and calculating:
>>>>
>>>> delivered_perf = reference_perf X (delta of delivered_perf counter / delta of reference_perf counter).
>>>>
>>>> Implement the above and hook this to the cpufreq->get method.
>>>>
>>>> Signed-off-by: George Cherian <george.cherian@cavium.com>
>>>> ---
>>>>    drivers/cpufreq/cppc_cpufreq.c | 44 ++++++++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 44 insertions(+)
>>>>
>>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
>>>> index b15115a..a046915 100644
>>>> --- a/drivers/cpufreq/cppc_cpufreq.c
>>>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>>>> @@ -240,10 +240,54 @@ static int cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
>>>>        return ret;
>>>>    }
>>>>    +static int cppc_get_rate_from_fbctrs(struct cppc_perf_fb_ctrs fb_ctrs_t0,
>>>> +                     struct cppc_perf_fb_ctrs fb_ctrs_t1)
>>>> +{
>>>> +    u64 delta_reference, delta_delivered;
>>>> +    u64 reference_perf, ratio;
>>>> +
>>>> +    reference_perf = fb_ctrs_t0.reference_perf;
>>>> +    if (fb_ctrs_t1.reference > fb_ctrs_t0.reference)
>>>> +        delta_reference = fb_ctrs_t1.reference - fb_ctrs_t0.reference;
>>>> +    else /* Counters would have wrapped-around */
>>>> +        delta_reference  = ((u64)(~((u64)0)) - fb_ctrs_t0.reference) +
>>>> +                    fb_ctrs_t1.reference;
>>>> +
>>>> +    if (fb_ctrs_t1.delivered > fb_ctrs_t0.delivered)
>>>> +        delta_delivered = fb_ctrs_t1.delivered - fb_ctrs_t0.delivered;
>>>> +    else /* Counters would have wrapped-around */
>>>> +        delta_delivered  = ((u64)(~((u64)0)) - fb_ctrs_t0.delivered) +
>>>> +                    fb_ctrs_t1.delivered;
>>> We need to check that the wraparound time is long enough to make sure that
>>> the counters cannot wrap around more than once. We can register a  get() api
>>> only after checking that wraparound time value is reasonably high.
>>>
>>> I am not aware of any platforms where wraparound time is soo short, but
>>> wouldn't hurt to check once during init.
>> By design the wraparound time is a 64 bit counter, for that matter even
>> all the feedback counters too are 64 bit counters. I don't see any
>> chance in which the counters can wraparound twice in back to back reads.
>> The only situation is in which system itself is running at a really high
>> frequency. Even in that case today's spec is not sufficient to support the same.
> 
> The spec doesn't say these have to be 64bit registers.  The wraparound
> counter register is in spec to communicate the worst case(shortest)
> counter rollover time.

The spec says these are 32- or 64-bit registers. It also defines the counter
wraparound time in seconds. The minimum possible value is 1, since zero
means the counters are assumed never to wrap around. Even on platforms
with the value set to 1 (1 sec), I don't see how the counters could
wrap around twice if we put a 2 usec delay between the two samples.
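
For illustration, here is a minimal userspace sketch of a wraparound-safe
delta for a 32- or 64-bit counter, assuming at most one wrap between the two
snapshots. counter_delta() and its width parameter are hypothetical names, not
part of the patch or of the kernel. Note that the "(~0 - t0) + t1" form in the
quoted patch comes out one short of the modular difference, and because it
compares with ">" it also treats two equal snapshots as a wrap.

```c
#include <stdint.h>

/*
 * Delta between two snapshots of a free-running counter that is
 * `width` bits wide, assuming it wrapped at most once in between.
 * Hypothetical helper for illustration only.
 */
static uint64_t counter_delta(uint64_t t0, uint64_t t1, unsigned int width)
{
	/* All-ones mask for the counter width; a 64-bit shift would be undefined. */
	uint64_t mask = (width >= 64) ? UINT64_MAX
				      : ((UINT64_C(1) << width) - 1);

	/*
	 * Unsigned subtraction is modulo 2^64; masking reduces the result
	 * to modulo 2^width, which covers a single wraparound.
	 */
	return (t1 - t0) & mask;
}
```

With equal snapshots this returns 0, so a later divide-by-zero check on the
reference delta still has something to catch.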

> 
> As as mentioned before this is just a defensive check to make sure that
> the platform has not set it to some very low number (which is allowed
> by the spec).
It might be unnecessary to have a check like this.
> 
>>
>>>> +
>>>> +    if (delta_reference)  /* Check to avoid divide-by zero */
>>>> +        ratio = (delta_delivered * 1000) / delta_reference;
>>> Why not just return the computed value here instead of *1000 and later /1000?
>>> return (ref_per * delta_del) / delta_ref;
>> Yes.
>>>> +    else
>>>> +        return -EINVAL;
>>> Instead of EINVAL, i think we should return current frequency.
>>>
>> Sorry, I didn't get you, How do you calculate the current frequency?
>> Did you mean reference performance?
> I mean the performance that OSPM/Linux had requested earlier.
> i.e the desired_perf
Okay, I will make necessary changes for this in v2.
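
A rough sketch of what the two review comments above amount to, written as
plain userspace C: compute the result in one step instead of scaling by 1000
and back, and fall back to the last requested (desired) performance when the
reference counter did not advance. The function name and the desired_perf
parameter are assumptions made for this example; the result is still in the
abstract CPPC scale, not KHz.

```c
#include <stdint.h>

/*
 * delivered_perf = reference_perf * (delta_delivered / delta_reference),
 * evaluated as one multiply followed by one divide.
 * Hypothetical helper; all values are in the abstract CPPC scale.
 */
static uint64_t delivered_perf(uint64_t reference_perf,
			       uint64_t delta_reference,
			       uint64_t delta_delivered,
			       uint64_t desired_perf)
{
	/* Counters did not progress (e.g. CPU idle): report what was requested. */
	if (delta_reference == 0)
		return desired_perf;

	/* Assumes the product fits in 64 bits, which holds for short sampling windows. */
	return (reference_perf * delta_delivered) / delta_reference;
}
```

For example, reference_perf = 100 with deltas of 1000 (reference) and 1500
(delivered) gives 150.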

>>
>>> The counters can pause if CPUs are in idle state during our sampling interval, so
>>> If the counters did not progress, it is reasonable to assume the delivered perf was
>>> equal to desired perf.
>> No, that is wrong. Here the check is for reference performance delta.
>> This counter can never pause. In case of cpuidle only the delivered counters could pause. Delivered counters will pause only if the particular core enters power down mode, Otherwise we would be still clocking the core and we should be getting a delta across 2 sampling periods. In case if the reference counter is paused which by design is not correct then there is no point in returning reference performance numbers. That too is wrong. In case the low level FW is not updating the
>> counters properly then it should be evident till Linux, instead of returning a bogus frequency.
> 
> Again you are describing how it works on a specific platform and not
> how it is described in spec. Section 8.4.7.1.3.1.1 of ACPI 6.2 states
> "The Reference Performance Counter Register counts at a fixed rate
> any time the processor is active."
> Implies the counters *may* pause in idle state - I can imagine an
> implementation where you can keep this counter running and
> account for it via delivered counter, but we cannot make any
> assumptions outside of what the spec describes.
> 
>>>
>>> Even if platform wanted to limit, since the CPUs were asleep(idle) we could not have
>>> observed lower performance, so we will not throw off  any logic that could be driven
>>> using the returned value.
>>>> +
>>>> +    return (reference_perf * ratio) / 1000;
>>> This should be converted to KHz as cpufreq is not aware of CPPC abstract scale
>> In our platform all performance registers are implemented in KHz. Because of which we never had an issue with conversion. I am  not
>> aware whether ACPI mandates to use any particular unit. How is that
>> implemented in your platform? Just to avoid any extra conversion don't
>> you feel it is better to always report in KHz from firmware.
> Again think of spec not a specific platform :)
> - The CPPC spec works on abstract scale and cpufreq works in KHz.
> - The above computed value is in abstract scale
> - The abstarct scale may be in KHz on your platform, but we cannot assume the
> same about all the platforms
For now, can I assume it is in KHz?
I am not sure how to convert the abstract scale to KHz.
Can you please give me some pointers?
The spec currently has no interface that tells what the abstract
scale is!
>>
>>>> +}
>>>> +
>>>> +static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
>>>> +{
>>>> +    struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0};
>>>> +    int ret;
>>>> +
>>>> +    ret = cppc_get_perf_ctrs(cpunum, &fb_ctrs_t0);
>>>> +    if (ret)
>>>> +        return ret;
>>>> +
>>>> +    ret = cppc_get_perf_ctrs(cpunum, &fb_ctrs_t1);
>>>> +    if (ret)
>>>> +        return ret;
>>>> +
>>>> +    return cppc_get_rate_from_fbctrs(fb_ctrs_t0, fb_ctrs_t1);
>>>> +}
>>> We need to make sure that we get a reasonably sample so make sure the reported
>>> performance is accurate.
>>> The counters can capture short transient throttling/limiting, so by sampling a really
>>> short duration of time we could return quite inaccurate measure of performance.
>>>
>> I would say it as a momentary thing only when the frequency is being ramped up/down.
> This exact behavior would depend on how different limiting functions are implemented.
> So this would vary from one platform to another.
>>
>>> We need to include some reasonable delay between the two calls. What is reasonable
>>> is debatable - so this can be few(2-10) microseconds defined as default. If the same value
>>> doesn't work for all the platforms, we might need to add a platform specific value.
>>>
>> cppc_get_perf_ctrs itself is a slow call, we have to format the CPC packet and ring a doorbell and then the response to be read from the shared registers. My initial implementation had a delay but in testing,
>> I found that it was unnecessary to have such a delay. Can you please
>> let me know whether it works without delay in your platform?
>>
>> Or else let me know whether udelay(10) is sufficient in between the
>> calls.
> Feedback counters need not be in PCC .
> 2us should be sufficient.
Yes, I will add this to v2.
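
The sampling sequence agreed on above could look roughly like the sketch
below. It is userspace C with the counter read and the delay passed in as
callbacks, so it can be exercised without hardware; in the driver these would
presumably be cppc_get_perf_ctrs() and udelay(2). The struct, the typedefs and
the fake_* helpers are made up for this example. Returning 0 on a failed read
is an assumption of the sketch, chosen because the callback's return type is
unsigned and cannot carry a negative errno.

```c
#include <stdint.h>

struct fb_ctrs {
	uint64_t reference;      /* reference performance counter */
	uint64_t delivered;      /* delivered performance counter */
	uint64_t reference_perf; /* reference performance level   */
};

typedef int (*read_ctrs_fn)(unsigned int cpu, struct fb_ctrs *out);
typedef void (*delay_us_fn)(unsigned int usecs);

/* Take two snapshots 2 us apart and derive the delivered performance. */
static unsigned int sample_delivered_perf(unsigned int cpu,
					  read_ctrs_fn read_ctrs,
					  delay_us_fn delay_us,
					  unsigned int desired_perf)
{
	struct fb_ctrs t0 = {0}, t1 = {0};
	uint64_t delta_reference, delta_delivered;

	if (read_ctrs(cpu, &t0))
		return 0;	/* read failed */
	delay_us(2);		/* give the counters time to advance */
	if (read_ctrs(cpu, &t1))
		return 0;

	/* Unsigned subtraction already handles one 64-bit wraparound. */
	delta_reference = t1.reference - t0.reference;
	delta_delivered = t1.delivered - t0.delivered;
	if (delta_reference == 0)
		return desired_perf;	/* counters paused during the window */

	return (unsigned int)((t0.reference_perf * delta_delivered) /
			      delta_reference);
}

/* Simulated counter source: delivered runs at 1.5x the reference rate. */
static uint64_t fake_usecs;

static int fake_read_ctrs(unsigned int cpu, struct fb_ctrs *out)
{
	(void)cpu;
	out->reference = fake_usecs * 100;
	out->delivered = fake_usecs * 150;
	out->reference_perf = 100;
	return 0;
}

static void fake_delay_us(unsigned int usecs) { fake_usecs += usecs; }

/* A delay during which the simulated counters do not move. */
static void stalled_delay_us(unsigned int usecs) { (void)usecs; }

static int failing_read_ctrs(unsigned int cpu, struct fb_ctrs *out)
{
	(void)cpu;
	(void)out;
	return -1;
}
```

Passing the read and delay functions in keeps the arithmetic testable on its
own; the value returned is still in the abstract scale and would need
converting before being handed to cpufreq.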
>>>> +
>>>>    static struct cpufreq_driver cppc_cpufreq_driver = {
>>>>        .flags = CPUFREQ_CONST_LOOPS,
>>>>        .verify = cppc_verify_policy,
>>>>        .target = cppc_cpufreq_set_target,
>>>> +    .get = cppc_cpufreq_get_rate,
>>>>        .init = cppc_cpufreq_cpu_init,
>>>>        .stop_cpu = cppc_cpufreq_stop_cpu,
>>>>        .name = "cppc_cpufreq",
>>>
>

Prakash, Prashanth May 29, 2018, 3:44 p.m. UTC | #7

On 5/28/2018 1:09 AM, George Cherian wrote:
> Hi Prashanth,
>
> On 05/26/2018 02:30 AM, Prakash, Prashanth wrote:
>>
>> On 5/25/2018 12:27 AM, George Cherian wrote:
>>> Hi Prashanth,
>>>
>>> On 05/25/2018 12:55 AM, Prakash, Prashanth wrote:
>>>> Hi George,
>>>>
>>>> On 5/22/2018 5:42 AM, George Cherian wrote:
>>>>> Per Section 8.4.7.1.3 of ACPI 6.2, The platform provides performance
>>>>> feedback via set of performance counters. To determine the actual
>>>>> performance level delivered over time, OSPM may read a set of
>>>>> performance counters from the Reference Performance Counter Register
>>>>> and the Delivered Performance Counter Register.
>>>>>
>>>>> OSPM calculates the delivered performance over a given time period by
>>>>> taking a beginning and ending snapshot of both the reference and
>>>>> delivered performance counters, and calculating:
>>>>>
>>>>> delivered_perf = reference_perf X (delta of delivered_perf counter / delta of reference_perf counter).
>>>>>
>>>>> Implement the above and hook this to the cpufreq->get method.
>>>>>
>>>>> Signed-off-by: George Cherian <george.cherian@cavium.com>
>>>>> ---
>>>>>    drivers/cpufreq/cppc_cpufreq.c | 44 ++++++++++++++++++++++++++++++++++++++++++
>>>>>    1 file changed, 44 insertions(+)
>>>>>
>>>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
>>>>> index b15115a..a046915 100644
>>>>> --- a/drivers/cpufreq/cppc_cpufreq.c
>>>>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>>>>> @@ -240,10 +240,54 @@ static int cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
>>>>>        return ret;
>>>>>    }
>>>>>    +static int cppc_get_rate_from_fbctrs(struct cppc_perf_fb_ctrs fb_ctrs_t0,
>>>>> +                     struct cppc_perf_fb_ctrs fb_ctrs_t1)
>>>>> +{
>>>>> +    u64 delta_reference, delta_delivered;
>>>>> +    u64 reference_perf, ratio;
>>>>> +
>>>>> +    reference_perf = fb_ctrs_t0.reference_perf;
>>>>> +    if (fb_ctrs_t1.reference > fb_ctrs_t0.reference)
>>>>> +        delta_reference = fb_ctrs_t1.reference - fb_ctrs_t0.reference;
>>>>> +    else /* Counters would have wrapped-around */
>>>>> +        delta_reference  = ((u64)(~((u64)0)) - fb_ctrs_t0.reference) +
>>>>> +                    fb_ctrs_t1.reference;
>>>>> +
>>>>> +    if (fb_ctrs_t1.delivered > fb_ctrs_t0.delivered)
>>>>> +        delta_delivered = fb_ctrs_t1.delivered - fb_ctrs_t0.delivered;
>>>>> +    else /* Counters would have wrapped-around */
>>>>> +        delta_delivered  = ((u64)(~((u64)0)) - fb_ctrs_t0.delivered) +
>>>>> +                    fb_ctrs_t1.delivered;
>>>> We need to check that the wraparound time is long enough to make sure that
>>>> the counters cannot wrap around more than once. We can register a  get() api
>>>> only after checking that wraparound time value is reasonably high.
>>>>
>>>> I am not aware of any platforms where wraparound time is soo short, but
>>>> wouldn't hurt to check once during init.
>>> By design the wraparound time is a 64 bit counter, for that matter even
>>> all the feedback counters too are 64 bit counters. I don't see any
>>> chance in which the counters can wraparound twice in back to back reads.
>>> The only situation is in which system itself is running at a really high
>>> frequency. Even in that case today's spec is not sufficient to support the same.
>>
>> The spec doesn't say these have to be 64bit registers.  The wraparound
>> counter register is in spec to communicate the worst case(shortest)
>> counter rollover time.
>
> Spec says these are 32 or 64 bit registers. Spec also defines counter
> wraparound time in seconds. The minimum value possible is 1 as zero means the counters are never assumed to wrap around. Even in platforms with value set as 1 (1 sec) I dont really see a situation in which
> the counter can wraparound twice if we are putting a delay of 2usec
> between sampling.
ok.
>
>>
>> As as mentioned before this is just a defensive check to make sure that
>> the platform has not set it to some very low number (which is allowed
>> by the spec).
> It might be unnecessary to have a check like this.
>>
>>>
>>>>> +
>>>>> +    if (delta_reference)  /* Check to avoid divide-by zero */
>>>>> +        ratio = (delta_delivered * 1000) / delta_reference;
>>>> Why not just return the computed value here instead of *1000 and later /1000?
>>>> return (ref_per * delta_del) / delta_ref;
>>> Yes.
>>>>> +    else
>>>>> +        return -EINVAL;
>>>> Instead of EINVAL, i think we should return current frequency.
>>>>
>>> Sorry, I didn't get you, How do you calculate the current frequency?
>>> Did you mean reference performance?
>> I mean the performance that OSPM/Linux had requested earlier.
>> i.e the desired_perf
> Okay, I will make necessary changes for this in v2.
>
>>>
>>>> The counters can pause if CPUs are in idle state during our sampling interval, so
>>>> If the counters did not progress, it is reasonable to assume the delivered perf was
>>>> equal to desired perf.
>>> No, that is wrong. Here the check is for reference performance delta.
>>> This counter can never pause. In case of cpuidle only the delivered counters could pause. Delivered counters will pause only if the particular core enters power down mode, Otherwise we would be still clocking the core and we should be getting a delta across 2 sampling periods. In case if the reference counter is paused which by design is not correct then there is no point in returning reference performance numbers. That too is wrong. In case the low level FW is not updating the
>>> counters properly then it should be evident till Linux, instead of returning a bogus frequency.
>>
>> Again you are describing how it works on a specific platform and not
>> how it is described in spec. Section 8.4.7.1.3.1.1 of ACPI 6.2 states
>> "The Reference Performance Counter Register counts at a fixed rate
>> any time the processor is active."
>> Implies the counters *may* pause in idle state - I can imagine an
>> implementation where you can keep this counter running and
>> account for it via delivered counter, but we cannot make any
>> assumptions outside of what the spec describes.
>>
>>>>
>>>> Even if platform wanted to limit, since the CPUs were asleep(idle) we could not have
>>>> observed lower performance, so we will not throw off  any logic that could be driven
>>>> using the returned value.
>>>>> +
>>>>> +    return (reference_perf * ratio) / 1000;
>>>> This should be converted to KHz as cpufreq is not aware of CPPC abstract scale
>>> In our platform all performance registers are implemented in KHz. Because of which we never had an issue with conversion. I am  not
>>> aware whether ACPI mandates to use any particular unit. How is that
>>> implemented in your platform? Just to avoid any extra conversion don't
>>> you feel it is better to always report in KHz from firmware.
>> Again think of spec not a specific platform :)
>> - The CPPC spec works on abstract scale and cpufreq works in KHz.
>> - The above computed value is in abstract scale
>> - The abstarct scale may be in KHz on your platform, but we cannot assume the
>> same about all the platforms
> For now can I assume it to be in KHz only?

No, it will break platforms where the abstract scale is not in KHz.

> I am not sure how to convert the abstract scale to Khz.
> Can you please give me some pointers on the same?

Take a look at cppc_cpufreq_perf_to_khz and cppc_cpufreq_khz_to_perf
in the same file (cppc_cpufreq.c). We use this in almost every function
registered with core cpufreq.
> In spec there is currently no interface which tells what is the abstract
> scale!!

CPPC v3 adds some additional hooks for this. On CPPC v2, we try to use
a few DMI table entries to get the ratio between the abstract scale and KHz.
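
To make the idea of a ratio between the abstract scale and KHz concrete, here
is a simplified userspace sketch. It assumes the relationship is linear and
that a maximum frequency in KHz corresponding to the highest performance
level is already known (for instance from firmware tables). perf_to_khz() and
its parameters are hypothetical and are not the body of
cppc_cpufreq_perf_to_khz.

```c
#include <stdint.h>

/*
 * Scale an abstract performance value to KHz using the ratio
 * max_khz / highest_perf. Hypothetical helper for illustration.
 */
static uint64_t perf_to_khz(uint64_t perf, uint64_t max_khz,
			    uint64_t highest_perf)
{
	/* Without a valid highest_perf there is no ratio to apply. */
	if (highest_perf == 0)
		return 0;

	/* Multiply before dividing to keep integer precision. */
	return (perf * max_khz) / highest_perf;
}
```

With highest_perf = 300 and max_khz = 3000000, a delivered performance of 150
maps to 1500000 KHz.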

>>>
>>>>> +}
>>>>> +
>>>>> +static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
>>>>> +{
>>>>> +    struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0};
>>>>> +    int ret;
>>>>> +
>>>>> +    ret = cppc_get_perf_ctrs(cpunum, &fb_ctrs_t0);
>>>>> +    if (ret)
>>>>> +        return ret;
>>>>> +
>>>>> +    ret = cppc_get_perf_ctrs(cpunum, &fb_ctrs_t1);
>>>>> +    if (ret)
>>>>> +        return ret;
>>>>> +
>>>>> +    return cppc_get_rate_from_fbctrs(fb_ctrs_t0, fb_ctrs_t1);
>>>>> +}
>>>> We need to make sure that we get a reasonably sample so make sure the reported
>>>> performance is accurate.
>>>> The counters can capture short transient throttling/limiting, so by sampling a really
>>>> short duration of time we could return quite inaccurate measure of performance.
>>>>
>>> I would say it as a momentary thing only when the frequency is being ramped up/down.
>> This exact behavior would depend on how different limiting functions are implemented.
>> So this would vary from one platform to another.
>>>
>>>> We need to include some reasonable delay between the two calls. What is reasonable
>>>> is debatable - so this can be few(2-10) microseconds defined as default. If the same value
>>>> doesn't work for all the platforms, we might need to add a platform specific value.
>>>>
>>> cppc_get_perf_ctrs itself is a slow call, we have to format the CPC packet and ring a doorbell and then the response to be read from the shared registers. My initial implementation had a delay but in testing,
>>> I found that it was unnecessary to have such a delay. Can you please
>>> let me know whether it works without delay in your platform?
>>>
>>> Or else let me know whether udelay(10) is sufficient in between the
>>> calls.
>> Feedback counters need not be in PCC .
>> 2us should be sufficient.
> Yes I will add this to v2.
>>>>> +
>>>>>    static struct cpufreq_driver cppc_cpufreq_driver = {
>>>>>        .flags = CPUFREQ_CONST_LOOPS,
>>>>>        .verify = cppc_verify_policy,
>>>>>        .target = cppc_cpufreq_set_target,
>>>>> +    .get = cppc_cpufreq_get_rate,
>>>>>        .init = cppc_cpufreq_cpu_init,
>>>>>        .stop_cpu = cppc_cpufreq_stop_cpu,
>>>>>        .name = "cppc_cpufreq",
>>>>
>>

George Cherian May 31, 2018, 6:33 a.m. UTC | #8

Hi Prashanth,
On 05/29/2018 09:14 PM, Prakash, Prashanth wrote:
> 
> On 5/28/2018 1:09 AM, George Cherian wrote:
>> Hi Prashanth,
>>
>> On 05/26/2018 02:30 AM, Prakash, Prashanth wrote:
>>>
>>> On 5/25/2018 12:27 AM, George Cherian wrote:
>>>> Hi Prashanth,
>>>>
>>>> On 05/25/2018 12:55 AM, Prakash, Prashanth wrote:
>>>>> Hi George,
>>>>>
>>>>> On 5/22/2018 5:42 AM, George Cherian wrote:
>>>>>> Per Section 8.4.7.1.3 of ACPI 6.2, The platform provides performance
>>>>>> feedback via set of performance counters. To determine the actual
>>>>>> performance level delivered over time, OSPM may read a set of
>>>>>> performance counters from the Reference Performance Counter Register
>>>>>> and the Delivered Performance Counter Register.
>>>>>>
>>>>>> OSPM calculates the delivered performance over a given time period by
>>>>>> taking a beginning and ending snapshot of both the reference and
>>>>>> delivered performance counters, and calculating:
>>>>>>
>>>>>> delivered_perf = reference_perf X (delta of delivered_perf counter / delta of reference_perf counter).
>>>>>>
>>>>>> Implement the above and hook this to the cpufreq->get method.
>>>>>>
>>>>>> Signed-off-by: George Cherian <george.cherian@cavium.com>
>>>>>> ---
>>>>>>     drivers/cpufreq/cppc_cpufreq.c | 44 ++++++++++++++++++++++++++++++++++++++++++
>>>>>>     1 file changed, 44 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
>>>>>> index b15115a..a046915 100644
>>>>>> --- a/drivers/cpufreq/cppc_cpufreq.c
>>>>>> +++ b/drivers/cpufreq/cppc_cpufreq.c
>>>>>> @@ -240,10 +240,54 @@ static int cppc_cpufreq_cpu_init(struct cpufreq_policy *policy)
>>>>>>         return ret;
>>>>>>     }
>>>>>>     +static int cppc_get_rate_from_fbctrs(struct cppc_perf_fb_ctrs fb_ctrs_t0,
>>>>>> +                     struct cppc_perf_fb_ctrs fb_ctrs_t1)
>>>>>> +{
>>>>>> +    u64 delta_reference, delta_delivered;
>>>>>> +    u64 reference_perf, ratio;
>>>>>> +
>>>>>> +    reference_perf = fb_ctrs_t0.reference_perf;
>>>>>> +    if (fb_ctrs_t1.reference > fb_ctrs_t0.reference)
>>>>>> +        delta_reference = fb_ctrs_t1.reference - fb_ctrs_t0.reference;
>>>>>> +    else /* Counters would have wrapped-around */
>>>>>> +        delta_reference  = ((u64)(~((u64)0)) - fb_ctrs_t0.reference) +
>>>>>> +                    fb_ctrs_t1.reference;
>>>>>> +
>>>>>> +    if (fb_ctrs_t1.delivered > fb_ctrs_t0.delivered)
>>>>>> +        delta_delivered = fb_ctrs_t1.delivered - fb_ctrs_t0.delivered;
>>>>>> +    else /* Counters would have wrapped-around */
>>>>>> +        delta_delivered  = ((u64)(~((u64)0)) - fb_ctrs_t0.delivered) +
>>>>>> +                    fb_ctrs_t1.delivered;
>>>>> We need to check that the wraparound time is long enough to make sure that
>>>>> the counters cannot wrap around more than once. We can register a  get() api
>>>>> only after checking that wraparound time value is reasonably high.
>>>>>
>>>>> I am not aware of any platforms where wraparound time is soo short, but
>>>>> wouldn't hurt to check once during init.
>>>> By design the wraparound time is a 64 bit counter, for that matter even
>>>> all the feedback counters too are 64 bit counters. I don't see any
>>>> chance in which the counters can wraparound twice in back to back reads.
>>>> The only situation is in which system itself is running at a really high
>>>> frequency. Even in that case today's spec is not sufficient to support the same.
>>>
>>> The spec doesn't say these have to be 64bit registers.  The wraparound
>>> counter register is in spec to communicate the worst case(shortest)
>>> counter rollover time.
>>
>> Spec says these are 32 or 64 bit registers. Spec also defines counter
>> wraparound time in seconds. The minimum value possible is 1 as zero means the counters are never assumed to wrap around. Even in platforms with value set as 1 (1 sec) I dont really see a situation in which
>> the counter can wraparound twice if we are putting a delay of 2usec
>> between sampling.
> ok.
Thanks
>>
>>>
>>> As as mentioned before this is just a defensive check to make sure that
>>> the platform has not set it to some very low number (which is allowed
>>> by the spec).
>> It might be unnecessary to have a check like this.
>>>
>>>>
>>>>>> +
>>>>>> +    if (delta_reference)  /* Check to avoid divide-by zero */
>>>>>> +        ratio = (delta_delivered * 1000) / delta_reference;
>>>>> Why not just return the computed value here instead of *1000 and later /1000?
>>>>> return (ref_per * delta_del) / delta_ref;
>>>> Yes.
>>>>>> +    else
>>>>>> +        return -EINVAL;
>>>>> Instead of EINVAL, i think we should return current frequency.
>>>>>
>>>> Sorry, I didn't get you, How do you calculate the current frequency?
>>>> Did you mean reference performance?
>>> I mean the performance that OSPM/Linux had requested earlier.
>>> i.e the desired_perf
>> Okay, I will make necessary changes for this in v2.
>>
>>>>
>>>>> The counters can pause if CPUs are in idle state during our sampling interval, so
>>>>> If the counters did not progress, it is reasonable to assume the delivered perf was
>>>>> equal to desired perf.
>>>> No, that is wrong. Here the check is for reference performance delta.
>>>> This counter can never pause. In case of cpuidle only the delivered counters could pause. Delivered counters will pause only if the particular core enters power down mode, Otherwise we would be still clocking the core and we should be getting a delta across 2 sampling periods. In case if the reference counter is paused which by design is not correct then there is no point in returning reference performance numbers. That too is wrong. In case the low level FW is not updating the
>>>> counters properly then it should be evident till Linux, instead of returning a bogus frequency.
>>>
>>> Again you are describing how it works on a specific platform and not
>>> how it is described in spec. Section 8.4.7.1.3.1.1 of ACPI 6.2 states
>>> "The Reference Performance Counter Register counts at a fixed rate
>>> any time the processor is active."
>>> Implies the counters *may* pause in idle state - I can imagine an
>>> implementation where you can keep this counter running and
>>> account for it via delivered counter, but we cannot make any
>>> assumptions outside of what the spec describes.
>>>
>>>>>
>>>>> Even if platform wanted to limit, since the CPUs were asleep(idle) we could not have
>>>>> observed lower performance, so we will not throw off  any logic that could be driven
>>>>> using the returned value.
>>>>>> +
>>>>>> +    return (reference_perf * ratio) / 1000;
>>>>> This should be converted to KHz as cpufreq is not aware of CPPC abstract scale
>>>> In our platform all performance registers are implemented in KHz. Because of which we never had an issue with conversion. I am  not
>>>> aware whether ACPI mandates to use any particular unit. How is that
>>>> implemented in your platform? Just to avoid any extra conversion don't
>>>> you feel it is better to always report in KHz from firmware.
>>> Again think of spec not a specific platform :)
>>> - The CPPC spec works on abstract scale and cpufreq works in KHz.
>>> - The above computed value is in abstract scale
>>> - The abstarct scale may be in KHz on your platform, but we cannot assume the
>>> same about all the platforms
>> For now can I assume it to be in KHz only?
> 
> No, it will break platforms where abstract scale is not in KHz.
> 
>> I am not sure how to convert the abstract scale to Khz.
>> Can you please give me some pointers on the same?
> 
> Take a look at cppc_cpufreq_perf_to_khz and cppc_cpufreq_khz_to_perf
> in the same file (cppc_cpufreq.c). We use this in almost every function
> registered with core cpufreq.
>> In spec there is currently no interface which tells what is the abstract
>> scale!!
> 
> CPPC v3 adds some additional hooks for this. On CPPC v2, we try to use
> few DMI table entries to get the ratio between abstract scale and KHz.

Okay, I will address these in v2.
> 
>>>>
>>>>>> +}
>>>>>> +
>>>>>> +static unsigned int cppc_cpufreq_get_rate(unsigned int cpunum)
>>>>>> +{
>>>>>> +    struct cppc_perf_fb_ctrs fb_ctrs_t0 = {0}, fb_ctrs_t1 = {0};
>>>>>> +    int ret;
>>>>>> +
>>>>>> +    ret = cppc_get_perf_ctrs(cpunum, &fb_ctrs_t0);
>>>>>> +    if (ret)
>>>>>> +        return ret;
>>>>>> +
>>>>>> +    ret = cppc_get_perf_ctrs(cpunum, &fb_ctrs_t1);
>>>>>> +    if (ret)
>>>>>> +        return ret;
>>>>>> +
>>>>>> +    return cppc_get_rate_from_fbctrs(fb_ctrs_t0, fb_ctrs_t1);
>>>>>> +}
>>>>> We need to make sure that we get a reasonably sample so make sure the reported
>>>>> performance is accurate.
>>>>> The counters can capture short transient throttling/limiting, so by sampling a really
>>>>> short duration of time we could return quite inaccurate measure of performance.
>>>>>
>>>> I would say it as a momentary thing only when the frequency is being ramped up/down.
>>> This exact behavior would depend on how different limiting functions are implemented.
>>> So this would vary from one platform to another.
>>>>
>>>>> We need to include some reasonable delay between the two calls. What is reasonable
>>>>> is debatable - so this can be few(2-10) microseconds defined as default. If the same value
>>>>> doesn't work for all the platforms, we might need to add a platform specific value.
>>>>>
>>>> cppc_get_perf_ctrs itself is a slow call, we have to format the CPC packet and ring a doorbell and then the response to be read from the shared registers. My initial implementation had a delay but in testing,
>>>> I found that it was unnecessary to have such a delay. Can you please
>>>> let me know whether it works without delay in your platform?
>>>>
>>>> Or else let me know whether udelay(10) is sufficient in between the
>>>> calls.
>>> Feedback counters need not be in PCC .
>>> 2us should be sufficient.
>> Yes I will add this to v2.
>>>>>> +
>>>>>>     static struct cpufreq_driver cppc_cpufreq_driver = {
>>>>>>         .flags = CPUFREQ_CONST_LOOPS,
>>>>>>         .verify = cppc_verify_policy,
>>>>>>         .target = cppc_cpufreq_set_target,
>>>>>> +    .get = cppc_cpufreq_get_rate,
>>>>>>         .init = cppc_cpufreq_cpu_init,
>>>>>>         .stop_cpu = cppc_cpufreq_stop_cpu,
>>>>>>         .name = "cppc_cpufreq",
>>>>>
>>>
>
