diff mbox

Commit 554c8aa8ecad causing severe performance degression with pcc-cpufreq

Message ID 20180717065048.74mmgk4t5utjaa6a@suselix (mailing list archive)
State Not Applicable, archived
Headers show

Commit Message

Andreas Herrmann July 17, 2018, 6:50 a.m. UTC
Hello,

I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select
idle state before stopping the tick") causes severe performance drop
for systems using pcc-cpufreq driver. Depending on the number of CPUs
the system might be almost unusable. The OS jitter for 4.17.y and
4.18.-rcx kernels is off the charts, you can even spot it with top
command (issued when the system is supposedly idle), e.g.

 top - 14:44:24 up 2 min,  1 user,  load average: 90.11, 38.20, 14.38
 Tasks: 1199 total, 109 running, 541 sleeping,   0 stopped,   0 zombie
 %Cpu(s):  1.2 us, 58.7 sy,  0.0 ni, 39.3 id,  0.6 wa,  0.0 hi,  0.3 si,  0.0 st
 KiB Mem:  13137064+total,  1192168 used, 13017848+free,     2340 buffers
 KiB Swap:  2104316 total,        0 used,  2104316 free.   522296 cached Mem

   PID USER      PR  NI    VIRT    RES    SHR S    %CPU  %MEM     TIME+ COMMAND
  3373 root      20   0  982024  49916  36120 R  96.691 0.038   0:19.54 kubelet
    67 root      20   0       0      0      0 R  78.676 0.000   0:49.36 kworker/9:0
    25 root      20   0       0      0      0 R  78.125 0.000   0:49.67 kworker/2:0
   182 root      20   0       0      0      0 R  75.735 0.000   1:18.17 kworker/28:0
    43 root      20   0       0      0      0 R  75.000 0.000   0:11.56 kworker/5:0
   103 root      20   0       0      0      0 R  74.449 0.000   0:46.83 kworker/15:0
   334 root      20   0       0      0      0 R  72.978 0.000   1:06.88 kworker/53:0
   789 root      20   0       0      0      0 R  69.853 0.000   1:29.50 kworker/38:1
   418 root      20   0       0      0      0 R  69.301 0.000   0:41.33 kworker/67:0
   779 root      20   0       0      0      0 R  68.934 0.000   1:33.60 kworker/27:1
   773 root      20   0       0      0      0 R  68.566 0.000   1:37.91 kworker/22:1
   762 root      20   0       0      0      0 R  68.015 0.000   1:41.01 kworker/11:1
   769 root      20   0       0      0      0 R  67.647 0.000   1:37.65 kworker/18:1
   805 root      20   0       0      0      0 R  67.096 0.000   1:30.96 kworker/54:1
   840 root      20   0       0      0      0 R  66.912 0.000   1:23.82 kworker/89:1
   812 root      20   0       0      0      0 R  66.728 0.000   1:31.89 kworker/59:1
   847 root      20   0       0      0      0 R  66.360 0.000   1:28.40 kworker/96:1
   763 root      20   0       0      0      0 R  66.176 0.000   1:42.57 kworker/12:1
   772 root      20   0       0      0      0 R  66.176 0.000   1:12.58 kworker/21:1
   821 root      20   0       0      0      0 R  66.176 0.000   1:29.62 kworker/69:1
   923 root      20   0       0      0      0 R  65.809 0.000   1:44.32 kworker/3:18
  1284 root      20   0       0      0      0 R  65.809 0.000   1:23.50 kworker/101:2
    61 root      20   0       0      0      0 R  65.625 0.000   1:29.37 kworker/8:0
  3531 root      20   0   24384   3768   2356 R  65.625 0.003   0:08.91 top
   771 root      20   0       0      0      0 R  65.074 0.000   1:37.90 kworker/20:1
   767 root      20   0       0      0      0 R  64.706 0.000   1:38.01 kworker/16:1
   764 root      20   0       0      0      0 R  64.522 0.000   1:40.28 kworker/13:1
   765 root      20   0       0      0      0 R  64.154 0.000   1:40.13 kworker/14:1

When I apply below patch (trying to revert essential parts of commit
554c8aa8ecad) behaviour seems back to normal.

I know that pcc-cpufreq driver is not "state-of-the-art" when it comes
to cpufreq drivers and you better not use it. But I wonder whether
commit 554c8aa8ecad ("sched: idle: Select idle state before stopping
the tick") introduced bad behaviour for other cases as well.

I'll send some performance results to illustrate the issue asap. I've
also tried to modify pcc-cpufreq to reduce the amount of frequency
changes triggered by this driver but this does not help for kernels
where commit 554c8aa8ecad is applied.


Andreas

---

Comments

Rafael J. Wysocki July 17, 2018, 7:33 a.m. UTC | #1
Hi,

Thanks for your report!

On Tue, Jul 17, 2018 at 8:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote:
> Hello,
>
> I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select
> idle state before stopping the tick") causes severe performance drop
> for systems using pcc-cpufreq driver. Depending on the number of CPUs
> the system might be almost unusable. The OS jitter for 4.17.y and
> 4.18.-rcx kernels is off the charts, you can even spot it with top
> command (issued when the system is supposedly idle), e.g.
>
>  top - 14:44:24 up 2 min,  1 user,  load average: 90.11, 38.20, 14.38
>  Tasks: 1199 total, 109 running, 541 sleeping,   0 stopped,   0 zombie
>  %Cpu(s):  1.2 us, 58.7 sy,  0.0 ni, 39.3 id,  0.6 wa,  0.0 hi,  0.3 si,  0.0 st
>  KiB Mem:  13137064+total,  1192168 used, 13017848+free,     2340 buffers
>  KiB Swap:  2104316 total,        0 used,  2104316 free.   522296 cached Mem
>
>    PID USER      PR  NI    VIRT    RES    SHR S    %CPU  %MEM     TIME+ COMMAND
>   3373 root      20   0  982024  49916  36120 R  96.691 0.038   0:19.54 kubelet
>     67 root      20   0       0      0      0 R  78.676 0.000   0:49.36 kworker/9:0
>     25 root      20   0       0      0      0 R  78.125 0.000   0:49.67 kworker/2:0
>    182 root      20   0       0      0      0 R  75.735 0.000   1:18.17 kworker/28:0
>     43 root      20   0       0      0      0 R  75.000 0.000   0:11.56 kworker/5:0
>    103 root      20   0       0      0      0 R  74.449 0.000   0:46.83 kworker/15:0
>    334 root      20   0       0      0      0 R  72.978 0.000   1:06.88 kworker/53:0
>    789 root      20   0       0      0      0 R  69.853 0.000   1:29.50 kworker/38:1
>    418 root      20   0       0      0      0 R  69.301 0.000   0:41.33 kworker/67:0
>    779 root      20   0       0      0      0 R  68.934 0.000   1:33.60 kworker/27:1
>    773 root      20   0       0      0      0 R  68.566 0.000   1:37.91 kworker/22:1
>    762 root      20   0       0      0      0 R  68.015 0.000   1:41.01 kworker/11:1
>    769 root      20   0       0      0      0 R  67.647 0.000   1:37.65 kworker/18:1
>    805 root      20   0       0      0      0 R  67.096 0.000   1:30.96 kworker/54:1
>    840 root      20   0       0      0      0 R  66.912 0.000   1:23.82 kworker/89:1
>    812 root      20   0       0      0      0 R  66.728 0.000   1:31.89 kworker/59:1
>    847 root      20   0       0      0      0 R  66.360 0.000   1:28.40 kworker/96:1
>    763 root      20   0       0      0      0 R  66.176 0.000   1:42.57 kworker/12:1
>    772 root      20   0       0      0      0 R  66.176 0.000   1:12.58 kworker/21:1
>    821 root      20   0       0      0      0 R  66.176 0.000   1:29.62 kworker/69:1
>    923 root      20   0       0      0      0 R  65.809 0.000   1:44.32 kworker/3:18
>   1284 root      20   0       0      0      0 R  65.809 0.000   1:23.50 kworker/101:2
>     61 root      20   0       0      0      0 R  65.625 0.000   1:29.37 kworker/8:0
>   3531 root      20   0   24384   3768   2356 R  65.625 0.003   0:08.91 top
>    771 root      20   0       0      0      0 R  65.074 0.000   1:37.90 kworker/20:1
>    767 root      20   0       0      0      0 R  64.706 0.000   1:38.01 kworker/16:1
>    764 root      20   0       0      0      0 R  64.522 0.000   1:40.28 kworker/13:1
>    765 root      20   0       0      0      0 R  64.154 0.000   1:40.13 kworker/14:1
>
> When I apply below patch (trying to revert essential parts of commit
> 554c8aa8ecad) behaviour seems back to normal.

Well, that basically defeats the purpose of the change in commit
554c8aa8ecad, so it's not what I'd like to do to fix this problem.

Also it would be good to understand what actually happens.

> I know that pcc-cpufreq driver is not "state-of-the-art" when it comes
> to cpufreq drivers and you better not use it.

That's exactly right.

> But I wonder whether commit 554c8aa8ecad ("sched: idle: Select idle state before
> stopping the tick") introduced bad behaviour for other cases as well.

It has been tested quite extensively in that respect, although
admittedly not with the pcc-cpufreq driver.

Nothing bad related to it has been has been reported so far, FWIW.

> I'll send some performance results to illustrate the issue asap. I've
> also tried to modify pcc-cpufreq to reduce the amount of frequency
> changes triggered by this driver but this does not help for kernels
> where commit 554c8aa8ecad is applied.

Can you replace pcc-cpufreq with a different cpufreq driver on the
affected systems?  If so, do performance numbers look bad after that
too?

Thanks,
Rafael
Rafael J. Wysocki July 17, 2018, 8:03 a.m. UTC | #2
On Tue, Jul 17, 2018 at 9:33 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> Hi,
>
> Thanks for your report!
>
> On Tue, Jul 17, 2018 at 8:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote:
>> Hello,
>>
>> I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select
>> idle state before stopping the tick") causes severe performance drop
>> for systems using pcc-cpufreq driver. Depending on the number of CPUs
>> the system might be almost unusable. The OS jitter for 4.17.y and
>> 4.18.-rcx kernels is off the charts, you can even spot it with top
>> command (issued when the system is supposedly idle), e.g.
>>
>>  top - 14:44:24 up 2 min,  1 user,  load average: 90.11, 38.20, 14.38
>>  Tasks: 1199 total, 109 running, 541 sleeping,   0 stopped,   0 zombie
>>  %Cpu(s):  1.2 us, 58.7 sy,  0.0 ni, 39.3 id,  0.6 wa,  0.0 hi,  0.3 si,  0.0 st
>>  KiB Mem:  13137064+total,  1192168 used, 13017848+free,     2340 buffers
>>  KiB Swap:  2104316 total,        0 used,  2104316 free.   522296 cached Mem
>>
>>    PID USER      PR  NI    VIRT    RES    SHR S    %CPU  %MEM     TIME+ COMMAND
>>   3373 root      20   0  982024  49916  36120 R  96.691 0.038   0:19.54 kubelet
>>     67 root      20   0       0      0      0 R  78.676 0.000   0:49.36 kworker/9:0
>>     25 root      20   0       0      0      0 R  78.125 0.000   0:49.67 kworker/2:0
>>    182 root      20   0       0      0      0 R  75.735 0.000   1:18.17 kworker/28:0
>>     43 root      20   0       0      0      0 R  75.000 0.000   0:11.56 kworker/5:0
>>    103 root      20   0       0      0      0 R  74.449 0.000   0:46.83 kworker/15:0
>>    334 root      20   0       0      0      0 R  72.978 0.000   1:06.88 kworker/53:0
>>    789 root      20   0       0      0      0 R  69.853 0.000   1:29.50 kworker/38:1
>>    418 root      20   0       0      0      0 R  69.301 0.000   0:41.33 kworker/67:0
>>    779 root      20   0       0      0      0 R  68.934 0.000   1:33.60 kworker/27:1
>>    773 root      20   0       0      0      0 R  68.566 0.000   1:37.91 kworker/22:1
>>    762 root      20   0       0      0      0 R  68.015 0.000   1:41.01 kworker/11:1
>>    769 root      20   0       0      0      0 R  67.647 0.000   1:37.65 kworker/18:1
>>    805 root      20   0       0      0      0 R  67.096 0.000   1:30.96 kworker/54:1
>>    840 root      20   0       0      0      0 R  66.912 0.000   1:23.82 kworker/89:1
>>    812 root      20   0       0      0      0 R  66.728 0.000   1:31.89 kworker/59:1
>>    847 root      20   0       0      0      0 R  66.360 0.000   1:28.40 kworker/96:1
>>    763 root      20   0       0      0      0 R  66.176 0.000   1:42.57 kworker/12:1
>>    772 root      20   0       0      0      0 R  66.176 0.000   1:12.58 kworker/21:1
>>    821 root      20   0       0      0      0 R  66.176 0.000   1:29.62 kworker/69:1
>>    923 root      20   0       0      0      0 R  65.809 0.000   1:44.32 kworker/3:18
>>   1284 root      20   0       0      0      0 R  65.809 0.000   1:23.50 kworker/101:2
>>     61 root      20   0       0      0      0 R  65.625 0.000   1:29.37 kworker/8:0
>>   3531 root      20   0   24384   3768   2356 R  65.625 0.003   0:08.91 top
>>    771 root      20   0       0      0      0 R  65.074 0.000   1:37.90 kworker/20:1
>>    767 root      20   0       0      0      0 R  64.706 0.000   1:38.01 kworker/16:1
>>    764 root      20   0       0      0      0 R  64.522 0.000   1:40.28 kworker/13:1
>>    765 root      20   0       0      0      0 R  64.154 0.000   1:40.13 kworker/14:1
>>
>> When I apply below patch (trying to revert essential parts of commit
>> 554c8aa8ecad) behaviour seems back to normal.
>
> Well, that basically defeats the purpose of the change in commit
> 554c8aa8ecad, so it's not what I'd like to do to fix this problem.
>
> Also it would be good to understand what actually happens.
>
>> I know that pcc-cpufreq driver is not "state-of-the-art" when it comes
>> to cpufreq drivers and you better not use it.
>
> That's exactly right.
>
>> But I wonder whether commit 554c8aa8ecad ("sched: idle: Select idle state before
>> stopping the tick") introduced bad behaviour for other cases as well.
>
> It has been tested quite extensively in that respect, although
> admittedly not with the pcc-cpufreq driver.
>
> Nothing bad related to it has been has been reported so far, FWIW.
>
>> I'll send some performance results to illustrate the issue asap. I've
>> also tried to modify pcc-cpufreq to reduce the amount of frequency
>> changes triggered by this driver but this does not help for kernels
>> where commit 554c8aa8ecad is applied.
>
> Can you replace pcc-cpufreq with a different cpufreq driver on the
> affected systems?  If so, do performance numbers look bad after that
> too?

Also, what cpufreq governor do you use with pcc-cpufreq?  Does
changing it to something like "performance" improve things?
Daniel Lezcano July 17, 2018, 8:08 a.m. UTC | #3
On 17/07/2018 09:33, Rafael J. Wysocki wrote:
> Hi,
> 
> Thanks for your report!
> 
> On Tue, Jul 17, 2018 at 8:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote:
>> Hello,
>>
>> I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select
>> idle state before stopping the tick") causes severe performance drop
>> for systems using pcc-cpufreq driver. Depending on the number of CPUs
>> the system might be almost unusable. The OS jitter for 4.17.y and
>> 4.18.-rcx kernels is off the charts, you can even spot it with top
>> command (issued when the system is supposedly idle), e.g.
>>
>>  top - 14:44:24 up 2 min,  1 user,  load average: 90.11, 38.20, 14.38
>>  Tasks: 1199 total, 109 running, 541 sleeping,   0 stopped,   0 zombie
>>  %Cpu(s):  1.2 us, 58.7 sy,  0.0 ni, 39.3 id,  0.6 wa,  0.0 hi,  0.3 si,  0.0 st
>>  KiB Mem:  13137064+total,  1192168 used, 13017848+free,     2340 buffers
>>  KiB Swap:  2104316 total,        0 used,  2104316 free.   522296 cached Mem
>>
>>    PID USER      PR  NI    VIRT    RES    SHR S    %CPU  %MEM     TIME+ COMMAND
>>   3373 root      20   0  982024  49916  36120 R  96.691 0.038   0:19.54 kubelet
>>     67 root      20   0       0      0      0 R  78.676 0.000   0:49.36 kworker/9:0
>>     25 root      20   0       0      0      0 R  78.125 0.000   0:49.67 kworker/2:0
>>    182 root      20   0       0      0      0 R  75.735 0.000   1:18.17 kworker/28:0
>>     43 root      20   0       0      0      0 R  75.000 0.000   0:11.56 kworker/5:0
>>    103 root      20   0       0      0      0 R  74.449 0.000   0:46.83 kworker/15:0
>>    334 root      20   0       0      0      0 R  72.978 0.000   1:06.88 kworker/53:0
>>    789 root      20   0       0      0      0 R  69.853 0.000   1:29.50 kworker/38:1
>>    418 root      20   0       0      0      0 R  69.301 0.000   0:41.33 kworker/67:0
>>    779 root      20   0       0      0      0 R  68.934 0.000   1:33.60 kworker/27:1
>>    773 root      20   0       0      0      0 R  68.566 0.000   1:37.91 kworker/22:1
>>    762 root      20   0       0      0      0 R  68.015 0.000   1:41.01 kworker/11:1
>>    769 root      20   0       0      0      0 R  67.647 0.000   1:37.65 kworker/18:1
>>    805 root      20   0       0      0      0 R  67.096 0.000   1:30.96 kworker/54:1
>>    840 root      20   0       0      0      0 R  66.912 0.000   1:23.82 kworker/89:1
>>    812 root      20   0       0      0      0 R  66.728 0.000   1:31.89 kworker/59:1
>>    847 root      20   0       0      0      0 R  66.360 0.000   1:28.40 kworker/96:1
>>    763 root      20   0       0      0      0 R  66.176 0.000   1:42.57 kworker/12:1
>>    772 root      20   0       0      0      0 R  66.176 0.000   1:12.58 kworker/21:1
>>    821 root      20   0       0      0      0 R  66.176 0.000   1:29.62 kworker/69:1
>>    923 root      20   0       0      0      0 R  65.809 0.000   1:44.32 kworker/3:18
>>   1284 root      20   0       0      0      0 R  65.809 0.000   1:23.50 kworker/101:2
>>     61 root      20   0       0      0      0 R  65.625 0.000   1:29.37 kworker/8:0
>>   3531 root      20   0   24384   3768   2356 R  65.625 0.003   0:08.91 top
>>    771 root      20   0       0      0      0 R  65.074 0.000   1:37.90 kworker/20:1
>>    767 root      20   0       0      0      0 R  64.706 0.000   1:38.01 kworker/16:1
>>    764 root      20   0       0      0      0 R  64.522 0.000   1:40.28 kworker/13:1
>>    765 root      20   0       0      0      0 R  64.154 0.000   1:40.13 kworker/14:1
>>
>> When I apply below patch (trying to revert essential parts of commit
>> 554c8aa8ecad) behaviour seems back to normal.
> 
> Well, that basically defeats the purpose of the change in commit
> 554c8aa8ecad, so it's not what I'd like to do to fix this problem.
> 
> Also it would be good to understand what actually happens.
> 
>> I know that pcc-cpufreq driver is not "state-of-the-art" when it comes
>> to cpufreq drivers and you better not use it.
> 
> That's exactly right.
> 
>> But I wonder whether commit 554c8aa8ecad ("sched: idle: Select idle state before
>> stopping the tick") introduced bad behaviour for other cases as well.
> 
> It has been tested quite extensively in that respect, although
> admittedly not with the pcc-cpufreq driver.
> 
> Nothing bad related to it has been has been reported so far, FWIW.

I confirm we benchmarked on ARM64 and we did not notice a big
performance drop expect a few percent due to more cluster idle states
which is logical and compensated by a big energy improvement.

>> I'll send some performance results to illustrate the issue asap. I've
>> also tried to modify pcc-cpufreq to reduce the amount of frequency
>> changes triggered by this driver but this does not help for kernels
>> where commit 554c8aa8ecad is applied.
> 
> Can you replace pcc-cpufreq with a different cpufreq driver on the
> affected systems?  If so, do performance numbers look bad after that
> too?
> 
> Thanks,
> Rafael
>
Peter Zijlstra July 17, 2018, 8:15 a.m. UTC | #4
On Tue, Jul 17, 2018 at 08:50:48AM +0200, Andreas Herrmann wrote:
> I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select
> idle state before stopping the tick") causes severe performance drop
> for systems using pcc-cpufreq driver. Depending on the number of CPUs
> the system might be almost unusable. 

Have you looked at that driver? It features a global lock which is taken
for all cpufreq operations. Once you get past a handful of cpus that
will constrict your system exactly as you say.

What kind of machine are you running this on, and can you configure the
thing to use anything else? If so, do so and never look back.
Andreas Herrmann July 17, 2018, 8:36 a.m. UTC | #5
On Tue, Jul 17, 2018 at 09:33:53AM +0200, Rafael J. Wysocki wrote:
> Hi,
> 
> Thanks for your report!
> 
> On Tue, Jul 17, 2018 at 8:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote:
> > Hello,
> >
> > I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select
> > idle state before stopping the tick") causes severe performance drop
> > for systems using pcc-cpufreq driver. Depending on the number of CPUs
> > the system might be almost unusable. The OS jitter for 4.17.y and
> > 4.18.-rcx kernels is off the charts, you can even spot it with top
> > command (issued when the system is supposedly idle), e.g.
> >
> >  top - 14:44:24 up 2 min,  1 user,  load average: 90.11, 38.20, 14.38
> >  Tasks: 1199 total, 109 running, 541 sleeping,   0 stopped,   0 zombie
> >  %Cpu(s):  1.2 us, 58.7 sy,  0.0 ni, 39.3 id,  0.6 wa,  0.0 hi,  0.3 si,  0.0 st
> >  KiB Mem:  13137064+total,  1192168 used, 13017848+free,     2340 buffers
> >  KiB Swap:  2104316 total,        0 used,  2104316 free.   522296 cached Mem
> >
> >    PID USER      PR  NI    VIRT    RES    SHR S    %CPU  %MEM     TIME+ COMMAND
> >   3373 root      20   0  982024  49916  36120 R  96.691 0.038   0:19.54 kubelet
> >     67 root      20   0       0      0      0 R  78.676 0.000   0:49.36 kworker/9:0
> >     25 root      20   0       0      0      0 R  78.125 0.000   0:49.67 kworker/2:0
> >    182 root      20   0       0      0      0 R  75.735 0.000   1:18.17 kworker/28:0
> >     43 root      20   0       0      0      0 R  75.000 0.000   0:11.56 kworker/5:0
> >    103 root      20   0       0      0      0 R  74.449 0.000   0:46.83 kworker/15:0
> >    334 root      20   0       0      0      0 R  72.978 0.000   1:06.88 kworker/53:0
> >    789 root      20   0       0      0      0 R  69.853 0.000   1:29.50 kworker/38:1
> >    418 root      20   0       0      0      0 R  69.301 0.000   0:41.33 kworker/67:0
> >    779 root      20   0       0      0      0 R  68.934 0.000   1:33.60 kworker/27:1
> >    773 root      20   0       0      0      0 R  68.566 0.000   1:37.91 kworker/22:1
> >    762 root      20   0       0      0      0 R  68.015 0.000   1:41.01 kworker/11:1
> >    769 root      20   0       0      0      0 R  67.647 0.000   1:37.65 kworker/18:1
> >    805 root      20   0       0      0      0 R  67.096 0.000   1:30.96 kworker/54:1
> >    840 root      20   0       0      0      0 R  66.912 0.000   1:23.82 kworker/89:1
> >    812 root      20   0       0      0      0 R  66.728 0.000   1:31.89 kworker/59:1
> >    847 root      20   0       0      0      0 R  66.360 0.000   1:28.40 kworker/96:1
> >    763 root      20   0       0      0      0 R  66.176 0.000   1:42.57 kworker/12:1
> >    772 root      20   0       0      0      0 R  66.176 0.000   1:12.58 kworker/21:1
> >    821 root      20   0       0      0      0 R  66.176 0.000   1:29.62 kworker/69:1
> >    923 root      20   0       0      0      0 R  65.809 0.000   1:44.32 kworker/3:18
> >   1284 root      20   0       0      0      0 R  65.809 0.000   1:23.50 kworker/101:2
> >     61 root      20   0       0      0      0 R  65.625 0.000   1:29.37 kworker/8:0
> >   3531 root      20   0   24384   3768   2356 R  65.625 0.003   0:08.91 top
> >    771 root      20   0       0      0      0 R  65.074 0.000   1:37.90 kworker/20:1
> >    767 root      20   0       0      0      0 R  64.706 0.000   1:38.01 kworker/16:1
> >    764 root      20   0       0      0      0 R  64.522 0.000   1:40.28 kworker/13:1
> >    765 root      20   0       0      0      0 R  64.154 0.000   1:40.13 kworker/14:1
> >
> > When I apply below patch (trying to revert essential parts of commit
> > 554c8aa8ecad) behaviour seems back to normal.
> 
> Well, that basically defeats the purpose of the change in commit
> 554c8aa8ecad, so it's not what I'd like to do to fix this problem.
> 
> Also it would be good to understand what actually happens.
> 
> > I know that pcc-cpufreq driver is not "state-of-the-art" when it comes
> > to cpufreq drivers and you better not use it.
> 
> That's exactly right.
> 
> > But I wonder whether commit 554c8aa8ecad ("sched: idle: Select idle state before
> > stopping the tick") introduced bad behaviour for other cases as well.
> 
> It has been tested quite extensively in that respect, although
> admittedly not with the pcc-cpufreq driver.
> 
> Nothing bad related to it has been has been reported so far, FWIW.
> 
> > I'll send some performance results to illustrate the issue asap. I've
> > also tried to modify pcc-cpufreq to reduce the amount of frequency
> > changes triggered by this driver but this does not help for kernels
> > where commit 554c8aa8ecad is applied.
> 
> Can you replace pcc-cpufreq with a different cpufreq driver on the
> affected systems?  If so, do performance numbers look bad after that
> too?

I have no performance numbers yet for other cpufreq drivers on this
system (checking this commit).  But I'll look it at next.


Andreas
Andreas Herrmann July 17, 2018, 8:50 a.m. UTC | #6
On Tue, Jul 17, 2018 at 10:03:41AM +0200, Rafael J. Wysocki wrote:
> On Tue, Jul 17, 2018 at 9:33 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > Hi,
> >
> > Thanks for your report!
> >
> > On Tue, Jul 17, 2018 at 8:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote:
> >> Hello,
> >>
> >> I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select
> >> idle state before stopping the tick") causes severe performance drop
> >> for systems using pcc-cpufreq driver. Depending on the number of CPUs
> >> the system might be almost unusable. The OS jitter for 4.17.y and
> >> 4.18.-rcx kernels is off the charts, you can even spot it with top
> >> command (issued when the system is supposedly idle), e.g.
> >>
> >>  top - 14:44:24 up 2 min,  1 user,  load average: 90.11, 38.20, 14.38
> >>  Tasks: 1199 total, 109 running, 541 sleeping,   0 stopped,   0 zombie
> >>  %Cpu(s):  1.2 us, 58.7 sy,  0.0 ni, 39.3 id,  0.6 wa,  0.0 hi,  0.3 si,  0.0 st
> >>  KiB Mem:  13137064+total,  1192168 used, 13017848+free,     2340 buffers
> >>  KiB Swap:  2104316 total,        0 used,  2104316 free.   522296 cached Mem
> >>
> >>    PID USER      PR  NI    VIRT    RES    SHR S    %CPU  %MEM     TIME+ COMMAND
> >>   3373 root      20   0  982024  49916  36120 R  96.691 0.038   0:19.54 kubelet
> >>     67 root      20   0       0      0      0 R  78.676 0.000   0:49.36 kworker/9:0
> >>     25 root      20   0       0      0      0 R  78.125 0.000   0:49.67 kworker/2:0
> >>    182 root      20   0       0      0      0 R  75.735 0.000   1:18.17 kworker/28:0
> >>     43 root      20   0       0      0      0 R  75.000 0.000   0:11.56 kworker/5:0
> >>    103 root      20   0       0      0      0 R  74.449 0.000   0:46.83 kworker/15:0
> >>    334 root      20   0       0      0      0 R  72.978 0.000   1:06.88 kworker/53:0
> >>    789 root      20   0       0      0      0 R  69.853 0.000   1:29.50 kworker/38:1
> >>    418 root      20   0       0      0      0 R  69.301 0.000   0:41.33 kworker/67:0
> >>    779 root      20   0       0      0      0 R  68.934 0.000   1:33.60 kworker/27:1
> >>    773 root      20   0       0      0      0 R  68.566 0.000   1:37.91 kworker/22:1
> >>    762 root      20   0       0      0      0 R  68.015 0.000   1:41.01 kworker/11:1
> >>    769 root      20   0       0      0      0 R  67.647 0.000   1:37.65 kworker/18:1
> >>    805 root      20   0       0      0      0 R  67.096 0.000   1:30.96 kworker/54:1
> >>    840 root      20   0       0      0      0 R  66.912 0.000   1:23.82 kworker/89:1
> >>    812 root      20   0       0      0      0 R  66.728 0.000   1:31.89 kworker/59:1
> >>    847 root      20   0       0      0      0 R  66.360 0.000   1:28.40 kworker/96:1
> >>    763 root      20   0       0      0      0 R  66.176 0.000   1:42.57 kworker/12:1
> >>    772 root      20   0       0      0      0 R  66.176 0.000   1:12.58 kworker/21:1
> >>    821 root      20   0       0      0      0 R  66.176 0.000   1:29.62 kworker/69:1
> >>    923 root      20   0       0      0      0 R  65.809 0.000   1:44.32 kworker/3:18
> >>   1284 root      20   0       0      0      0 R  65.809 0.000   1:23.50 kworker/101:2
> >>     61 root      20   0       0      0      0 R  65.625 0.000   1:29.37 kworker/8:0
> >>   3531 root      20   0   24384   3768   2356 R  65.625 0.003   0:08.91 top
> >>    771 root      20   0       0      0      0 R  65.074 0.000   1:37.90 kworker/20:1
> >>    767 root      20   0       0      0      0 R  64.706 0.000   1:38.01 kworker/16:1
> >>    764 root      20   0       0      0      0 R  64.522 0.000   1:40.28 kworker/13:1
> >>    765 root      20   0       0      0      0 R  64.154 0.000   1:40.13 kworker/14:1
> >>
> >> When I apply below patch (trying to revert essential parts of commit
> >> 554c8aa8ecad) behaviour seems back to normal.
> >
> > Well, that basically defeats the purpose of the change in commit
> > 554c8aa8ecad, so it's not what I'd like to do to fix this problem.
> >
> > Also it would be good to understand what actually happens.
> >
> >> I know that pcc-cpufreq driver is not "state-of-the-art" when it comes
> >> to cpufreq drivers and you better not use it.
> >
> > That's exactly right.
> >
> >> But I wonder whether commit 554c8aa8ecad ("sched: idle: Select idle state before
> >> stopping the tick") introduced bad behaviour for other cases as well.
> >
> > It has been tested quite extensively in that respect, although
> > admittedly not with the pcc-cpufreq driver.
> >
> > Nothing bad related to it has been has been reported so far, FWIW.
> >
> >> I'll send some performance results to illustrate the issue asap. I've
> >> also tried to modify pcc-cpufreq to reduce the amount of frequency
> >> changes triggered by this driver but this does not help for kernels
> >> where commit 554c8aa8ecad is applied.
> >
> > Can you replace pcc-cpufreq with a different cpufreq driver on the
> > affected systems?  If so, do performance numbers look bad after that
> > too?
> 
> Also, what cpufreq governor do you use with pcc-cpufreq?

Ondemand governor. Which triggers a lot of PCC related platform calls.
And as Peter noticed already the driver has a severe bottleneck (lock
protecting shared memory used for all CPUs to pass data to/from
platform for PCC calls).

> Does changing it to something like "performance" improve things?

With performance governor above mentioned bottleneck is no issue.

On balance before this commit users could use pcc-cpufreq but had
already suboptimal performance (compared to say intel_pstate driver
which can be used changing BIOS options). Starting with this commit
systems using pcc-cpufreq are unusable with high number of CPUs (top
output above is for system with 120 CPUs).

So should the driver be removed (sooner or later), or this behaviour
be documented somewhere, or just leave it as is.


Andreas
Rafael J. Wysocki July 17, 2018, 8:52 a.m. UTC | #7
On Tue, Jul 17, 2018 at 10:36 AM, Andreas Herrmann <aherrmann@suse.com> wrote:
> On Tue, Jul 17, 2018 at 09:33:53AM +0200, Rafael J. Wysocki wrote:
>> Hi,
>>
>> Thanks for your report!
>>
>> On Tue, Jul 17, 2018 at 8:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote:
>> > Hello,
>> >
>> > I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select
>> > idle state before stopping the tick") causes severe performance drop
>> > for systems using pcc-cpufreq driver. Depending on the number of CPUs
>> > the system might be almost unusable. The OS jitter for 4.17.y and
>> > 4.18.-rcx kernels is off the charts, you can even spot it with top
>> > command (issued when the system is supposedly idle), e.g.
>> >
>> >  top - 14:44:24 up 2 min,  1 user,  load average: 90.11, 38.20, 14.38
>> >  Tasks: 1199 total, 109 running, 541 sleeping,   0 stopped,   0 zombie
>> >  %Cpu(s):  1.2 us, 58.7 sy,  0.0 ni, 39.3 id,  0.6 wa,  0.0 hi,  0.3 si,  0.0 st
>> >  KiB Mem:  13137064+total,  1192168 used, 13017848+free,     2340 buffers
>> >  KiB Swap:  2104316 total,        0 used,  2104316 free.   522296 cached Mem
>> >
>> >    PID USER      PR  NI    VIRT    RES    SHR S    %CPU  %MEM     TIME+ COMMAND
>> >   3373 root      20   0  982024  49916  36120 R  96.691 0.038   0:19.54 kubelet
>> >     67 root      20   0       0      0      0 R  78.676 0.000   0:49.36 kworker/9:0
>> >     25 root      20   0       0      0      0 R  78.125 0.000   0:49.67 kworker/2:0
>> >    182 root      20   0       0      0      0 R  75.735 0.000   1:18.17 kworker/28:0
>> >     43 root      20   0       0      0      0 R  75.000 0.000   0:11.56 kworker/5:0
>> >    103 root      20   0       0      0      0 R  74.449 0.000   0:46.83 kworker/15:0
>> >    334 root      20   0       0      0      0 R  72.978 0.000   1:06.88 kworker/53:0
>> >    789 root      20   0       0      0      0 R  69.853 0.000   1:29.50 kworker/38:1
>> >    418 root      20   0       0      0      0 R  69.301 0.000   0:41.33 kworker/67:0
>> >    779 root      20   0       0      0      0 R  68.934 0.000   1:33.60 kworker/27:1
>> >    773 root      20   0       0      0      0 R  68.566 0.000   1:37.91 kworker/22:1
>> >    762 root      20   0       0      0      0 R  68.015 0.000   1:41.01 kworker/11:1
>> >    769 root      20   0       0      0      0 R  67.647 0.000   1:37.65 kworker/18:1
>> >    805 root      20   0       0      0      0 R  67.096 0.000   1:30.96 kworker/54:1
>> >    840 root      20   0       0      0      0 R  66.912 0.000   1:23.82 kworker/89:1
>> >    812 root      20   0       0      0      0 R  66.728 0.000   1:31.89 kworker/59:1
>> >    847 root      20   0       0      0      0 R  66.360 0.000   1:28.40 kworker/96:1
>> >    763 root      20   0       0      0      0 R  66.176 0.000   1:42.57 kworker/12:1
>> >    772 root      20   0       0      0      0 R  66.176 0.000   1:12.58 kworker/21:1
>> >    821 root      20   0       0      0      0 R  66.176 0.000   1:29.62 kworker/69:1
>> >    923 root      20   0       0      0      0 R  65.809 0.000   1:44.32 kworker/3:18
>> >   1284 root      20   0       0      0      0 R  65.809 0.000   1:23.50 kworker/101:2
>> >     61 root      20   0       0      0      0 R  65.625 0.000   1:29.37 kworker/8:0
>> >   3531 root      20   0   24384   3768   2356 R  65.625 0.003   0:08.91 top
>> >    771 root      20   0       0      0      0 R  65.074 0.000   1:37.90 kworker/20:1
>> >    767 root      20   0       0      0      0 R  64.706 0.000   1:38.01 kworker/16:1
>> >    764 root      20   0       0      0      0 R  64.522 0.000   1:40.28 kworker/13:1
>> >    765 root      20   0       0      0      0 R  64.154 0.000   1:40.13 kworker/14:1
>> >
>> > When I apply below patch (trying to revert essential parts of commit
>> > 554c8aa8ecad) behaviour seems back to normal.
>>
>> Well, that basically defeats the purpose of the change in commit
>> 554c8aa8ecad, so it's not what I'd like to do to fix this problem.
>>
>> Also it would be good to understand what actually happens.
>>
>> > I know that pcc-cpufreq driver is not "state-of-the-art" when it comes
>> > to cpufreq drivers and you better not use it.
>>
>> That's exactly right.
>>
>> > But I wonder whether commit 554c8aa8ecad ("sched: idle: Select idle state before
>> > stopping the tick") introduced bad behaviour for other cases as well.
>>
>> It has been tested quite extensively in that respect, although
>> admittedly not with the pcc-cpufreq driver.
>>
>> Nothing bad related to it has been has been reported so far, FWIW.
>>
>> > I'll send some performance results to illustrate the issue asap. I've
>> > also tried to modify pcc-cpufreq to reduce the amount of frequency
>> > changes triggered by this driver but this does not help for kernels
>> > where commit 554c8aa8ecad is applied.
>>
>> Can you replace pcc-cpufreq with a different cpufreq driver on the
>> affected systems?  If so, do performance numbers look bad after that
>> too?
>
> I have no performance numbers yet for other cpufreq drivers on this
> system (checking this commit).  But I'll look it at next.

Thanks!

Generally speaking, pcc-cpufreq is fundamentally not scalable, so the
additional concurrency brought in by the commit in question may have
exposed that weakness if that driver is run on a system with multiple
logical CPUs.
Rafael J. Wysocki July 17, 2018, 8:58 a.m. UTC | #8
On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote:
> On Tue, Jul 17, 2018 at 10:03:41AM +0200, Rafael J. Wysocki wrote:
>> On Tue, Jul 17, 2018 at 9:33 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
>> > Hi,
>> >
>> > Thanks for your report!
>> >
>> > On Tue, Jul 17, 2018 at 8:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote:
>> >> Hello,
>> >>
>> >> I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select
>> >> idle state before stopping the tick") causes severe performance drop
>> >> for systems using pcc-cpufreq driver. Depending on the number of CPUs
>> >> the system might be almost unusable. The OS jitter for 4.17.y and
>> >> 4.18.-rcx kernels is off the charts, you can even spot it with top
>> >> command (issued when the system is supposedly idle), e.g.
>> >>
>> >>  top - 14:44:24 up 2 min,  1 user,  load average: 90.11, 38.20, 14.38
>> >>  Tasks: 1199 total, 109 running, 541 sleeping,   0 stopped,   0 zombie
>> >>  %Cpu(s):  1.2 us, 58.7 sy,  0.0 ni, 39.3 id,  0.6 wa,  0.0 hi,  0.3 si,  0.0 st
>> >>  KiB Mem:  13137064+total,  1192168 used, 13017848+free,     2340 buffers
>> >>  KiB Swap:  2104316 total,        0 used,  2104316 free.   522296 cached Mem
>> >>
>> >>    PID USER      PR  NI    VIRT    RES    SHR S    %CPU  %MEM     TIME+ COMMAND
>> >>   3373 root      20   0  982024  49916  36120 R  96.691 0.038   0:19.54 kubelet
>> >>     67 root      20   0       0      0      0 R  78.676 0.000   0:49.36 kworker/9:0
>> >>     25 root      20   0       0      0      0 R  78.125 0.000   0:49.67 kworker/2:0
>> >>    182 root      20   0       0      0      0 R  75.735 0.000   1:18.17 kworker/28:0
>> >>     43 root      20   0       0      0      0 R  75.000 0.000   0:11.56 kworker/5:0
>> >>    103 root      20   0       0      0      0 R  74.449 0.000   0:46.83 kworker/15:0
>> >>    334 root      20   0       0      0      0 R  72.978 0.000   1:06.88 kworker/53:0
>> >>    789 root      20   0       0      0      0 R  69.853 0.000   1:29.50 kworker/38:1
>> >>    418 root      20   0       0      0      0 R  69.301 0.000   0:41.33 kworker/67:0
>> >>    779 root      20   0       0      0      0 R  68.934 0.000   1:33.60 kworker/27:1
>> >>    773 root      20   0       0      0      0 R  68.566 0.000   1:37.91 kworker/22:1
>> >>    762 root      20   0       0      0      0 R  68.015 0.000   1:41.01 kworker/11:1
>> >>    769 root      20   0       0      0      0 R  67.647 0.000   1:37.65 kworker/18:1
>> >>    805 root      20   0       0      0      0 R  67.096 0.000   1:30.96 kworker/54:1
>> >>    840 root      20   0       0      0      0 R  66.912 0.000   1:23.82 kworker/89:1
>> >>    812 root      20   0       0      0      0 R  66.728 0.000   1:31.89 kworker/59:1
>> >>    847 root      20   0       0      0      0 R  66.360 0.000   1:28.40 kworker/96:1
>> >>    763 root      20   0       0      0      0 R  66.176 0.000   1:42.57 kworker/12:1
>> >>    772 root      20   0       0      0      0 R  66.176 0.000   1:12.58 kworker/21:1
>> >>    821 root      20   0       0      0      0 R  66.176 0.000   1:29.62 kworker/69:1
>> >>    923 root      20   0       0      0      0 R  65.809 0.000   1:44.32 kworker/3:18
>> >>   1284 root      20   0       0      0      0 R  65.809 0.000   1:23.50 kworker/101:2
>> >>     61 root      20   0       0      0      0 R  65.625 0.000   1:29.37 kworker/8:0
>> >>   3531 root      20   0   24384   3768   2356 R  65.625 0.003   0:08.91 top
>> >>    771 root      20   0       0      0      0 R  65.074 0.000   1:37.90 kworker/20:1
>> >>    767 root      20   0       0      0      0 R  64.706 0.000   1:38.01 kworker/16:1
>> >>    764 root      20   0       0      0      0 R  64.522 0.000   1:40.28 kworker/13:1
>> >>    765 root      20   0       0      0      0 R  64.154 0.000   1:40.13 kworker/14:1
>> >>
>> >> When I apply below patch (trying to revert essential parts of commit
>> >> 554c8aa8ecad) behaviour seems back to normal.
>> >
>> > Well, that basically defeats the purpose of the change in commit
>> > 554c8aa8ecad, so it's not what I'd like to do to fix this problem.
>> >
>> > Also it would be good to understand what actually happens.
>> >
>> >> I know that pcc-cpufreq driver is not "state-of-the-art" when it comes
>> >> to cpufreq drivers and you better not use it.
>> >
>> > That's exactly right.
>> >
>> >> But I wonder whether commit 554c8aa8ecad ("sched: idle: Select idle state before
>> >> stopping the tick") introduced bad behaviour for other cases as well.
>> >
>> > It has been tested quite extensively in that respect, although
>> > admittedly not with the pcc-cpufreq driver.
>> >
>> > Nothing bad related to it has been has been reported so far, FWIW.
>> >
>> >> I'll send some performance results to illustrate the issue asap. I've
>> >> also tried to modify pcc-cpufreq to reduce the amount of frequency
>> >> changes triggered by this driver but this does not help for kernels
>> >> where commit 554c8aa8ecad is applied.
>> >
>> > Can you replace pcc-cpufreq with a different cpufreq driver on the
>> > affected systems?  If so, do performance numbers look bad after that
>> > too?
>>
>> Also, what cpufreq governor do you use with pcc-cpufreq?
>
> Ondemand governor. Which triggers a lot of PCC related platform calls.
> And as Peter noticed already the driver has a severe bottleneck (lock
> protecting shared memory used for all CPUs to pass data to/from
> platform for PCC calls).
>
>> Does changing it to something like "performance" improve things?
>
> With performance governor above mentioned bottleneck is no issue.

OK

> On balance before this commit users could use pcc-cpufreq but had
> already suboptimal performance (compared to say intel_pstate driver
> which can be used changing BIOS options). Starting with this commit
> systems using pcc-cpufreq are unusable with high number of CPUs (top
> output above is for system with 120 CPUs).

I see. :-)

> So should the driver be removed (sooner or later), or this behaviour
> be documented somewhere, or just leave it as is.

At least it should be documented in the driver that it is not scalable
and not for use on many-CPU systems (with "many" meaning anything
greater than 4 probably).

Or we could just make the driver not load if the number of CPUs in the
system is greater than 4 or similar.
Andreas Herrmann July 17, 2018, 9:05 a.m. UTC | #9
On Tue, Jul 17, 2018 at 10:15:07AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 17, 2018 at 08:50:48AM +0200, Andreas Herrmann wrote:
> > I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select
> > idle state before stopping the tick") causes severe performance drop
> > for systems using pcc-cpufreq driver. Depending on the number of CPUs
> > the system might be almost unusable. 
> 
> Have you looked at that driver? It features a global lock which is taken
> for all cpufreq operations. Once you get past a handful of cpus that
> will constrict your system exactly as you say.

Yes, I figured that already and you better use another cpufreq driver
on those systems. (But that such systems with this driver are almost
unusable is a new feature ;-)

> What kind of machine are you running this on, and can you configure the
> thing to use anything else?

Its an older system.

HP ProLiant DL580 Gen8, CPUs are Intel(R) Xeon(R) CPU E7-4890
4 nodes, With HT enabled 120 logical CPUs.

Yes you can configure it using something else.

For the record. On this and similar systems pcc-cpufreq driver is used
when BIOS setting for power management option "Power Regulator" is set
to

  "Dynamic Power Savings Mode"

AFAIK this is the default setting for this option. When changing this
setting to "OS Control Mode" intel_pstate driver should be loaded.

> If so, do so and never look back.

Ok, I am fine with this.


Thanks,

Andreas
Rafael J. Wysocki July 17, 2018, 9:06 a.m. UTC | #10
On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote:

[cut]

>
> On balance before this commit users could use pcc-cpufreq but had
> already suboptimal performance (compared to say intel_pstate driver
> which can be used changing BIOS options).

BTW, I wonder why you need to change the BIOS options for intel_pstate to load.

It should be initialized before pcc-cpufreq (according to their
respective initcall levels), so in theory intel_pstate should be used
by default on the affected systems anyway.

What BIOS settings need to be changed for that?
Andreas Herrmann July 17, 2018, 9:11 a.m. UTC | #11
On Tue, Jul 17, 2018 at 11:06:29AM +0200, Rafael J. Wysocki wrote:
> On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote:
> 
> [cut]
> 
> >
> > On balance before this commit users could use pcc-cpufreq but had
> > already suboptimal performance (compared to say intel_pstate driver
> > which can be used changing BIOS options).
> 
> BTW, I wonder why you need to change the BIOS options for intel_pstate to load.

I think this is because of (in intel_pstate_init()):

        /*
         * The Intel pstate driver will be ignored if the platform
         * firmware has its own power management modes.
         */
        if (intel_pstate_platform_pwr_mgmt_exists())
                return -ENODEV;


> It should be initialized before pcc-cpufreq (according to their
> respective initcall levels), so in theory intel_pstate should be used
> by default on the affected systems anyway.

> What BIOS settings need to be changed for that?

Already answered in other mail.


Andreas
Rafael J. Wysocki July 17, 2018, 9:23 a.m. UTC | #12
On Tue, Jul 17, 2018 at 11:11 AM, Andreas Herrmann <aherrmann@suse.com> wrote:
> On Tue, Jul 17, 2018 at 11:06:29AM +0200, Rafael J. Wysocki wrote:
>> On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote:
>>
>> [cut]
>>
>> >
>> > On balance before this commit users could use pcc-cpufreq but had
>> > already suboptimal performance (compared to say intel_pstate driver
>> > which can be used changing BIOS options).
>>
>> BTW, I wonder why you need to change the BIOS options for intel_pstate to load.
>
> I think this is because of (in intel_pstate_init()):
>
>         /*
>          * The Intel pstate driver will be ignored if the platform
>          * firmware has its own power management modes.
>          */
>         if (intel_pstate_platform_pwr_mgmt_exists())
>                 return -ENODEV;
>

OK, because of the "Proliant" entry, right?

So it looks like we have an issue there.  We find the entry and we
look for _PSS.  It is not there, so we assume that the firmware is
expected to control performance, which is not the case.  It looks like
we also should look for the presence of the PCC interface in there.

I can provide a patch for that, will you be able to test it?

>> It should be initialized before pcc-cpufreq (according to their
>> respective initcall levels), so in theory intel_pstate should be used
>> by default on the affected systems anyway.
>
>> What BIOS settings need to be changed for that?
>
> Already answered in other mail.

Indeed.
Andreas Herrmann July 17, 2018, 9:27 a.m. UTC | #13
On Tue, Jul 17, 2018 at 11:23:25AM +0200, Rafael J. Wysocki wrote:
> On Tue, Jul 17, 2018 at 11:11 AM, Andreas Herrmann <aherrmann@suse.com> wrote:
> > On Tue, Jul 17, 2018 at 11:06:29AM +0200, Rafael J. Wysocki wrote:
> >> On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote:
> >>
> >> [cut]
> >>
> >> >
> >> > On balance before this commit users could use pcc-cpufreq but had
> >> > already suboptimal performance (compared to say intel_pstate driver
> >> > which can be used changing BIOS options).
> >>
> >> BTW, I wonder why you need to change the BIOS options for intel_pstate to load.
> >
> > I think this is because of (in intel_pstate_init()):
> >
> >         /*
> >          * The Intel pstate driver will be ignored if the platform
> >          * firmware has its own power management modes.
> >          */
> >         if (intel_pstate_platform_pwr_mgmt_exists())
> >                 return -ENODEV;
> >
> 
> OK, because of the "Proliant" entry, right?
> 
> So it looks like we have an issue there.  We find the entry and we
> look for _PSS.  It is not there, so we assume that the firmware is
> expected to control performance, which is not the case.  It looks like
> we also should look for the presence of the PCC interface in there.
> 
> I can provide a patch for that, will you be able to test it?

Yes, I can test it.

> >> It should be initialized before pcc-cpufreq (according to their
> >> respective initcall levels), so in theory intel_pstate should be used
> >> by default on the affected systems anyway.
> >
> >> What BIOS settings need to be changed for that?
> >
> > Already answered in other mail.
> 
> Indeed.


Andreas
Andreas Herrmann July 17, 2018, 9:36 a.m. UTC | #14
On Tue, Jul 17, 2018 at 11:27:21AM +0200, Andreas Herrmann wrote:
> On Tue, Jul 17, 2018 at 11:23:25AM +0200, Rafael J. Wysocki wrote:
> > On Tue, Jul 17, 2018 at 11:11 AM, Andreas Herrmann <aherrmann@suse.com> wrote:
> > > On Tue, Jul 17, 2018 at 11:06:29AM +0200, Rafael J. Wysocki wrote:
> > >> On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote:
> > >>
> > >> [cut]
> > >>
> > >> >
> > >> > On balance before this commit users could use pcc-cpufreq but had
> > >> > already suboptimal performance (compared to say intel_pstate driver
> > >> > which can be used changing BIOS options).
> > >>
> > >> BTW, I wonder why you need to change the BIOS options for intel_pstate to load.
> > >
> > > I think this is because of (in intel_pstate_init()):
> > >
> > >         /*
> > >          * The Intel pstate driver will be ignored if the platform
> > >          * firmware has its own power management modes.
> > >          */
> > >         if (intel_pstate_platform_pwr_mgmt_exists())
> > >                 return -ENODEV;
> > >
> > 
> > OK, because of the "Proliant" entry, right?
> > 
> > So it looks like we have an issue there.  We find the entry and we
> > look for _PSS.  It is not there, so we assume that the firmware is
> > expected to control performance, which is not the case.

FYI, there is another BIOS setting on those systems. It's called
"Collaborative Power Control" (AFAIK enabled by default).

Only if this is disabled, firmware is (alone) in control of
performance. (And of course in this case neither pcc-cpufreq nor
intel_pstate will be loaded).

> > It looks like we also should look for the presence of the PCC
> > interface in there.
> >
> > I can provide a patch for that, will you be able to test it?
> 
> Yes, I can test it.
> 
> > >> It should be initialized before pcc-cpufreq (according to their
> > >> respective initcall levels), so in theory intel_pstate should be used
> > >> by default on the affected systems anyway.
> > >
> > >> What BIOS settings need to be changed for that?
> > >
> > > Already answered in other mail.
> > 
> > Indeed.
> 
> 
> Andreas
>
Andreas Herrmann July 17, 2018, 10:18 a.m. UTC | #15
On Tue, Jul 17, 2018 at 11:36:20AM +0200, Andreas Herrmann wrote:
> On Tue, Jul 17, 2018 at 11:27:21AM +0200, Andreas Herrmann wrote:
> > On Tue, Jul 17, 2018 at 11:23:25AM +0200, Rafael J. Wysocki wrote:
> > > On Tue, Jul 17, 2018 at 11:11 AM, Andreas Herrmann <aherrmann@suse.com> wrote:
> > > > On Tue, Jul 17, 2018 at 11:06:29AM +0200, Rafael J. Wysocki wrote:
> > > >> On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote:
> > > >>
> > > >> [cut]
> > > >>
> > > >> >
> > > >> > On balance before this commit users could use pcc-cpufreq but had
> > > >> > already suboptimal performance (compared to say intel_pstate driver
> > > >> > which can be used changing BIOS options).
> > > >>
> > > >> BTW, I wonder why you need to change the BIOS options for intel_pstate to load.
> > > >
> > > > I think this is because of (in intel_pstate_init()):
> > > >
> > > >         /*
> > > >          * The Intel pstate driver will be ignored if the platform
> > > >          * firmware has its own power management modes.
> > > >          */
> > > >         if (intel_pstate_platform_pwr_mgmt_exists())
> > > >                 return -ENODEV;
> > > >
> > > 
> > > OK, because of the "Proliant" entry, right?
> > > 
> > > So it looks like we have an issue there.  We find the entry and we
> > > look for _PSS.  It is not there, so we assume that the firmware is
> > > expected to control performance, which is not the case.

> FYI, there is another BIOS setting on those systems. It's called
> "Collaborative Power Control" (AFAIK enabled by default).
> 
> Only if this is disabled, firmware is (alone) in control of
> performance. (And of course in this case neither pcc-cpufreq nor
> intel_pstate will be loaded).

To clarify:

(i)  When "Dynamic Power Savings Mode" is enabled _PSS objects are
     missing (in fact they seem to be renamed to "XPSS").
     pcc-cpufreq driver evaluates PCCH header and loads.

     Same when "Collaborative Power Control" is disabled. No _PSS but
     XPSS objects. pcc-cpufreq driver fails to evaluate PCCH header,
     no cpufreq driver is loaded.

(ii) There are _PSS objects when "OS Control Mode" is selected.

     (But I think when "Collaborative Power Control" is disabled there
     are no _PSS objects either and the "OS Control Mode" selection
     does not matter.)

> > > It looks like we also should look for the presence of the PCC
> > > interface in there.
> > >
> > > I can provide a patch for that, will you be able to test it?
> > 
> > Yes, I can test it.
> > 
> > > >> It should be initialized before pcc-cpufreq (according to their
> > > >> respective initcall levels), so in theory intel_pstate should be used
> > > >> by default on the affected systems anyway.
> > > >
> > > >> What BIOS settings need to be changed for that?
> > > >
> > > > Already answered in other mail.
> > > 
> > > Indeed.
> > 
> > 
> > Andreas
> >
Andreas Herrmann July 18, 2018, 3:25 p.m. UTC | #16
I think I still owe some performance numbers to show what is wrong
with systems using pcc-cpufreq with Linux after commit 554c8aa8ecad.

Following are results for kernbench tests (from MMTests test suite).
That's just a kernel compile with different number of compile jobs.
As result the time is measured, 5 runs are done for each configuration and
average values calculated.

I've restricted maximum number of jobs to 30. That means that tests
were done for 2, 4, 8, 16, and 30 compile jobs. I had bound all tests
to node 0. (I've used something like "numactl -N 0 ./run-mmtests.sh
--run-monitor <test_name>" to start those tests.)

Tests were done with kernel 4.18.0-rc3 on an HP DL580 Gen8 with Intel
Xeon CPU E7-4890 with latest BIOS installed. System had 4 nodes, 15
CPUs per node (30 logical CPUs with HT enabled). pcc-cpufreq was
active and ondemand governor in use.

I've tested with different number of online CPUs which better
illustrates how idle online CPUs interfere with compile load on node 0
(due to the jitter caused by pcc-cpufreq and its locking).

Average mean for user/system/elapsed time and standard deviation for each
subtest (=number of compile jobs) are as follows:

(Nodes)                      N0                N01                   N0                  N01                 N0123
 (CPUs)                  15CPUs             30CPUs               30CPUs               60CPUs               120CPUs
Amean   user-2   640.82 (0.00%)  675.90   (-5.47%)   789.03   (-23.13%)  1448.58  (-126.05%)  3575.79   (-458.01%)
Amean   user-4   652.18 (0.00%)  689.12   (-5.67%)   868.19   (-33.12%)  1846.66  (-183.15%)  5437.37   (-733.73%)
Amean   user-8   695.00 (0.00%)  732.22   (-5.35%)  1138.30   (-63.78%)  2598.74  (-273.92%)  7413.43   (-966.67%)
Amean   user-16  653.94 (0.00%)  772.48  (-18.13%)  1734.80  (-165.29%)  2699.65  (-312.83%)  9224.47  (-1310.61%)
Amean   user-30  634.91 (0.00%)  701.11  (-10.43%)  1197.37   (-88.59%)  1360.02  (-114.21%)  3732.34   (-487.85%)
Amean   syst-2   235.45 (0.00%)  235.68   (-0.10%)   321.99   (-36.76%)   574.44  (-143.98%)   869.35   (-269.23%)
Amean   syst-4   239.34 (0.00%)  243.09   (-1.57%)   345.07   (-44.18%)   621.00  (-159.47%)  1145.13   (-378.46%)
Amean   syst-8   246.51 (0.00%)  254.83   (-3.37%)   387.49   (-57.19%)   786.63  (-219.10%)  1406.17   (-470.42%)
Amean   syst-16  110.85 (0.00%)  122.21  (-10.25%)   408.25  (-268.31%)   644.41  (-481.36%)  1513.04  (-1264.99%)
Amean   syst-30   82.74 (0.00%)   94.07  (-13.69%)   155.38   (-87.80%)   207.03  (-150.22%)   547.73   (-562.01%)
Amean   elsp-2   625.33 (0.00%)  724.51  (-15.86%)   792.47   (-26.73%)  1537.44  (-145.86%)  3510.22   (-461.34%)
Amean   elsp-4   482.02 (0.00%)  568.26  (-17.89%)   670.26   (-39.05%)  1257.34  (-160.85%)  3120.89   (-547.46%)
Amean   elsp-8   267.75 (0.00%)  337.88  (-26.19%)   430.56   (-60.80%)   978.47  (-265.44%)  2321.91   (-767.18%)
Amean   elsp-16   63.55 (0.00%)   71.79  (-12.97%)   224.83  (-253.79%)   403.94  (-535.65%)  1121.04  (-1664.09%)
Amean   elsp-30   56.76 (0.00%)   62.82  (-10.69%)    66.50   (-17.16%)   124.20  (-118.84%)   303.47   (-434.70%)
Stddev  user-2     1.36 (0.00%)    1.94  (-42.57%)    16.17 (-1090.46%)   119.09 (-8669.75%)   382.74 (-28085.60%)
Stddev  user-4     2.81 (0.00%)    5.08  (-80.78%)     4.88   (-73.66%)   252.56 (-8881.80%)  1133.02 (-40193.16%)
Stddev  user-8     2.30 (0.00%)   15.58 (-578.28%)    30.60 (-1232.63%)   279.35 (-12064.01%) 1050.00 (-45621.61%)
Stddev  user-16    6.76 (0.00%)   25.52 (-277.80%)    78.44 (-1060.97%)   118.29 (-1650.94%)   724.11 (-10617.95%)
Stddev  user-30    0.51 (0.00%)    1.80 (-249.13%)    12.63 (-2354.11%)    25.82 (-4915.43%)  1098.82 (-213365.28%)
Stddev  syst-2     1.52 (0.00%)    2.76  (-81.04%)     3.98  (-161.58%)    36.35 (-2287.16%)    59.09  (-3781.09%)
Stddev  syst-4     2.39 (0.00%)    1.55   (35.25%)     3.24  ( -35.92%)    51.51 (-2057.65%)   175.75  (-7262.43%)
Stddev  syst-8     1.08 (0.00%)    3.70 (-241.40%)     6.83  (-531.33%)    65.80 (-5977.97%)   151.17 (-13864.10%)
Stddev  syst-16    3.78 (0.00%)    5.58  (-47.53%)     4.63  ( -22.44%)    47.90 (-1167.18%)    99.94  (-2543.88%)
Stddev  syst-30    0.31 (0.00%)    0.38  (-22.41%)     3.01  (-862.79%)    27.45 (-8688.85%)   137.94 (-44072.77%)
Stddev  elsp-2    55.14 (0.00%)   55.04    (0.18%)    95.33  ( -72.90%)   103.91   (-88.45%)   302.31   (-448.29%)
Stddev  elsp-4    60.90 (0.00%)   84.42  (-38.62%)    18.92  (  68.94%)   197.60  (-224.46%)   323.53   (-431.24%)
Stddev  elsp-8    16.77 (0.00%)   30.77  (-83.47%)    49.57  (-195.57%)    79.02  (-371.16%)   261.85  (-1461.28%)
Stddev  elsp-16    1.99 (0.00%)    2.88  (-44.60%)    28.11 (-1311.79%)   101.81 (-5012.88%)    62.29  (-3028.36%)
Stddev  elsp-30    0.65 (0.00%)    1.04  (-59.06%)     1.64  (-151.81%)    41.84 (-6308.81%)    75.37 (-11445.61%)

Overall test time for each mmtests invocation was as follows (this is
also given for number-of-cpu configs for which I did not provide
details above).

               N0      N01       N0     N012    N0123      N01    N0123    N0123     N012    N0123     N0123
           15CPUs   30CPUs   30CPUs   45CPUs   60CPUs   60CPUs   75CPUs   90CPUs   90CPUs  105CPUs   120CPUs
User     17196.67 18714.36 30105.65 19239.27 19505.35 53089.39 22690.33 26731.06 38131.74 47627.61 153424.99
System    4807.98  4970.89  8533.95  5136.97  5184.24 16351.67  6135.29  7152.66 10920.76 12362.39  32129.74
Elapsed   7796.46  9166.55 11518.51  9274.77  9030.39 25465.38  9361.60 10677.63 15633.49 18900.46  60908.28

The results given for 120 online CPUs on nodes 0-3 indicate what I
meant with the "system being almost unusable". When trying to gather
results with kernel 4.17.5 and 120 CPUs, one iteration of kernbench (1
kernel compile) with 2 jobs even took about 6 hours. Maybe it was an
extreme outlier but I dismissed to further use that kernel (w/o
modifications) for further tests.


Andreas
Andreas Herrmann July 18, 2018, 3:31 p.m. UTC | #17
On Wed, Jul 18, 2018 at 05:25:56PM +0200, Andreas Herrmann wrote:

 ...

> Average mean for user/system/elapsed time and standard deviation for each

Oops, of course I meant arithmetic mean.

 ...
Andreas Herrmann July 19, 2018, 11:04 a.m. UTC | #18
For the sake of completeness following are given the remaining sets of
kernbench results related to this thread.

Setup for kernbench test is as described in previous mails but now all
120 logical CPUs were online in all tests. Test runs were still pinned
to node 0.

Common legend for below tables is:

   OSCM: "OS Control Mode"
   DPSM: "Dynamic Power Savings Mode"
idle_rb: partial rollback of 554c8aa8ecad ("sched: idle: Select idle
         state before stopping the tick") as described in initial mail
         of this thread

(A) intel_pstate (in powersave mode) performance wrt effect of commit
    554c8aa8ecad and wrt to potential interference from platform code

 Kernel v4.18-rc5-36-g30b06abfb92b + patch for intel_pstate to load it
 instead of pcc-cpufreq when system is in DPSM.

 Detailed results for each number of compile jobs:
 (OSCM is baseline, values in parenthesis show comparison to baseline)

                    OSCM                OSCM                DPSM                DPSM
                                     idle_rb                                 idle_rb
 Amean  user-2    600.58   596.38 (   0.70%)   685.94 ( -14.21%)   688.78 ( -14.69%)
 Amean  user-4    583.90   586.34 (  -0.42%)   626.37 (  -7.27%)   622.17 (  -6.55%)
 Amean  user-8    584.78   581.52 (   0.56%)   600.89 (  -2.75%)   595.53 (  -1.84%)
 Amean  user-16   705.07   688.62 (   2.33%)   705.16 (  -0.01%)   682.44 (   3.21%)
 Amean  user-30  1017.25  1022.39 (  -0.51%)  1025.23 (  -0.78%)  1022.61 (  -0.53%)
 Amean  syst-2    172.17   174.08 (  -1.11%)   184.73 (  -7.30%)   186.13 (  -8.11%)
 Amean  syst-4    183.88   180.44 (   1.87%)   191.70 (  -4.25%)   192.24 (  -4.54%)
 Amean  syst-8    193.40   193.81 (  -0.21%)   198.01 (  -2.38%)   193.96 (  -0.29%)
 Amean  syst-16   183.97   180.40 (   1.94%)   184.00 (  -0.01%)   182.10 (   1.02%)
 Amean  syst-30   122.36   122.08 (   0.23%)   122.53 (  -0.14%)   122.17 (   0.15%)
 Amean  elsp-2    610.90   634.64 (  -3.89%)   667.67 (  -9.29%)   661.81 (  -8.33%)
 Amean  elsp-4    413.54   488.02 ( -18.01%)   433.79 (  -4.90%)   407.30 (   1.51%)
 Amean  elsp-8    261.85   218.25 (  16.65%)   246.62 (   5.82%)   219.55 (  16.15%)
 Amean  elsp-16    89.27    99.36 ( -11.30%)    92.74 (  -3.89%)   102.74 ( -15.09%)
 Amean  elsp-30    47.07    47.04 (   0.08%)    48.82 (  -3.72%)    48.28 (  -2.57%)
 Stddev user-2      6.06     7.53 ( -24.21%)    31.88 (-425.98%)    25.79 (-325.57%)
 Stddev user-4      7.05    14.48 (-105.40%)    11.82 ( -67.63%)    12.14 ( -72.22%)
 Stddev user-8      5.69     1.18 (  79.28%)    18.75 (-229.45%)     7.03 ( -23.51%)
 Stddev user-16     6.41    15.74 (-145.55%)    12.87 (-100.75%)    10.59 ( -65.19%)
 Stddev user-30     2.62     2.80 (  -6.56%)     2.92 ( -11.31%)     2.45 (   6.52%)
 Stddev syst-2      3.48     2.81 (  19.28%)     2.27 (  34.73%)     1.47 (  57.83%)
 Stddev syst-4      4.04     4.69 ( -16.03%)     2.16 (  46.42%)     0.84 (  79.32%)
 Stddev syst-8      3.96     1.42 (  64.11%)     2.34 (  40.98%)     1.93 (  51.24%)
 Stddev syst-16     2.01     2.33 ( -15.76%)     1.33 (  33.89%)     1.94 (   3.74%)
 Stddev syst-30     0.76     0.38 (  50.10%)     0.91 ( -19.48%)     0.17 (  77.86%)
 Stddev elsp-2     44.55    58.37 ( -31.01%)   110.11 (-147.15%)    82.81 ( -85.88%)
 Stddev elsp-4     62.39   109.75 ( -75.90%)    48.32 (  22.56%)    47.10 (  24.52%)
 Stddev elsp-8     59.01    25.95 (  56.02%)    71.44 ( -21.07%)    37.83 (  35.89%)
 Stddev elsp-16    10.47    23.88 (-128.08%)    11.98 ( -14.41%)    15.42 ( -47.32%)
 Stddev elsp-30     0.26     0.64 (-142.06%)     0.39 ( -46.53%)     0.44 ( -66.71%)

 Overall test time:

               OSCM     OSCM     DPSM     DPSM
                     idle_rb           idle_rb
 User      18681.59 18599.99 19450.38 19289.33
 System     4487.76  4458.55  4620.80  4595.13
 Elapsed    7407.07  7725.86  7765.91  7502.72

 Overall test run-time is comparable. Commit 554c8aa8ecad does not
 seem to have a significant impact on performance (I don't have
 numbers for power consumption). Comparing OSCM vs. DPSM: it seems
 that its better to switch system into OSCM.


(B) performance of intel_pstate (in powersave mode and system in DPSM)
    vs. pcc-cpufreq (with ondemand governor)

 Results for pcc-cpufreq were obtained with v4.17.5+misc modifications.

 intel_pstate results were obtained with v4.18-rc5-36-g30b06abfb92b +
 patch for intel_pstate to load it instead of pcc-cpufreq when system
 is in DPSM.

 So strictly speaking this is no correct comparison but at least it
 gives an idea where the limits are with pcc-cpufreq and why its
 better to just switch to intel_pstate.
 
 pcc-cpufreq driver modifications were

 freqtable: pcc-cpufreq modified to use fixed table of 4 frequencies
  deadband: pcc-cpufreq modified to re-introduce so called deadband
            effect which keeps CPU at minimum frequency if target
            frequency would be in the calculated deadband

           intel_pstate        pcc-cpufreq        pcc-cpufreq        pcc-cpufreq
                   DPSM            idle_rb  idle_rb+freqtable   idle_rb+deadband
 Amean  user-2   685.94  834.15 ( -21.61%)  648.68 (   5.43%)  636.63 (   7.19%)
 Amean  user-4   626.37  902.09 ( -44.02%)  657.43 (  -4.96%)  615.49 (   1.74%)
 Amean  user-8   600.89 1078.37 ( -79.46%)  723.05 ( -20.33%)  646.23 (  -7.55%)
 Amean  user-16  705.16 1640.89 (-132.70%) 1096.61 ( -55.51%)  904.17 ( -28.22%)
 Amean  user-30 1025.23 1463.90 ( -42.79%) 1156.17 ( -12.77%) 1151.40 ( -12.31%)
 Amean  syst-2   184.73  232.17 ( -25.68%)  178.24 (   3.51%)  172.09 (   6.84%)
 Amean  syst-4   191.70  257.22 ( -34.18%)  194.16 (  -1.29%)  188.10 (   1.88%)
 Amean  syst-8   198.01  313.67 ( -58.41%)  228.34 ( -15.31%)  206.99 (  -4.53%)
 Amean  syst-16  184.00  393.92 (-114.09%)  279.89 ( -52.12%)  241.83 ( -31.43%)
 Amean  syst-30  122.53  185.98 ( -51.79%)  143.28 ( -16.94%)  140.45 ( -14.62%)
 Amean  elsp-2   667.67  769.28 ( -15.22%)  635.68 (   4.79%)  651.51 (   2.42%)
 Amean  elsp-4   433.79  614.27 ( -41.60%)  440.45 (  -1.53%)  392.80 (   9.45%)
 Amean  elsp-8   246.62  397.54 ( -61.19%)  252.27 (  -2.29%)  239.21 (   3.01%)
 Amean  elsp-16   92.74  207.43 (-123.68%)  138.00 ( -48.81%)  119.98 ( -29.37%)
 Amean  elsp-30   48.82   72.66 ( -48.83%)   55.95 ( -14.60%)   54.32 ( -11.27%)
 Stddev user-2    31.88   15.22 (  52.26%)    7.77 (  75.63%)    6.63 (  79.21%)
 Stddev user-4    11.82   32.20 (-172.49%)    3.37 (  71.44%)    6.44 (  45.49%)
 Stddev user-8    18.75   33.99 ( -81.29%)    6.96 (  62.86%)    5.82 (  68.97%)
 Stddev user-16   12.87   70.72 (-449.46%)   31.19 (-142.30%)   28.88 (-124.40%)
 Stddev user-30    2.92   26.08 (-792.64%)    6.16 (-110.99%)   10.90 (-273.16%)
 Stddev syst-2     2.27    4.44 ( -95.54%)    4.15 ( -82.48%)    2.09 (   8.11%)
 Stddev syst-4     2.16    8.46 (-290.74%)    3.71 ( -71.58%)    2.45 ( -12.99%)
 Stddev syst-8     2.34   10.73 (-359.70%)    3.98 ( -70.62%)    4.39 ( -87.80%)
 Stddev syst-16    1.33   11.44 (-759.46%)    2.14 ( -60.49%)    2.93 (-120.24%)
 Stddev syst-30    0.91    4.88 (-436.79%)    1.37 ( -50.11%)    2.36 (-159.71%)
 Stddev elsp-2   110.11   85.53 (  22.32%)   87.11 (  20.89%)   37.33 (  66.10%)
 Stddev elsp-4    48.32  130.17 (-169.39%)   59.81 ( -23.79%)   26.15 (  45.88%)
 Stddev elsp-8    71.44   86.47 ( -21.03%)   12.87 (  81.98%)   43.88 (  38.58%)
 Stddev elsp-16   11.98   13.63 ( -13.82%)    8.94 (  25.35%)    5.97 (  50.15%)
 Stddev elsp-30    0.39    2.64 (-582.23%)    0.62 ( -58.97%)    0.95 (-144.47%)

         intel_pstate pcc-cpufreq pcc-cpufreq pcc-cpufreq
                 DPSM     idle_rb    idle_rb+    idle_rb+
                                    freqtable    deadband
 User        19450.38    31273.96    22689.14    21050.35
 System       4620.80     7327.67     5364.63     4984.36
 Elapsed      7765.91    10997.49     7935.53     7593.74

 Again I have no numbers for power consumption.

 Note that I've stopped an attempt to collect results for pcc-cpufreq
 with unmodififed v4.17.5 (ie. w/o idle_rb) after the first iteration
 (compiling kernel with 2 jobs) took several hours.


Andreas
diff mbox

Patch

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 1a3e9bddd17b..377a62ec475c 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -185,18 +185,13 @@  static void cpuidle_idle_call(void)
 	} else {
 		bool stop_tick = true;
 
+		tick_nohz_idle_stop_tick();
+		rcu_idle_enter();
 		/*
 		 * Ask the cpuidle framework to choose a convenient idle state.
 		 */
 		next_state = cpuidle_select(drv, dev, &stop_tick);
 
-		if (stop_tick)
-			tick_nohz_idle_stop_tick();
-		else
-			tick_nohz_idle_retain_tick();
-
-		rcu_idle_enter();
-
 		entered_state = call_cpuidle(drv, dev, next_state);
 		/*
 		 * Give the governor an opportunity to reflect on the outcome