Message ID | 20180717065048.74mmgk4t5utjaa6a@suselix (mailing list archive) |
---|---|
State | Not Applicable, archived |
Headers | show |
Hi, Thanks for your report! On Tue, Jul 17, 2018 at 8:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote: > Hello, > > I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select > idle state before stopping the tick") causes severe performance drop > for systems using pcc-cpufreq driver. Depending on the number of CPUs > the system might be almost unusable. The OS jitter for 4.17.y and > 4.18.-rcx kernels is off the charts, you can even spot it with top > command (issued when the system is supposedly idle), e.g. > > top - 14:44:24 up 2 min, 1 user, load average: 90.11, 38.20, 14.38 > Tasks: 1199 total, 109 running, 541 sleeping, 0 stopped, 0 zombie > %Cpu(s): 1.2 us, 58.7 sy, 0.0 ni, 39.3 id, 0.6 wa, 0.0 hi, 0.3 si, 0.0 st > KiB Mem: 13137064+total, 1192168 used, 13017848+free, 2340 buffers > KiB Swap: 2104316 total, 0 used, 2104316 free. 522296 cached Mem > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 3373 root 20 0 982024 49916 36120 R 96.691 0.038 0:19.54 kubelet > 67 root 20 0 0 0 0 R 78.676 0.000 0:49.36 kworker/9:0 > 25 root 20 0 0 0 0 R 78.125 0.000 0:49.67 kworker/2:0 > 182 root 20 0 0 0 0 R 75.735 0.000 1:18.17 kworker/28:0 > 43 root 20 0 0 0 0 R 75.000 0.000 0:11.56 kworker/5:0 > 103 root 20 0 0 0 0 R 74.449 0.000 0:46.83 kworker/15:0 > 334 root 20 0 0 0 0 R 72.978 0.000 1:06.88 kworker/53:0 > 789 root 20 0 0 0 0 R 69.853 0.000 1:29.50 kworker/38:1 > 418 root 20 0 0 0 0 R 69.301 0.000 0:41.33 kworker/67:0 > 779 root 20 0 0 0 0 R 68.934 0.000 1:33.60 kworker/27:1 > 773 root 20 0 0 0 0 R 68.566 0.000 1:37.91 kworker/22:1 > 762 root 20 0 0 0 0 R 68.015 0.000 1:41.01 kworker/11:1 > 769 root 20 0 0 0 0 R 67.647 0.000 1:37.65 kworker/18:1 > 805 root 20 0 0 0 0 R 67.096 0.000 1:30.96 kworker/54:1 > 840 root 20 0 0 0 0 R 66.912 0.000 1:23.82 kworker/89:1 > 812 root 20 0 0 0 0 R 66.728 0.000 1:31.89 kworker/59:1 > 847 root 20 0 0 0 0 R 66.360 0.000 1:28.40 kworker/96:1 > 763 root 20 0 0 0 0 R 66.176 0.000 1:42.57 kworker/12:1 > 772 root 20 0 0 0 0 R 66.176 0.000 1:12.58 kworker/21:1 > 821 root 20 0 0 0 0 R 66.176 0.000 1:29.62 kworker/69:1 > 923 root 20 0 0 0 0 R 65.809 0.000 1:44.32 kworker/3:18 > 1284 root 20 0 0 0 0 R 65.809 0.000 1:23.50 kworker/101:2 > 61 root 20 0 0 0 0 R 65.625 0.000 1:29.37 kworker/8:0 > 3531 root 20 0 24384 3768 2356 R 65.625 0.003 0:08.91 top > 771 root 20 0 0 0 0 R 65.074 0.000 1:37.90 kworker/20:1 > 767 root 20 0 0 0 0 R 64.706 0.000 1:38.01 kworker/16:1 > 764 root 20 0 0 0 0 R 64.522 0.000 1:40.28 kworker/13:1 > 765 root 20 0 0 0 0 R 64.154 0.000 1:40.13 kworker/14:1 > > When I apply below patch (trying to revert essential parts of commit > 554c8aa8ecad) behaviour seems back to normal. Well, that basically defeats the purpose of the change in commit 554c8aa8ecad, so it's not what I'd like to do to fix this problem. Also it would be good to understand what actually happens. > I know that pcc-cpufreq driver is not "state-of-the-art" when it comes > to cpufreq drivers and you better not use it. That's exactly right. > But I wonder whether commit 554c8aa8ecad ("sched: idle: Select idle state before > stopping the tick") introduced bad behaviour for other cases as well. It has been tested quite extensively in that respect, although admittedly not with the pcc-cpufreq driver. Nothing bad related to it has been has been reported so far, FWIW. > I'll send some performance results to illustrate the issue asap. I've > also tried to modify pcc-cpufreq to reduce the amount of frequency > changes triggered by this driver but this does not help for kernels > where commit 554c8aa8ecad is applied. Can you replace pcc-cpufreq with a different cpufreq driver on the affected systems? If so, do performance numbers look bad after that too? Thanks, Rafael
On Tue, Jul 17, 2018 at 9:33 AM, Rafael J. Wysocki <rafael@kernel.org> wrote: > Hi, > > Thanks for your report! > > On Tue, Jul 17, 2018 at 8:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote: >> Hello, >> >> I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select >> idle state before stopping the tick") causes severe performance drop >> for systems using pcc-cpufreq driver. Depending on the number of CPUs >> the system might be almost unusable. The OS jitter for 4.17.y and >> 4.18.-rcx kernels is off the charts, you can even spot it with top >> command (issued when the system is supposedly idle), e.g. >> >> top - 14:44:24 up 2 min, 1 user, load average: 90.11, 38.20, 14.38 >> Tasks: 1199 total, 109 running, 541 sleeping, 0 stopped, 0 zombie >> %Cpu(s): 1.2 us, 58.7 sy, 0.0 ni, 39.3 id, 0.6 wa, 0.0 hi, 0.3 si, 0.0 st >> KiB Mem: 13137064+total, 1192168 used, 13017848+free, 2340 buffers >> KiB Swap: 2104316 total, 0 used, 2104316 free. 522296 cached Mem >> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >> 3373 root 20 0 982024 49916 36120 R 96.691 0.038 0:19.54 kubelet >> 67 root 20 0 0 0 0 R 78.676 0.000 0:49.36 kworker/9:0 >> 25 root 20 0 0 0 0 R 78.125 0.000 0:49.67 kworker/2:0 >> 182 root 20 0 0 0 0 R 75.735 0.000 1:18.17 kworker/28:0 >> 43 root 20 0 0 0 0 R 75.000 0.000 0:11.56 kworker/5:0 >> 103 root 20 0 0 0 0 R 74.449 0.000 0:46.83 kworker/15:0 >> 334 root 20 0 0 0 0 R 72.978 0.000 1:06.88 kworker/53:0 >> 789 root 20 0 0 0 0 R 69.853 0.000 1:29.50 kworker/38:1 >> 418 root 20 0 0 0 0 R 69.301 0.000 0:41.33 kworker/67:0 >> 779 root 20 0 0 0 0 R 68.934 0.000 1:33.60 kworker/27:1 >> 773 root 20 0 0 0 0 R 68.566 0.000 1:37.91 kworker/22:1 >> 762 root 20 0 0 0 0 R 68.015 0.000 1:41.01 kworker/11:1 >> 769 root 20 0 0 0 0 R 67.647 0.000 1:37.65 kworker/18:1 >> 805 root 20 0 0 0 0 R 67.096 0.000 1:30.96 kworker/54:1 >> 840 root 20 0 0 0 0 R 66.912 0.000 1:23.82 kworker/89:1 >> 812 root 20 0 0 0 0 R 66.728 0.000 1:31.89 kworker/59:1 >> 847 root 20 0 0 0 0 R 66.360 0.000 1:28.40 kworker/96:1 >> 763 root 20 0 0 0 0 R 66.176 0.000 1:42.57 kworker/12:1 >> 772 root 20 0 0 0 0 R 66.176 0.000 1:12.58 kworker/21:1 >> 821 root 20 0 0 0 0 R 66.176 0.000 1:29.62 kworker/69:1 >> 923 root 20 0 0 0 0 R 65.809 0.000 1:44.32 kworker/3:18 >> 1284 root 20 0 0 0 0 R 65.809 0.000 1:23.50 kworker/101:2 >> 61 root 20 0 0 0 0 R 65.625 0.000 1:29.37 kworker/8:0 >> 3531 root 20 0 24384 3768 2356 R 65.625 0.003 0:08.91 top >> 771 root 20 0 0 0 0 R 65.074 0.000 1:37.90 kworker/20:1 >> 767 root 20 0 0 0 0 R 64.706 0.000 1:38.01 kworker/16:1 >> 764 root 20 0 0 0 0 R 64.522 0.000 1:40.28 kworker/13:1 >> 765 root 20 0 0 0 0 R 64.154 0.000 1:40.13 kworker/14:1 >> >> When I apply below patch (trying to revert essential parts of commit >> 554c8aa8ecad) behaviour seems back to normal. > > Well, that basically defeats the purpose of the change in commit > 554c8aa8ecad, so it's not what I'd like to do to fix this problem. > > Also it would be good to understand what actually happens. > >> I know that pcc-cpufreq driver is not "state-of-the-art" when it comes >> to cpufreq drivers and you better not use it. > > That's exactly right. > >> But I wonder whether commit 554c8aa8ecad ("sched: idle: Select idle state before >> stopping the tick") introduced bad behaviour for other cases as well. > > It has been tested quite extensively in that respect, although > admittedly not with the pcc-cpufreq driver. > > Nothing bad related to it has been has been reported so far, FWIW. > >> I'll send some performance results to illustrate the issue asap. I've >> also tried to modify pcc-cpufreq to reduce the amount of frequency >> changes triggered by this driver but this does not help for kernels >> where commit 554c8aa8ecad is applied. > > Can you replace pcc-cpufreq with a different cpufreq driver on the > affected systems? If so, do performance numbers look bad after that > too? Also, what cpufreq governor do you use with pcc-cpufreq? Does changing it to something like "performance" improve things?
On 17/07/2018 09:33, Rafael J. Wysocki wrote: > Hi, > > Thanks for your report! > > On Tue, Jul 17, 2018 at 8:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote: >> Hello, >> >> I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select >> idle state before stopping the tick") causes severe performance drop >> for systems using pcc-cpufreq driver. Depending on the number of CPUs >> the system might be almost unusable. The OS jitter for 4.17.y and >> 4.18.-rcx kernels is off the charts, you can even spot it with top >> command (issued when the system is supposedly idle), e.g. >> >> top - 14:44:24 up 2 min, 1 user, load average: 90.11, 38.20, 14.38 >> Tasks: 1199 total, 109 running, 541 sleeping, 0 stopped, 0 zombie >> %Cpu(s): 1.2 us, 58.7 sy, 0.0 ni, 39.3 id, 0.6 wa, 0.0 hi, 0.3 si, 0.0 st >> KiB Mem: 13137064+total, 1192168 used, 13017848+free, 2340 buffers >> KiB Swap: 2104316 total, 0 used, 2104316 free. 522296 cached Mem >> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >> 3373 root 20 0 982024 49916 36120 R 96.691 0.038 0:19.54 kubelet >> 67 root 20 0 0 0 0 R 78.676 0.000 0:49.36 kworker/9:0 >> 25 root 20 0 0 0 0 R 78.125 0.000 0:49.67 kworker/2:0 >> 182 root 20 0 0 0 0 R 75.735 0.000 1:18.17 kworker/28:0 >> 43 root 20 0 0 0 0 R 75.000 0.000 0:11.56 kworker/5:0 >> 103 root 20 0 0 0 0 R 74.449 0.000 0:46.83 kworker/15:0 >> 334 root 20 0 0 0 0 R 72.978 0.000 1:06.88 kworker/53:0 >> 789 root 20 0 0 0 0 R 69.853 0.000 1:29.50 kworker/38:1 >> 418 root 20 0 0 0 0 R 69.301 0.000 0:41.33 kworker/67:0 >> 779 root 20 0 0 0 0 R 68.934 0.000 1:33.60 kworker/27:1 >> 773 root 20 0 0 0 0 R 68.566 0.000 1:37.91 kworker/22:1 >> 762 root 20 0 0 0 0 R 68.015 0.000 1:41.01 kworker/11:1 >> 769 root 20 0 0 0 0 R 67.647 0.000 1:37.65 kworker/18:1 >> 805 root 20 0 0 0 0 R 67.096 0.000 1:30.96 kworker/54:1 >> 840 root 20 0 0 0 0 R 66.912 0.000 1:23.82 kworker/89:1 >> 812 root 20 0 0 0 0 R 66.728 0.000 1:31.89 kworker/59:1 >> 847 root 20 0 0 0 0 R 66.360 0.000 1:28.40 kworker/96:1 >> 763 root 20 0 0 0 0 R 66.176 0.000 1:42.57 kworker/12:1 >> 772 root 20 0 0 0 0 R 66.176 0.000 1:12.58 kworker/21:1 >> 821 root 20 0 0 0 0 R 66.176 0.000 1:29.62 kworker/69:1 >> 923 root 20 0 0 0 0 R 65.809 0.000 1:44.32 kworker/3:18 >> 1284 root 20 0 0 0 0 R 65.809 0.000 1:23.50 kworker/101:2 >> 61 root 20 0 0 0 0 R 65.625 0.000 1:29.37 kworker/8:0 >> 3531 root 20 0 24384 3768 2356 R 65.625 0.003 0:08.91 top >> 771 root 20 0 0 0 0 R 65.074 0.000 1:37.90 kworker/20:1 >> 767 root 20 0 0 0 0 R 64.706 0.000 1:38.01 kworker/16:1 >> 764 root 20 0 0 0 0 R 64.522 0.000 1:40.28 kworker/13:1 >> 765 root 20 0 0 0 0 R 64.154 0.000 1:40.13 kworker/14:1 >> >> When I apply below patch (trying to revert essential parts of commit >> 554c8aa8ecad) behaviour seems back to normal. > > Well, that basically defeats the purpose of the change in commit > 554c8aa8ecad, so it's not what I'd like to do to fix this problem. > > Also it would be good to understand what actually happens. > >> I know that pcc-cpufreq driver is not "state-of-the-art" when it comes >> to cpufreq drivers and you better not use it. > > That's exactly right. > >> But I wonder whether commit 554c8aa8ecad ("sched: idle: Select idle state before >> stopping the tick") introduced bad behaviour for other cases as well. > > It has been tested quite extensively in that respect, although > admittedly not with the pcc-cpufreq driver. > > Nothing bad related to it has been has been reported so far, FWIW. I confirm we benchmarked on ARM64 and we did not notice a big performance drop expect a few percent due to more cluster idle states which is logical and compensated by a big energy improvement. >> I'll send some performance results to illustrate the issue asap. I've >> also tried to modify pcc-cpufreq to reduce the amount of frequency >> changes triggered by this driver but this does not help for kernels >> where commit 554c8aa8ecad is applied. > > Can you replace pcc-cpufreq with a different cpufreq driver on the > affected systems? If so, do performance numbers look bad after that > too? > > Thanks, > Rafael >
On Tue, Jul 17, 2018 at 08:50:48AM +0200, Andreas Herrmann wrote: > I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select > idle state before stopping the tick") causes severe performance drop > for systems using pcc-cpufreq driver. Depending on the number of CPUs > the system might be almost unusable. Have you looked at that driver? It features a global lock which is taken for all cpufreq operations. Once you get past a handful of cpus that will constrict your system exactly as you say. What kind of machine are you running this on, and can you configure the thing to use anything else? If so, do so and never look back.
On Tue, Jul 17, 2018 at 09:33:53AM +0200, Rafael J. Wysocki wrote: > Hi, > > Thanks for your report! > > On Tue, Jul 17, 2018 at 8:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote: > > Hello, > > > > I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select > > idle state before stopping the tick") causes severe performance drop > > for systems using pcc-cpufreq driver. Depending on the number of CPUs > > the system might be almost unusable. The OS jitter for 4.17.y and > > 4.18.-rcx kernels is off the charts, you can even spot it with top > > command (issued when the system is supposedly idle), e.g. > > > > top - 14:44:24 up 2 min, 1 user, load average: 90.11, 38.20, 14.38 > > Tasks: 1199 total, 109 running, 541 sleeping, 0 stopped, 0 zombie > > %Cpu(s): 1.2 us, 58.7 sy, 0.0 ni, 39.3 id, 0.6 wa, 0.0 hi, 0.3 si, 0.0 st > > KiB Mem: 13137064+total, 1192168 used, 13017848+free, 2340 buffers > > KiB Swap: 2104316 total, 0 used, 2104316 free. 522296 cached Mem > > > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > > 3373 root 20 0 982024 49916 36120 R 96.691 0.038 0:19.54 kubelet > > 67 root 20 0 0 0 0 R 78.676 0.000 0:49.36 kworker/9:0 > > 25 root 20 0 0 0 0 R 78.125 0.000 0:49.67 kworker/2:0 > > 182 root 20 0 0 0 0 R 75.735 0.000 1:18.17 kworker/28:0 > > 43 root 20 0 0 0 0 R 75.000 0.000 0:11.56 kworker/5:0 > > 103 root 20 0 0 0 0 R 74.449 0.000 0:46.83 kworker/15:0 > > 334 root 20 0 0 0 0 R 72.978 0.000 1:06.88 kworker/53:0 > > 789 root 20 0 0 0 0 R 69.853 0.000 1:29.50 kworker/38:1 > > 418 root 20 0 0 0 0 R 69.301 0.000 0:41.33 kworker/67:0 > > 779 root 20 0 0 0 0 R 68.934 0.000 1:33.60 kworker/27:1 > > 773 root 20 0 0 0 0 R 68.566 0.000 1:37.91 kworker/22:1 > > 762 root 20 0 0 0 0 R 68.015 0.000 1:41.01 kworker/11:1 > > 769 root 20 0 0 0 0 R 67.647 0.000 1:37.65 kworker/18:1 > > 805 root 20 0 0 0 0 R 67.096 0.000 1:30.96 kworker/54:1 > > 840 root 20 0 0 0 0 R 66.912 0.000 1:23.82 kworker/89:1 > > 812 root 20 0 0 0 0 R 66.728 0.000 1:31.89 kworker/59:1 > > 847 root 20 0 0 0 0 R 66.360 0.000 1:28.40 kworker/96:1 > > 763 root 20 0 0 0 0 R 66.176 0.000 1:42.57 kworker/12:1 > > 772 root 20 0 0 0 0 R 66.176 0.000 1:12.58 kworker/21:1 > > 821 root 20 0 0 0 0 R 66.176 0.000 1:29.62 kworker/69:1 > > 923 root 20 0 0 0 0 R 65.809 0.000 1:44.32 kworker/3:18 > > 1284 root 20 0 0 0 0 R 65.809 0.000 1:23.50 kworker/101:2 > > 61 root 20 0 0 0 0 R 65.625 0.000 1:29.37 kworker/8:0 > > 3531 root 20 0 24384 3768 2356 R 65.625 0.003 0:08.91 top > > 771 root 20 0 0 0 0 R 65.074 0.000 1:37.90 kworker/20:1 > > 767 root 20 0 0 0 0 R 64.706 0.000 1:38.01 kworker/16:1 > > 764 root 20 0 0 0 0 R 64.522 0.000 1:40.28 kworker/13:1 > > 765 root 20 0 0 0 0 R 64.154 0.000 1:40.13 kworker/14:1 > > > > When I apply below patch (trying to revert essential parts of commit > > 554c8aa8ecad) behaviour seems back to normal. > > Well, that basically defeats the purpose of the change in commit > 554c8aa8ecad, so it's not what I'd like to do to fix this problem. > > Also it would be good to understand what actually happens. > > > I know that pcc-cpufreq driver is not "state-of-the-art" when it comes > > to cpufreq drivers and you better not use it. > > That's exactly right. > > > But I wonder whether commit 554c8aa8ecad ("sched: idle: Select idle state before > > stopping the tick") introduced bad behaviour for other cases as well. > > It has been tested quite extensively in that respect, although > admittedly not with the pcc-cpufreq driver. > > Nothing bad related to it has been has been reported so far, FWIW. > > > I'll send some performance results to illustrate the issue asap. I've > > also tried to modify pcc-cpufreq to reduce the amount of frequency > > changes triggered by this driver but this does not help for kernels > > where commit 554c8aa8ecad is applied. > > Can you replace pcc-cpufreq with a different cpufreq driver on the > affected systems? If so, do performance numbers look bad after that > too? I have no performance numbers yet for other cpufreq drivers on this system (checking this commit). But I'll look it at next. Andreas
On Tue, Jul 17, 2018 at 10:03:41AM +0200, Rafael J. Wysocki wrote: > On Tue, Jul 17, 2018 at 9:33 AM, Rafael J. Wysocki <rafael@kernel.org> wrote: > > Hi, > > > > Thanks for your report! > > > > On Tue, Jul 17, 2018 at 8:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote: > >> Hello, > >> > >> I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select > >> idle state before stopping the tick") causes severe performance drop > >> for systems using pcc-cpufreq driver. Depending on the number of CPUs > >> the system might be almost unusable. The OS jitter for 4.17.y and > >> 4.18.-rcx kernels is off the charts, you can even spot it with top > >> command (issued when the system is supposedly idle), e.g. > >> > >> top - 14:44:24 up 2 min, 1 user, load average: 90.11, 38.20, 14.38 > >> Tasks: 1199 total, 109 running, 541 sleeping, 0 stopped, 0 zombie > >> %Cpu(s): 1.2 us, 58.7 sy, 0.0 ni, 39.3 id, 0.6 wa, 0.0 hi, 0.3 si, 0.0 st > >> KiB Mem: 13137064+total, 1192168 used, 13017848+free, 2340 buffers > >> KiB Swap: 2104316 total, 0 used, 2104316 free. 522296 cached Mem > >> > >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > >> 3373 root 20 0 982024 49916 36120 R 96.691 0.038 0:19.54 kubelet > >> 67 root 20 0 0 0 0 R 78.676 0.000 0:49.36 kworker/9:0 > >> 25 root 20 0 0 0 0 R 78.125 0.000 0:49.67 kworker/2:0 > >> 182 root 20 0 0 0 0 R 75.735 0.000 1:18.17 kworker/28:0 > >> 43 root 20 0 0 0 0 R 75.000 0.000 0:11.56 kworker/5:0 > >> 103 root 20 0 0 0 0 R 74.449 0.000 0:46.83 kworker/15:0 > >> 334 root 20 0 0 0 0 R 72.978 0.000 1:06.88 kworker/53:0 > >> 789 root 20 0 0 0 0 R 69.853 0.000 1:29.50 kworker/38:1 > >> 418 root 20 0 0 0 0 R 69.301 0.000 0:41.33 kworker/67:0 > >> 779 root 20 0 0 0 0 R 68.934 0.000 1:33.60 kworker/27:1 > >> 773 root 20 0 0 0 0 R 68.566 0.000 1:37.91 kworker/22:1 > >> 762 root 20 0 0 0 0 R 68.015 0.000 1:41.01 kworker/11:1 > >> 769 root 20 0 0 0 0 R 67.647 0.000 1:37.65 kworker/18:1 > >> 805 root 20 0 0 0 0 R 67.096 0.000 1:30.96 kworker/54:1 > >> 840 root 20 0 0 0 0 R 66.912 0.000 1:23.82 kworker/89:1 > >> 812 root 20 0 0 0 0 R 66.728 0.000 1:31.89 kworker/59:1 > >> 847 root 20 0 0 0 0 R 66.360 0.000 1:28.40 kworker/96:1 > >> 763 root 20 0 0 0 0 R 66.176 0.000 1:42.57 kworker/12:1 > >> 772 root 20 0 0 0 0 R 66.176 0.000 1:12.58 kworker/21:1 > >> 821 root 20 0 0 0 0 R 66.176 0.000 1:29.62 kworker/69:1 > >> 923 root 20 0 0 0 0 R 65.809 0.000 1:44.32 kworker/3:18 > >> 1284 root 20 0 0 0 0 R 65.809 0.000 1:23.50 kworker/101:2 > >> 61 root 20 0 0 0 0 R 65.625 0.000 1:29.37 kworker/8:0 > >> 3531 root 20 0 24384 3768 2356 R 65.625 0.003 0:08.91 top > >> 771 root 20 0 0 0 0 R 65.074 0.000 1:37.90 kworker/20:1 > >> 767 root 20 0 0 0 0 R 64.706 0.000 1:38.01 kworker/16:1 > >> 764 root 20 0 0 0 0 R 64.522 0.000 1:40.28 kworker/13:1 > >> 765 root 20 0 0 0 0 R 64.154 0.000 1:40.13 kworker/14:1 > >> > >> When I apply below patch (trying to revert essential parts of commit > >> 554c8aa8ecad) behaviour seems back to normal. > > > > Well, that basically defeats the purpose of the change in commit > > 554c8aa8ecad, so it's not what I'd like to do to fix this problem. > > > > Also it would be good to understand what actually happens. > > > >> I know that pcc-cpufreq driver is not "state-of-the-art" when it comes > >> to cpufreq drivers and you better not use it. > > > > That's exactly right. > > > >> But I wonder whether commit 554c8aa8ecad ("sched: idle: Select idle state before > >> stopping the tick") introduced bad behaviour for other cases as well. > > > > It has been tested quite extensively in that respect, although > > admittedly not with the pcc-cpufreq driver. > > > > Nothing bad related to it has been has been reported so far, FWIW. > > > >> I'll send some performance results to illustrate the issue asap. I've > >> also tried to modify pcc-cpufreq to reduce the amount of frequency > >> changes triggered by this driver but this does not help for kernels > >> where commit 554c8aa8ecad is applied. > > > > Can you replace pcc-cpufreq with a different cpufreq driver on the > > affected systems? If so, do performance numbers look bad after that > > too? > > Also, what cpufreq governor do you use with pcc-cpufreq? Ondemand governor. Which triggers a lot of PCC related platform calls. And as Peter noticed already the driver has a severe bottleneck (lock protecting shared memory used for all CPUs to pass data to/from platform for PCC calls). > Does changing it to something like "performance" improve things? With performance governor above mentioned bottleneck is no issue. On balance before this commit users could use pcc-cpufreq but had already suboptimal performance (compared to say intel_pstate driver which can be used changing BIOS options). Starting with this commit systems using pcc-cpufreq are unusable with high number of CPUs (top output above is for system with 120 CPUs). So should the driver be removed (sooner or later), or this behaviour be documented somewhere, or just leave it as is. Andreas
On Tue, Jul 17, 2018 at 10:36 AM, Andreas Herrmann <aherrmann@suse.com> wrote: > On Tue, Jul 17, 2018 at 09:33:53AM +0200, Rafael J. Wysocki wrote: >> Hi, >> >> Thanks for your report! >> >> On Tue, Jul 17, 2018 at 8:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote: >> > Hello, >> > >> > I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select >> > idle state before stopping the tick") causes severe performance drop >> > for systems using pcc-cpufreq driver. Depending on the number of CPUs >> > the system might be almost unusable. The OS jitter for 4.17.y and >> > 4.18.-rcx kernels is off the charts, you can even spot it with top >> > command (issued when the system is supposedly idle), e.g. >> > >> > top - 14:44:24 up 2 min, 1 user, load average: 90.11, 38.20, 14.38 >> > Tasks: 1199 total, 109 running, 541 sleeping, 0 stopped, 0 zombie >> > %Cpu(s): 1.2 us, 58.7 sy, 0.0 ni, 39.3 id, 0.6 wa, 0.0 hi, 0.3 si, 0.0 st >> > KiB Mem: 13137064+total, 1192168 used, 13017848+free, 2340 buffers >> > KiB Swap: 2104316 total, 0 used, 2104316 free. 522296 cached Mem >> > >> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >> > 3373 root 20 0 982024 49916 36120 R 96.691 0.038 0:19.54 kubelet >> > 67 root 20 0 0 0 0 R 78.676 0.000 0:49.36 kworker/9:0 >> > 25 root 20 0 0 0 0 R 78.125 0.000 0:49.67 kworker/2:0 >> > 182 root 20 0 0 0 0 R 75.735 0.000 1:18.17 kworker/28:0 >> > 43 root 20 0 0 0 0 R 75.000 0.000 0:11.56 kworker/5:0 >> > 103 root 20 0 0 0 0 R 74.449 0.000 0:46.83 kworker/15:0 >> > 334 root 20 0 0 0 0 R 72.978 0.000 1:06.88 kworker/53:0 >> > 789 root 20 0 0 0 0 R 69.853 0.000 1:29.50 kworker/38:1 >> > 418 root 20 0 0 0 0 R 69.301 0.000 0:41.33 kworker/67:0 >> > 779 root 20 0 0 0 0 R 68.934 0.000 1:33.60 kworker/27:1 >> > 773 root 20 0 0 0 0 R 68.566 0.000 1:37.91 kworker/22:1 >> > 762 root 20 0 0 0 0 R 68.015 0.000 1:41.01 kworker/11:1 >> > 769 root 20 0 0 0 0 R 67.647 0.000 1:37.65 kworker/18:1 >> > 805 root 20 0 0 0 0 R 67.096 0.000 1:30.96 kworker/54:1 >> > 840 root 20 0 0 0 0 R 66.912 0.000 1:23.82 kworker/89:1 >> > 812 root 20 0 0 0 0 R 66.728 0.000 1:31.89 kworker/59:1 >> > 847 root 20 0 0 0 0 R 66.360 0.000 1:28.40 kworker/96:1 >> > 763 root 20 0 0 0 0 R 66.176 0.000 1:42.57 kworker/12:1 >> > 772 root 20 0 0 0 0 R 66.176 0.000 1:12.58 kworker/21:1 >> > 821 root 20 0 0 0 0 R 66.176 0.000 1:29.62 kworker/69:1 >> > 923 root 20 0 0 0 0 R 65.809 0.000 1:44.32 kworker/3:18 >> > 1284 root 20 0 0 0 0 R 65.809 0.000 1:23.50 kworker/101:2 >> > 61 root 20 0 0 0 0 R 65.625 0.000 1:29.37 kworker/8:0 >> > 3531 root 20 0 24384 3768 2356 R 65.625 0.003 0:08.91 top >> > 771 root 20 0 0 0 0 R 65.074 0.000 1:37.90 kworker/20:1 >> > 767 root 20 0 0 0 0 R 64.706 0.000 1:38.01 kworker/16:1 >> > 764 root 20 0 0 0 0 R 64.522 0.000 1:40.28 kworker/13:1 >> > 765 root 20 0 0 0 0 R 64.154 0.000 1:40.13 kworker/14:1 >> > >> > When I apply below patch (trying to revert essential parts of commit >> > 554c8aa8ecad) behaviour seems back to normal. >> >> Well, that basically defeats the purpose of the change in commit >> 554c8aa8ecad, so it's not what I'd like to do to fix this problem. >> >> Also it would be good to understand what actually happens. >> >> > I know that pcc-cpufreq driver is not "state-of-the-art" when it comes >> > to cpufreq drivers and you better not use it. >> >> That's exactly right. >> >> > But I wonder whether commit 554c8aa8ecad ("sched: idle: Select idle state before >> > stopping the tick") introduced bad behaviour for other cases as well. >> >> It has been tested quite extensively in that respect, although >> admittedly not with the pcc-cpufreq driver. >> >> Nothing bad related to it has been has been reported so far, FWIW. >> >> > I'll send some performance results to illustrate the issue asap. I've >> > also tried to modify pcc-cpufreq to reduce the amount of frequency >> > changes triggered by this driver but this does not help for kernels >> > where commit 554c8aa8ecad is applied. >> >> Can you replace pcc-cpufreq with a different cpufreq driver on the >> affected systems? If so, do performance numbers look bad after that >> too? > > I have no performance numbers yet for other cpufreq drivers on this > system (checking this commit). But I'll look it at next. Thanks! Generally speaking, pcc-cpufreq is fundamentally not scalable, so the additional concurrency brought in by the commit in question may have exposed that weakness if that driver is run on a system with multiple logical CPUs.
On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote: > On Tue, Jul 17, 2018 at 10:03:41AM +0200, Rafael J. Wysocki wrote: >> On Tue, Jul 17, 2018 at 9:33 AM, Rafael J. Wysocki <rafael@kernel.org> wrote: >> > Hi, >> > >> > Thanks for your report! >> > >> > On Tue, Jul 17, 2018 at 8:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote: >> >> Hello, >> >> >> >> I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select >> >> idle state before stopping the tick") causes severe performance drop >> >> for systems using pcc-cpufreq driver. Depending on the number of CPUs >> >> the system might be almost unusable. The OS jitter for 4.17.y and >> >> 4.18.-rcx kernels is off the charts, you can even spot it with top >> >> command (issued when the system is supposedly idle), e.g. >> >> >> >> top - 14:44:24 up 2 min, 1 user, load average: 90.11, 38.20, 14.38 >> >> Tasks: 1199 total, 109 running, 541 sleeping, 0 stopped, 0 zombie >> >> %Cpu(s): 1.2 us, 58.7 sy, 0.0 ni, 39.3 id, 0.6 wa, 0.0 hi, 0.3 si, 0.0 st >> >> KiB Mem: 13137064+total, 1192168 used, 13017848+free, 2340 buffers >> >> KiB Swap: 2104316 total, 0 used, 2104316 free. 522296 cached Mem >> >> >> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >> >> 3373 root 20 0 982024 49916 36120 R 96.691 0.038 0:19.54 kubelet >> >> 67 root 20 0 0 0 0 R 78.676 0.000 0:49.36 kworker/9:0 >> >> 25 root 20 0 0 0 0 R 78.125 0.000 0:49.67 kworker/2:0 >> >> 182 root 20 0 0 0 0 R 75.735 0.000 1:18.17 kworker/28:0 >> >> 43 root 20 0 0 0 0 R 75.000 0.000 0:11.56 kworker/5:0 >> >> 103 root 20 0 0 0 0 R 74.449 0.000 0:46.83 kworker/15:0 >> >> 334 root 20 0 0 0 0 R 72.978 0.000 1:06.88 kworker/53:0 >> >> 789 root 20 0 0 0 0 R 69.853 0.000 1:29.50 kworker/38:1 >> >> 418 root 20 0 0 0 0 R 69.301 0.000 0:41.33 kworker/67:0 >> >> 779 root 20 0 0 0 0 R 68.934 0.000 1:33.60 kworker/27:1 >> >> 773 root 20 0 0 0 0 R 68.566 0.000 1:37.91 kworker/22:1 >> >> 762 root 20 0 0 0 0 R 68.015 0.000 1:41.01 kworker/11:1 >> >> 769 root 20 0 0 0 0 R 67.647 0.000 1:37.65 kworker/18:1 >> >> 805 root 20 0 0 0 0 R 67.096 0.000 1:30.96 kworker/54:1 >> >> 840 root 20 0 0 0 0 R 66.912 0.000 1:23.82 kworker/89:1 >> >> 812 root 20 0 0 0 0 R 66.728 0.000 1:31.89 kworker/59:1 >> >> 847 root 20 0 0 0 0 R 66.360 0.000 1:28.40 kworker/96:1 >> >> 763 root 20 0 0 0 0 R 66.176 0.000 1:42.57 kworker/12:1 >> >> 772 root 20 0 0 0 0 R 66.176 0.000 1:12.58 kworker/21:1 >> >> 821 root 20 0 0 0 0 R 66.176 0.000 1:29.62 kworker/69:1 >> >> 923 root 20 0 0 0 0 R 65.809 0.000 1:44.32 kworker/3:18 >> >> 1284 root 20 0 0 0 0 R 65.809 0.000 1:23.50 kworker/101:2 >> >> 61 root 20 0 0 0 0 R 65.625 0.000 1:29.37 kworker/8:0 >> >> 3531 root 20 0 24384 3768 2356 R 65.625 0.003 0:08.91 top >> >> 771 root 20 0 0 0 0 R 65.074 0.000 1:37.90 kworker/20:1 >> >> 767 root 20 0 0 0 0 R 64.706 0.000 1:38.01 kworker/16:1 >> >> 764 root 20 0 0 0 0 R 64.522 0.000 1:40.28 kworker/13:1 >> >> 765 root 20 0 0 0 0 R 64.154 0.000 1:40.13 kworker/14:1 >> >> >> >> When I apply below patch (trying to revert essential parts of commit >> >> 554c8aa8ecad) behaviour seems back to normal. >> > >> > Well, that basically defeats the purpose of the change in commit >> > 554c8aa8ecad, so it's not what I'd like to do to fix this problem. >> > >> > Also it would be good to understand what actually happens. >> > >> >> I know that pcc-cpufreq driver is not "state-of-the-art" when it comes >> >> to cpufreq drivers and you better not use it. >> > >> > That's exactly right. >> > >> >> But I wonder whether commit 554c8aa8ecad ("sched: idle: Select idle state before >> >> stopping the tick") introduced bad behaviour for other cases as well. >> > >> > It has been tested quite extensively in that respect, although >> > admittedly not with the pcc-cpufreq driver. >> > >> > Nothing bad related to it has been has been reported so far, FWIW. >> > >> >> I'll send some performance results to illustrate the issue asap. I've >> >> also tried to modify pcc-cpufreq to reduce the amount of frequency >> >> changes triggered by this driver but this does not help for kernels >> >> where commit 554c8aa8ecad is applied. >> > >> > Can you replace pcc-cpufreq with a different cpufreq driver on the >> > affected systems? If so, do performance numbers look bad after that >> > too? >> >> Also, what cpufreq governor do you use with pcc-cpufreq? > > Ondemand governor. Which triggers a lot of PCC related platform calls. > And as Peter noticed already the driver has a severe bottleneck (lock > protecting shared memory used for all CPUs to pass data to/from > platform for PCC calls). > >> Does changing it to something like "performance" improve things? > > With performance governor above mentioned bottleneck is no issue. OK > On balance before this commit users could use pcc-cpufreq but had > already suboptimal performance (compared to say intel_pstate driver > which can be used changing BIOS options). Starting with this commit > systems using pcc-cpufreq are unusable with high number of CPUs (top > output above is for system with 120 CPUs). I see. :-) > So should the driver be removed (sooner or later), or this behaviour > be documented somewhere, or just leave it as is. At least it should be documented in the driver that it is not scalable and not for use on many-CPU systems (with "many" meaning anything greater than 4 probably). Or we could just make the driver not load if the number of CPUs in the system is greater than 4 or similar.
On Tue, Jul 17, 2018 at 10:15:07AM +0200, Peter Zijlstra wrote: > On Tue, Jul 17, 2018 at 08:50:48AM +0200, Andreas Herrmann wrote: > > I've recently noticed that commit 554c8aa8ecad ("sched: idle: Select > > idle state before stopping the tick") causes severe performance drop > > for systems using pcc-cpufreq driver. Depending on the number of CPUs > > the system might be almost unusable. > > Have you looked at that driver? It features a global lock which is taken > for all cpufreq operations. Once you get past a handful of cpus that > will constrict your system exactly as you say. Yes, I figured that already and you better use another cpufreq driver on those systems. (But that such systems with this driver are almost unusable is a new feature ;-) > What kind of machine are you running this on, and can you configure the > thing to use anything else? Its an older system. HP ProLiant DL580 Gen8, CPUs are Intel(R) Xeon(R) CPU E7-4890 4 nodes, With HT enabled 120 logical CPUs. Yes you can configure it using something else. For the record. On this and similar systems pcc-cpufreq driver is used when BIOS setting for power management option "Power Regulator" is set to "Dynamic Power Savings Mode" AFAIK this is the default setting for this option. When changing this setting to "OS Control Mode" intel_pstate driver should be loaded. > If so, do so and never look back. Ok, I am fine with this. Thanks, Andreas
On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote: [cut] > > On balance before this commit users could use pcc-cpufreq but had > already suboptimal performance (compared to say intel_pstate driver > which can be used changing BIOS options). BTW, I wonder why you need to change the BIOS options for intel_pstate to load. It should be initialized before pcc-cpufreq (according to their respective initcall levels), so in theory intel_pstate should be used by default on the affected systems anyway. What BIOS settings need to be changed for that?
On Tue, Jul 17, 2018 at 11:06:29AM +0200, Rafael J. Wysocki wrote: > On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote: > > [cut] > > > > > On balance before this commit users could use pcc-cpufreq but had > > already suboptimal performance (compared to say intel_pstate driver > > which can be used changing BIOS options). > > BTW, I wonder why you need to change the BIOS options for intel_pstate to load. I think this is because of (in intel_pstate_init()): /* * The Intel pstate driver will be ignored if the platform * firmware has its own power management modes. */ if (intel_pstate_platform_pwr_mgmt_exists()) return -ENODEV; > It should be initialized before pcc-cpufreq (according to their > respective initcall levels), so in theory intel_pstate should be used > by default on the affected systems anyway. > What BIOS settings need to be changed for that? Already answered in other mail. Andreas
On Tue, Jul 17, 2018 at 11:11 AM, Andreas Herrmann <aherrmann@suse.com> wrote: > On Tue, Jul 17, 2018 at 11:06:29AM +0200, Rafael J. Wysocki wrote: >> On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote: >> >> [cut] >> >> > >> > On balance before this commit users could use pcc-cpufreq but had >> > already suboptimal performance (compared to say intel_pstate driver >> > which can be used changing BIOS options). >> >> BTW, I wonder why you need to change the BIOS options for intel_pstate to load. > > I think this is because of (in intel_pstate_init()): > > /* > * The Intel pstate driver will be ignored if the platform > * firmware has its own power management modes. > */ > if (intel_pstate_platform_pwr_mgmt_exists()) > return -ENODEV; > OK, because of the "Proliant" entry, right? So it looks like we have an issue there. We find the entry and we look for _PSS. It is not there, so we assume that the firmware is expected to control performance, which is not the case. It looks like we also should look for the presence of the PCC interface in there. I can provide a patch for that, will you be able to test it? >> It should be initialized before pcc-cpufreq (according to their >> respective initcall levels), so in theory intel_pstate should be used >> by default on the affected systems anyway. > >> What BIOS settings need to be changed for that? > > Already answered in other mail. Indeed.
On Tue, Jul 17, 2018 at 11:23:25AM +0200, Rafael J. Wysocki wrote: > On Tue, Jul 17, 2018 at 11:11 AM, Andreas Herrmann <aherrmann@suse.com> wrote: > > On Tue, Jul 17, 2018 at 11:06:29AM +0200, Rafael J. Wysocki wrote: > >> On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote: > >> > >> [cut] > >> > >> > > >> > On balance before this commit users could use pcc-cpufreq but had > >> > already suboptimal performance (compared to say intel_pstate driver > >> > which can be used changing BIOS options). > >> > >> BTW, I wonder why you need to change the BIOS options for intel_pstate to load. > > > > I think this is because of (in intel_pstate_init()): > > > > /* > > * The Intel pstate driver will be ignored if the platform > > * firmware has its own power management modes. > > */ > > if (intel_pstate_platform_pwr_mgmt_exists()) > > return -ENODEV; > > > > OK, because of the "Proliant" entry, right? > > So it looks like we have an issue there. We find the entry and we > look for _PSS. It is not there, so we assume that the firmware is > expected to control performance, which is not the case. It looks like > we also should look for the presence of the PCC interface in there. > > I can provide a patch for that, will you be able to test it? Yes, I can test it. > >> It should be initialized before pcc-cpufreq (according to their > >> respective initcall levels), so in theory intel_pstate should be used > >> by default on the affected systems anyway. > > > >> What BIOS settings need to be changed for that? > > > > Already answered in other mail. > > Indeed. Andreas
On Tue, Jul 17, 2018 at 11:27:21AM +0200, Andreas Herrmann wrote: > On Tue, Jul 17, 2018 at 11:23:25AM +0200, Rafael J. Wysocki wrote: > > On Tue, Jul 17, 2018 at 11:11 AM, Andreas Herrmann <aherrmann@suse.com> wrote: > > > On Tue, Jul 17, 2018 at 11:06:29AM +0200, Rafael J. Wysocki wrote: > > >> On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote: > > >> > > >> [cut] > > >> > > >> > > > >> > On balance before this commit users could use pcc-cpufreq but had > > >> > already suboptimal performance (compared to say intel_pstate driver > > >> > which can be used changing BIOS options). > > >> > > >> BTW, I wonder why you need to change the BIOS options for intel_pstate to load. > > > > > > I think this is because of (in intel_pstate_init()): > > > > > > /* > > > * The Intel pstate driver will be ignored if the platform > > > * firmware has its own power management modes. > > > */ > > > if (intel_pstate_platform_pwr_mgmt_exists()) > > > return -ENODEV; > > > > > > > OK, because of the "Proliant" entry, right? > > > > So it looks like we have an issue there. We find the entry and we > > look for _PSS. It is not there, so we assume that the firmware is > > expected to control performance, which is not the case. FYI, there is another BIOS setting on those systems. It's called "Collaborative Power Control" (AFAIK enabled by default). Only if this is disabled, firmware is (alone) in control of performance. (And of course in this case neither pcc-cpufreq nor intel_pstate will be loaded). > > It looks like we also should look for the presence of the PCC > > interface in there. > > > > I can provide a patch for that, will you be able to test it? > > Yes, I can test it. > > > >> It should be initialized before pcc-cpufreq (according to their > > >> respective initcall levels), so in theory intel_pstate should be used > > >> by default on the affected systems anyway. > > > > > >> What BIOS settings need to be changed for that? > > > > > > Already answered in other mail. > > > > Indeed. > > > Andreas >
On Tue, Jul 17, 2018 at 11:36:20AM +0200, Andreas Herrmann wrote: > On Tue, Jul 17, 2018 at 11:27:21AM +0200, Andreas Herrmann wrote: > > On Tue, Jul 17, 2018 at 11:23:25AM +0200, Rafael J. Wysocki wrote: > > > On Tue, Jul 17, 2018 at 11:11 AM, Andreas Herrmann <aherrmann@suse.com> wrote: > > > > On Tue, Jul 17, 2018 at 11:06:29AM +0200, Rafael J. Wysocki wrote: > > > >> On Tue, Jul 17, 2018 at 10:50 AM, Andreas Herrmann <aherrmann@suse.com> wrote: > > > >> > > > >> [cut] > > > >> > > > >> > > > > >> > On balance before this commit users could use pcc-cpufreq but had > > > >> > already suboptimal performance (compared to say intel_pstate driver > > > >> > which can be used changing BIOS options). > > > >> > > > >> BTW, I wonder why you need to change the BIOS options for intel_pstate to load. > > > > > > > > I think this is because of (in intel_pstate_init()): > > > > > > > > /* > > > > * The Intel pstate driver will be ignored if the platform > > > > * firmware has its own power management modes. > > > > */ > > > > if (intel_pstate_platform_pwr_mgmt_exists()) > > > > return -ENODEV; > > > > > > > > > > OK, because of the "Proliant" entry, right? > > > > > > So it looks like we have an issue there. We find the entry and we > > > look for _PSS. It is not there, so we assume that the firmware is > > > expected to control performance, which is not the case. > FYI, there is another BIOS setting on those systems. It's called > "Collaborative Power Control" (AFAIK enabled by default). > > Only if this is disabled, firmware is (alone) in control of > performance. (And of course in this case neither pcc-cpufreq nor > intel_pstate will be loaded). To clarify: (i) When "Dynamic Power Savings Mode" is enabled _PSS objects are missing (in fact they seem to be renamed to "XPSS"). pcc-cpufreq driver evaluates PCCH header and loads. Same when "Collaborative Power Control" is disabled. No _PSS but XPSS objects. pcc-cpufreq driver fails to evaluate PCCH header, no cpufreq driver is loaded. (ii) There are _PSS objects when "OS Control Mode" is selected. (But I think when "Collaborative Power Control" is disabled there are no _PSS objects either and the "OS Control Mode" selection does not matter.) > > > It looks like we also should look for the presence of the PCC > > > interface in there. > > > > > > I can provide a patch for that, will you be able to test it? > > > > Yes, I can test it. > > > > > >> It should be initialized before pcc-cpufreq (according to their > > > >> respective initcall levels), so in theory intel_pstate should be used > > > >> by default on the affected systems anyway. > > > > > > > >> What BIOS settings need to be changed for that? > > > > > > > > Already answered in other mail. > > > > > > Indeed. > > > > > > Andreas > >
I think I still owe some performance numbers to show what is wrong with systems using pcc-cpufreq with Linux after commit 554c8aa8ecad. Following are results for kernbench tests (from MMTests test suite). That's just a kernel compile with different number of compile jobs. As result the time is measured, 5 runs are done for each configuration and average values calculated. I've restricted maximum number of jobs to 30. That means that tests were done for 2, 4, 8, 16, and 30 compile jobs. I had bound all tests to node 0. (I've used something like "numactl -N 0 ./run-mmtests.sh --run-monitor <test_name>" to start those tests.) Tests were done with kernel 4.18.0-rc3 on an HP DL580 Gen8 with Intel Xeon CPU E7-4890 with latest BIOS installed. System had 4 nodes, 15 CPUs per node (30 logical CPUs with HT enabled). pcc-cpufreq was active and ondemand governor in use. I've tested with different number of online CPUs which better illustrates how idle online CPUs interfere with compile load on node 0 (due to the jitter caused by pcc-cpufreq and its locking). Average mean for user/system/elapsed time and standard deviation for each subtest (=number of compile jobs) are as follows: (Nodes) N0 N01 N0 N01 N0123 (CPUs) 15CPUs 30CPUs 30CPUs 60CPUs 120CPUs Amean user-2 640.82 (0.00%) 675.90 (-5.47%) 789.03 (-23.13%) 1448.58 (-126.05%) 3575.79 (-458.01%) Amean user-4 652.18 (0.00%) 689.12 (-5.67%) 868.19 (-33.12%) 1846.66 (-183.15%) 5437.37 (-733.73%) Amean user-8 695.00 (0.00%) 732.22 (-5.35%) 1138.30 (-63.78%) 2598.74 (-273.92%) 7413.43 (-966.67%) Amean user-16 653.94 (0.00%) 772.48 (-18.13%) 1734.80 (-165.29%) 2699.65 (-312.83%) 9224.47 (-1310.61%) Amean user-30 634.91 (0.00%) 701.11 (-10.43%) 1197.37 (-88.59%) 1360.02 (-114.21%) 3732.34 (-487.85%) Amean syst-2 235.45 (0.00%) 235.68 (-0.10%) 321.99 (-36.76%) 574.44 (-143.98%) 869.35 (-269.23%) Amean syst-4 239.34 (0.00%) 243.09 (-1.57%) 345.07 (-44.18%) 621.00 (-159.47%) 1145.13 (-378.46%) Amean syst-8 246.51 (0.00%) 254.83 (-3.37%) 387.49 (-57.19%) 786.63 (-219.10%) 1406.17 (-470.42%) Amean syst-16 110.85 (0.00%) 122.21 (-10.25%) 408.25 (-268.31%) 644.41 (-481.36%) 1513.04 (-1264.99%) Amean syst-30 82.74 (0.00%) 94.07 (-13.69%) 155.38 (-87.80%) 207.03 (-150.22%) 547.73 (-562.01%) Amean elsp-2 625.33 (0.00%) 724.51 (-15.86%) 792.47 (-26.73%) 1537.44 (-145.86%) 3510.22 (-461.34%) Amean elsp-4 482.02 (0.00%) 568.26 (-17.89%) 670.26 (-39.05%) 1257.34 (-160.85%) 3120.89 (-547.46%) Amean elsp-8 267.75 (0.00%) 337.88 (-26.19%) 430.56 (-60.80%) 978.47 (-265.44%) 2321.91 (-767.18%) Amean elsp-16 63.55 (0.00%) 71.79 (-12.97%) 224.83 (-253.79%) 403.94 (-535.65%) 1121.04 (-1664.09%) Amean elsp-30 56.76 (0.00%) 62.82 (-10.69%) 66.50 (-17.16%) 124.20 (-118.84%) 303.47 (-434.70%) Stddev user-2 1.36 (0.00%) 1.94 (-42.57%) 16.17 (-1090.46%) 119.09 (-8669.75%) 382.74 (-28085.60%) Stddev user-4 2.81 (0.00%) 5.08 (-80.78%) 4.88 (-73.66%) 252.56 (-8881.80%) 1133.02 (-40193.16%) Stddev user-8 2.30 (0.00%) 15.58 (-578.28%) 30.60 (-1232.63%) 279.35 (-12064.01%) 1050.00 (-45621.61%) Stddev user-16 6.76 (0.00%) 25.52 (-277.80%) 78.44 (-1060.97%) 118.29 (-1650.94%) 724.11 (-10617.95%) Stddev user-30 0.51 (0.00%) 1.80 (-249.13%) 12.63 (-2354.11%) 25.82 (-4915.43%) 1098.82 (-213365.28%) Stddev syst-2 1.52 (0.00%) 2.76 (-81.04%) 3.98 (-161.58%) 36.35 (-2287.16%) 59.09 (-3781.09%) Stddev syst-4 2.39 (0.00%) 1.55 (35.25%) 3.24 ( -35.92%) 51.51 (-2057.65%) 175.75 (-7262.43%) Stddev syst-8 1.08 (0.00%) 3.70 (-241.40%) 6.83 (-531.33%) 65.80 (-5977.97%) 151.17 (-13864.10%) Stddev syst-16 3.78 (0.00%) 5.58 (-47.53%) 4.63 ( -22.44%) 47.90 (-1167.18%) 99.94 (-2543.88%) Stddev syst-30 0.31 (0.00%) 0.38 (-22.41%) 3.01 (-862.79%) 27.45 (-8688.85%) 137.94 (-44072.77%) Stddev elsp-2 55.14 (0.00%) 55.04 (0.18%) 95.33 ( -72.90%) 103.91 (-88.45%) 302.31 (-448.29%) Stddev elsp-4 60.90 (0.00%) 84.42 (-38.62%) 18.92 ( 68.94%) 197.60 (-224.46%) 323.53 (-431.24%) Stddev elsp-8 16.77 (0.00%) 30.77 (-83.47%) 49.57 (-195.57%) 79.02 (-371.16%) 261.85 (-1461.28%) Stddev elsp-16 1.99 (0.00%) 2.88 (-44.60%) 28.11 (-1311.79%) 101.81 (-5012.88%) 62.29 (-3028.36%) Stddev elsp-30 0.65 (0.00%) 1.04 (-59.06%) 1.64 (-151.81%) 41.84 (-6308.81%) 75.37 (-11445.61%) Overall test time for each mmtests invocation was as follows (this is also given for number-of-cpu configs for which I did not provide details above). N0 N01 N0 N012 N0123 N01 N0123 N0123 N012 N0123 N0123 15CPUs 30CPUs 30CPUs 45CPUs 60CPUs 60CPUs 75CPUs 90CPUs 90CPUs 105CPUs 120CPUs User 17196.67 18714.36 30105.65 19239.27 19505.35 53089.39 22690.33 26731.06 38131.74 47627.61 153424.99 System 4807.98 4970.89 8533.95 5136.97 5184.24 16351.67 6135.29 7152.66 10920.76 12362.39 32129.74 Elapsed 7796.46 9166.55 11518.51 9274.77 9030.39 25465.38 9361.60 10677.63 15633.49 18900.46 60908.28 The results given for 120 online CPUs on nodes 0-3 indicate what I meant with the "system being almost unusable". When trying to gather results with kernel 4.17.5 and 120 CPUs, one iteration of kernbench (1 kernel compile) with 2 jobs even took about 6 hours. Maybe it was an extreme outlier but I dismissed to further use that kernel (w/o modifications) for further tests. Andreas
On Wed, Jul 18, 2018 at 05:25:56PM +0200, Andreas Herrmann wrote:
...
> Average mean for user/system/elapsed time and standard deviation for each
Oops, of course I meant arithmetic mean.
...
For the sake of completeness following are given the remaining sets of kernbench results related to this thread. Setup for kernbench test is as described in previous mails but now all 120 logical CPUs were online in all tests. Test runs were still pinned to node 0. Common legend for below tables is: OSCM: "OS Control Mode" DPSM: "Dynamic Power Savings Mode" idle_rb: partial rollback of 554c8aa8ecad ("sched: idle: Select idle state before stopping the tick") as described in initial mail of this thread (A) intel_pstate (in powersave mode) performance wrt effect of commit 554c8aa8ecad and wrt to potential interference from platform code Kernel v4.18-rc5-36-g30b06abfb92b + patch for intel_pstate to load it instead of pcc-cpufreq when system is in DPSM. Detailed results for each number of compile jobs: (OSCM is baseline, values in parenthesis show comparison to baseline) OSCM OSCM DPSM DPSM idle_rb idle_rb Amean user-2 600.58 596.38 ( 0.70%) 685.94 ( -14.21%) 688.78 ( -14.69%) Amean user-4 583.90 586.34 ( -0.42%) 626.37 ( -7.27%) 622.17 ( -6.55%) Amean user-8 584.78 581.52 ( 0.56%) 600.89 ( -2.75%) 595.53 ( -1.84%) Amean user-16 705.07 688.62 ( 2.33%) 705.16 ( -0.01%) 682.44 ( 3.21%) Amean user-30 1017.25 1022.39 ( -0.51%) 1025.23 ( -0.78%) 1022.61 ( -0.53%) Amean syst-2 172.17 174.08 ( -1.11%) 184.73 ( -7.30%) 186.13 ( -8.11%) Amean syst-4 183.88 180.44 ( 1.87%) 191.70 ( -4.25%) 192.24 ( -4.54%) Amean syst-8 193.40 193.81 ( -0.21%) 198.01 ( -2.38%) 193.96 ( -0.29%) Amean syst-16 183.97 180.40 ( 1.94%) 184.00 ( -0.01%) 182.10 ( 1.02%) Amean syst-30 122.36 122.08 ( 0.23%) 122.53 ( -0.14%) 122.17 ( 0.15%) Amean elsp-2 610.90 634.64 ( -3.89%) 667.67 ( -9.29%) 661.81 ( -8.33%) Amean elsp-4 413.54 488.02 ( -18.01%) 433.79 ( -4.90%) 407.30 ( 1.51%) Amean elsp-8 261.85 218.25 ( 16.65%) 246.62 ( 5.82%) 219.55 ( 16.15%) Amean elsp-16 89.27 99.36 ( -11.30%) 92.74 ( -3.89%) 102.74 ( -15.09%) Amean elsp-30 47.07 47.04 ( 0.08%) 48.82 ( -3.72%) 48.28 ( -2.57%) Stddev user-2 6.06 7.53 ( -24.21%) 31.88 (-425.98%) 25.79 (-325.57%) Stddev user-4 7.05 14.48 (-105.40%) 11.82 ( -67.63%) 12.14 ( -72.22%) Stddev user-8 5.69 1.18 ( 79.28%) 18.75 (-229.45%) 7.03 ( -23.51%) Stddev user-16 6.41 15.74 (-145.55%) 12.87 (-100.75%) 10.59 ( -65.19%) Stddev user-30 2.62 2.80 ( -6.56%) 2.92 ( -11.31%) 2.45 ( 6.52%) Stddev syst-2 3.48 2.81 ( 19.28%) 2.27 ( 34.73%) 1.47 ( 57.83%) Stddev syst-4 4.04 4.69 ( -16.03%) 2.16 ( 46.42%) 0.84 ( 79.32%) Stddev syst-8 3.96 1.42 ( 64.11%) 2.34 ( 40.98%) 1.93 ( 51.24%) Stddev syst-16 2.01 2.33 ( -15.76%) 1.33 ( 33.89%) 1.94 ( 3.74%) Stddev syst-30 0.76 0.38 ( 50.10%) 0.91 ( -19.48%) 0.17 ( 77.86%) Stddev elsp-2 44.55 58.37 ( -31.01%) 110.11 (-147.15%) 82.81 ( -85.88%) Stddev elsp-4 62.39 109.75 ( -75.90%) 48.32 ( 22.56%) 47.10 ( 24.52%) Stddev elsp-8 59.01 25.95 ( 56.02%) 71.44 ( -21.07%) 37.83 ( 35.89%) Stddev elsp-16 10.47 23.88 (-128.08%) 11.98 ( -14.41%) 15.42 ( -47.32%) Stddev elsp-30 0.26 0.64 (-142.06%) 0.39 ( -46.53%) 0.44 ( -66.71%) Overall test time: OSCM OSCM DPSM DPSM idle_rb idle_rb User 18681.59 18599.99 19450.38 19289.33 System 4487.76 4458.55 4620.80 4595.13 Elapsed 7407.07 7725.86 7765.91 7502.72 Overall test run-time is comparable. Commit 554c8aa8ecad does not seem to have a significant impact on performance (I don't have numbers for power consumption). Comparing OSCM vs. DPSM: it seems that its better to switch system into OSCM. (B) performance of intel_pstate (in powersave mode and system in DPSM) vs. pcc-cpufreq (with ondemand governor) Results for pcc-cpufreq were obtained with v4.17.5+misc modifications. intel_pstate results were obtained with v4.18-rc5-36-g30b06abfb92b + patch for intel_pstate to load it instead of pcc-cpufreq when system is in DPSM. So strictly speaking this is no correct comparison but at least it gives an idea where the limits are with pcc-cpufreq and why its better to just switch to intel_pstate. pcc-cpufreq driver modifications were freqtable: pcc-cpufreq modified to use fixed table of 4 frequencies deadband: pcc-cpufreq modified to re-introduce so called deadband effect which keeps CPU at minimum frequency if target frequency would be in the calculated deadband intel_pstate pcc-cpufreq pcc-cpufreq pcc-cpufreq DPSM idle_rb idle_rb+freqtable idle_rb+deadband Amean user-2 685.94 834.15 ( -21.61%) 648.68 ( 5.43%) 636.63 ( 7.19%) Amean user-4 626.37 902.09 ( -44.02%) 657.43 ( -4.96%) 615.49 ( 1.74%) Amean user-8 600.89 1078.37 ( -79.46%) 723.05 ( -20.33%) 646.23 ( -7.55%) Amean user-16 705.16 1640.89 (-132.70%) 1096.61 ( -55.51%) 904.17 ( -28.22%) Amean user-30 1025.23 1463.90 ( -42.79%) 1156.17 ( -12.77%) 1151.40 ( -12.31%) Amean syst-2 184.73 232.17 ( -25.68%) 178.24 ( 3.51%) 172.09 ( 6.84%) Amean syst-4 191.70 257.22 ( -34.18%) 194.16 ( -1.29%) 188.10 ( 1.88%) Amean syst-8 198.01 313.67 ( -58.41%) 228.34 ( -15.31%) 206.99 ( -4.53%) Amean syst-16 184.00 393.92 (-114.09%) 279.89 ( -52.12%) 241.83 ( -31.43%) Amean syst-30 122.53 185.98 ( -51.79%) 143.28 ( -16.94%) 140.45 ( -14.62%) Amean elsp-2 667.67 769.28 ( -15.22%) 635.68 ( 4.79%) 651.51 ( 2.42%) Amean elsp-4 433.79 614.27 ( -41.60%) 440.45 ( -1.53%) 392.80 ( 9.45%) Amean elsp-8 246.62 397.54 ( -61.19%) 252.27 ( -2.29%) 239.21 ( 3.01%) Amean elsp-16 92.74 207.43 (-123.68%) 138.00 ( -48.81%) 119.98 ( -29.37%) Amean elsp-30 48.82 72.66 ( -48.83%) 55.95 ( -14.60%) 54.32 ( -11.27%) Stddev user-2 31.88 15.22 ( 52.26%) 7.77 ( 75.63%) 6.63 ( 79.21%) Stddev user-4 11.82 32.20 (-172.49%) 3.37 ( 71.44%) 6.44 ( 45.49%) Stddev user-8 18.75 33.99 ( -81.29%) 6.96 ( 62.86%) 5.82 ( 68.97%) Stddev user-16 12.87 70.72 (-449.46%) 31.19 (-142.30%) 28.88 (-124.40%) Stddev user-30 2.92 26.08 (-792.64%) 6.16 (-110.99%) 10.90 (-273.16%) Stddev syst-2 2.27 4.44 ( -95.54%) 4.15 ( -82.48%) 2.09 ( 8.11%) Stddev syst-4 2.16 8.46 (-290.74%) 3.71 ( -71.58%) 2.45 ( -12.99%) Stddev syst-8 2.34 10.73 (-359.70%) 3.98 ( -70.62%) 4.39 ( -87.80%) Stddev syst-16 1.33 11.44 (-759.46%) 2.14 ( -60.49%) 2.93 (-120.24%) Stddev syst-30 0.91 4.88 (-436.79%) 1.37 ( -50.11%) 2.36 (-159.71%) Stddev elsp-2 110.11 85.53 ( 22.32%) 87.11 ( 20.89%) 37.33 ( 66.10%) Stddev elsp-4 48.32 130.17 (-169.39%) 59.81 ( -23.79%) 26.15 ( 45.88%) Stddev elsp-8 71.44 86.47 ( -21.03%) 12.87 ( 81.98%) 43.88 ( 38.58%) Stddev elsp-16 11.98 13.63 ( -13.82%) 8.94 ( 25.35%) 5.97 ( 50.15%) Stddev elsp-30 0.39 2.64 (-582.23%) 0.62 ( -58.97%) 0.95 (-144.47%) intel_pstate pcc-cpufreq pcc-cpufreq pcc-cpufreq DPSM idle_rb idle_rb+ idle_rb+ freqtable deadband User 19450.38 31273.96 22689.14 21050.35 System 4620.80 7327.67 5364.63 4984.36 Elapsed 7765.91 10997.49 7935.53 7593.74 Again I have no numbers for power consumption. Note that I've stopped an attempt to collect results for pcc-cpufreq with unmodififed v4.17.5 (ie. w/o idle_rb) after the first iteration (compiling kernel with 2 jobs) took several hours. Andreas
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index 1a3e9bddd17b..377a62ec475c 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -185,18 +185,13 @@ static void cpuidle_idle_call(void) } else { bool stop_tick = true; + tick_nohz_idle_stop_tick(); + rcu_idle_enter(); /* * Ask the cpuidle framework to choose a convenient idle state. */ next_state = cpuidle_select(drv, dev, &stop_tick); - if (stop_tick) - tick_nohz_idle_stop_tick(); - else - tick_nohz_idle_retain_tick(); - - rcu_idle_enter(); - entered_state = call_cpuidle(drv, dev, next_state); /* * Give the governor an opportunity to reflect on the outcome