Message ID: 1654227.8mz0SueHsU@kreacher (mailing list archive)
Series: PM: QoS: Get rid of unuseful code and rework CPU latency QoS interface
On Wed, 12 Feb 2020 at 00:39, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > Hi All, > > This series of patches is based on the observation that after commit > c3082a674f46 ("PM: QoS: Get rid of unused flags") the only global PM QoS class > in use is PM_QOS_CPU_DMA_LATENCY, but there is still a significant amount of > code dedicated to the handling of global PM QoS classes in general. That code > takes up space and adds overhead in vain, so it is better to get rid of it. > > Moreover, with that unuseful code removed, the interface for adding QoS > requests for CPU latency becomes inelegant and confusing, so it is better to > clean it up. > > Patches [01/28-12/28] do the first part described above, which also includes > some assorted cleanups of the core PM QoS code that doesn't go away. > > Patches [13/28-25/28] rework the CPU latency QoS interface (in the classic > "define stubs, migrate users, change the API proper" manner), patches > [26-27/28] update the general comments and documentation to match the code > after the previous changes and the last one makes the CPU latency QoS depend > on CPU_IDLE (because cpuidle is the only user of its target value today). > > The majority of the patches in this series don't change the functionality of > the code at all (at least not intentionally). > > Please refer to the changelogs of individual patches for details. > > Thanks! A big thanks for cleaning this up! The PM_QOS_CPU_DMA_LATENCY and friends, has been annoying me for a long time. This certainly makes the code far better and more understandable! I have looked through the series and couldn't find any obvious mistakes, so feel free to add: Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Note, the review tag also means, that's fine for you to pick the mmc patch via your tree. Kind regards Uffe
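For readers who have not followed the series, the rework described in the cover letter above essentially replaces the global-class request calls with a dedicated CPU latency QoS API. A minimal driver-side sketch of the before/after; the function names and the 20 us value are made up for illustration:

#include <linux/pm_qos.h>

static struct pm_qos_request example_req;

static void example_start_low_latency_io(void)
{
	/*
	 * Before the series (global PM QoS class selected by an ID):
	 *   pm_qos_add_request(&example_req, PM_QOS_CPU_DMA_LATENCY, 20);
	 */
	cpu_latency_qos_add_request(&example_req, 20);	/* limit: 20 us */
}

static void example_stop_low_latency_io(void)
{
	/* Before: pm_qos_remove_request(&example_req); */
	cpu_latency_qos_remove_request(&example_req);
}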
On Wed, Feb 12, 2020 at 9:38 AM Ulf Hansson <ulf.hansson@linaro.org> wrote: > > On Wed, 12 Feb 2020 at 00:39, Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > > > Hi All, > > > > This series of patches is based on the observation that after commit > > c3082a674f46 ("PM: QoS: Get rid of unused flags") the only global PM QoS class > > in use is PM_QOS_CPU_DMA_LATENCY, but there is still a significant amount of > > code dedicated to the handling of global PM QoS classes in general. That code > > takes up space and adds overhead in vain, so it is better to get rid of it. > > > > Moreover, with that unuseful code removed, the interface for adding QoS > > requests for CPU latency becomes inelegant and confusing, so it is better to > > clean it up. > > > > Patches [01/28-12/28] do the first part described above, which also includes > > some assorted cleanups of the core PM QoS code that doesn't go away. > > > > Patches [13/28-25/28] rework the CPU latency QoS interface (in the classic > > "define stubs, migrate users, change the API proper" manner), patches > > [26-27/28] update the general comments and documentation to match the code > > after the previous changes and the last one makes the CPU latency QoS depend > > on CPU_IDLE (because cpuidle is the only user of its target value today). > > > > The majority of the patches in this series don't change the functionality of > > the code at all (at least not intentionally). > > > > Please refer to the changelogs of individual patches for details. > > > > Thanks! > > A big thanks for cleaning this up! The PM_QOS_CPU_DMA_LATENCY and > friends, has been annoying me for a long time. This certainly makes > the code far better and more understandable! > > I have looked through the series and couldn't find any obvious > mistakes, so feel free to add: > > Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Thanks for the review, much appreciated! > Note, the review tag also means, that's fine for you to pick the mmc > patch via your tree. Thank you!
On Wed, Feb 12, 2020 at 12:39 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > Hi All, > > This series of patches is based on the observation that after commit > c3082a674f46 ("PM: QoS: Get rid of unused flags") the only global PM QoS class > in use is PM_QOS_CPU_DMA_LATENCY, but there is still a significant amount of > code dedicated to the handling of global PM QoS classes in general. That code > takes up space and adds overhead in vain, so it is better to get rid of it. > > Moreover, with that unuseful code removed, the interface for adding QoS > requests for CPU latency becomes inelegant and confusing, so it is better to > clean it up. > > Patches [01/28-12/28] do the first part described above, which also includes > some assorted cleanups of the core PM QoS code that doesn't go away. > > Patches [13/28-25/28] rework the CPU latency QoS interface (in the classic > "define stubs, migrate users, change the API proper" manner), patches > [26-27/28] update the general comments and documentation to match the code > after the previous changes and the last one makes the CPU latency QoS depend > on CPU_IDLE (because cpuidle is the only user of its target value today). > > The majority of the patches in this series don't change the functionality of > the code at all (at least not intentionally). > > Please refer to the changelogs of individual patches for details. > > Thanks! This patch series is available in the git branch at git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git cpu-latency-qos for easier access, but please note that it may be updated in response to review comments etc.
"Rafael J. Wysocki" <rjw@rjwysocki.net> writes: > Hi All, > > This series of patches is based on the observation that after commit > c3082a674f46 ("PM: QoS: Get rid of unused flags") the only global PM QoS class > in use is PM_QOS_CPU_DMA_LATENCY, but there is still a significant amount of > code dedicated to the handling of global PM QoS classes in general. That code > takes up space and adds overhead in vain, so it is better to get rid of it. > > Moreover, with that unuseful code removed, the interface for adding QoS > requests for CPU latency becomes inelegant and confusing, so it is better to > clean it up. > > Patches [01/28-12/28] do the first part described above, which also includes > some assorted cleanups of the core PM QoS code that doesn't go away. > > Patches [13/28-25/28] rework the CPU latency QoS interface (in the classic > "define stubs, migrate users, change the API proper" manner), patches > [26-27/28] update the general comments and documentation to match the code > after the previous changes and the last one makes the CPU latency QoS depend > on CPU_IDLE (because cpuidle is the only user of its target value today). > > The majority of the patches in this series don't change the functionality of > the code at all (at least not intentionally). > > Please refer to the changelogs of individual patches for details. > > Thanks! Hi Rafael, I believe some of the interfaces removed here could be useful in the near future. It goes back to the energy efficiency- (and IGP graphics performance-)improving series I submitted a while ago [1]. It relies on some mechanism for the graphics driver to report an I/O bottleneck to CPUFREQ, allowing it to make a more conservative trade-off between energy efficiency and latency, which can greatly reduce the CPU package energy usage of IO-bound applications (in some graphics benchmarks I've seen it reduced by over 40% on my ICL laptop), and therefore also allows TDP-bound applications to obtain a reciprocal improvement in throughput. I'm not particularly fond of the global PM QoS interfaces TBH, it seems like an excessively blunt hammer to me, so I can very much relate to the purpose of this series. However the finer-grained solution I've implemented has seen some push-back from i915 and CPUFREQ devs due to its complexity, since it relies on task scheduler changes in order to track IO bottlenecks per-process (roughly as suggested by Peter Zijlstra during our previous discussions), pretty much in the spirit of PELT but applied to IO utilization. With that in mind I was hoping we could take advantage of PM QoS as a temporary solution [2], by introducing a global PM QoS class similar but with roughly converse semantics to PM_QOS_CPU_DMA_LATENCY, allowing device drivers to report a *lower* bound on CPU latency beyond which PM shall not bother to reduce latency if doing so would have negative consequences on the energy efficiency and/or parallelism of the system. Of course one would expect the current PM_QOS_CPU_DMA_LATENCY upper bound to take precedence over the new lower bound in cases where the former is in conflict with the latter. I can think of several alternatives to that which don't involve temporarily holding off your clean-up, but none of them sound particularly exciting: 1/ Use an interface specific to CPUFREQ, pretty much like the one introduced in my original submission [1]. 
2/ Use per-CPU PM QoS, which AFAICT would require the graphics driver to either place a request on every CPU of the system (which would cause a frequent operation to have O(N) complexity on the number of CPUs on the system), or play a cat-and-mouse game with the task scheduler. 3/ Add a new global PM QoS mechanism roughly duplicating the cpu_latency_qos_* interfaces introduced in this series. Drop your change making this available to CPU IDLE only. 3/ Go straight to a scheduling-based approach, which is likely to greatly increase the review effort required to upstream this feature. (Peter might disagree though?) Regards, Francisco. [1] https://lore.kernel.org/linux-pm/20180328063845.4884-1-currojerez@riseup.net/ [2] I've written the code to do this already, but I wasn't able to test and debug it extensively until this week due to the instability of i915 on recent v5.5 kernels that prevented any benchmark run from surviving more than a few hours on my ICL system, hopefully the required i915 fixes will flow back to stable branches soon enough.
On Thu, Feb 13, 2020 at 12:31 AM Francisco Jerez <currojerez@riseup.net> wrote: > > "Rafael J. Wysocki" <rjw@rjwysocki.net> writes: > > > Hi All, > > > > This series of patches is based on the observation that after commit > > c3082a674f46 ("PM: QoS: Get rid of unused flags") the only global PM QoS class > > in use is PM_QOS_CPU_DMA_LATENCY, but there is still a significant amount of > > code dedicated to the handling of global PM QoS classes in general. That code > > takes up space and adds overhead in vain, so it is better to get rid of it. > > > > Moreover, with that unuseful code removed, the interface for adding QoS > > requests for CPU latency becomes inelegant and confusing, so it is better to > > clean it up. > > > > Patches [01/28-12/28] do the first part described above, which also includes > > some assorted cleanups of the core PM QoS code that doesn't go away. > > > > Patches [13/28-25/28] rework the CPU latency QoS interface (in the classic > > "define stubs, migrate users, change the API proper" manner), patches > > [26-27/28] update the general comments and documentation to match the code > > after the previous changes and the last one makes the CPU latency QoS depend > > on CPU_IDLE (because cpuidle is the only user of its target value today). > > > > The majority of the patches in this series don't change the functionality of > > the code at all (at least not intentionally). > > > > Please refer to the changelogs of individual patches for details. > > > > Thanks! > > Hi Rafael, > > I believe some of the interfaces removed here could be useful in the > near future. I disagree. > It goes back to the energy efficiency- (and IGP graphics > performance-)improving series I submitted a while ago [1]. It relies on > some mechanism for the graphics driver to report an I/O bottleneck to > CPUFREQ, allowing it to make a more conservative trade-off between > energy efficiency and latency, which can greatly reduce the CPU package > energy usage of IO-bound applications (in some graphics benchmarks I've > seen it reduced by over 40% on my ICL laptop), and therefore also allows > TDP-bound applications to obtain a reciprocal improvement in throughput. > > I'm not particularly fond of the global PM QoS interfaces TBH, it seems > like an excessively blunt hammer to me, so I can very much relate to the > purpose of this series. However the finer-grained solution I've > implemented has seen some push-back from i915 and CPUFREQ devs due to > its complexity, since it relies on task scheduler changes in order to > track IO bottlenecks per-process (roughly as suggested by Peter Zijlstra > during our previous discussions), pretty much in the spirit of PELT but > applied to IO utilization. > > With that in mind I was hoping we could take advantage of PM QoS as a > temporary solution [2], by introducing a global PM QoS class similar but > with roughly converse semantics to PM_QOS_CPU_DMA_LATENCY, allowing > device drivers to report a *lower* bound on CPU latency beyond which PM > shall not bother to reduce latency if doing so would have negative > consequences on the energy efficiency and/or parallelism of the system. So I really don't quite see how that could be responded to, by cpuidle say. What exactly do you mean by "reducing latency" in particular? > Of course one would expect the current PM_QOS_CPU_DMA_LATENCY upper > bound to take precedence over the new lower bound in cases where the > former is in conflict with the latter. So that needs to be done on top of this series. 
> I can think of several alternatives to that which don't involve > temporarily holding off your clean-up, The cleanup goes in. Please work on top of it. > but none of them sound particularly exciting: > > 1/ Use an interface specific to CPUFREQ, pretty much like the one > introduced in my original submission [1]. It uses frequency QoS already today, do you really need something else? > 2/ Use per-CPU PM QoS, which AFAICT would require the graphics driver > to either place a request on every CPU of the system (which would > cause a frequent operation to have O(N) complexity on the number of > CPUs on the system), or play a cat-and-mouse game with the task > scheduler. That's in place already too in the form of device PM QoS; see drivers/base/power/qos.c. > 3/ Add a new global PM QoS mechanism roughly duplicating the > cpu_latency_qos_* interfaces introduced in this series. Drop your > change making this available to CPU IDLE only. It sounds like you really want performance for energy efficiency and CPU latency has a little to do with that. > 3/ Go straight to a scheduling-based approach, which is likely to > greatly increase the review effort required to upstream this > feature. (Peter might disagree though?) Are you familiar with the utilization clamps mechanism? Thanks!
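The device PM QoS interface pointed to above (drivers/base/power/qos.c) attaches constraints to individual devices, so using it for CPU wake-up latency means one request per CPU device. A hedged sketch of what that could look like, which also makes the O(N)-requests concern from the previous message concrete; the example function is made up and error unwinding is omitted:

#include <linux/cpu.h>
#include <linux/cpumask.h>
#include <linux/pm_qos.h>
#include <linux/slab.h>

/* One request per CPU device -- the O(N) cost discussed above. */
static struct dev_pm_qos_request *cpu_lat_reqs;

static int example_limit_cpu_resume_latency(s32 limit_us)
{
	unsigned int cpu;
	int ret;

	cpu_lat_reqs = kcalloc(nr_cpu_ids, sizeof(*cpu_lat_reqs), GFP_KERNEL);
	if (!cpu_lat_reqs)
		return -ENOMEM;

	for_each_online_cpu(cpu) {
		ret = dev_pm_qos_add_request(get_cpu_device(cpu),
					     &cpu_lat_reqs[cpu],
					     DEV_PM_QOS_RESUME_LATENCY,
					     limit_us);
		if (ret < 0)
			return ret;	/* unwinding omitted for brevity */
	}

	return 0;
}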
On Thu, Feb 13, 2020 at 1:16 AM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Thu, Feb 13, 2020 at 12:31 AM Francisco Jerez <currojerez@riseup.net> wrote: > > > > "Rafael J. Wysocki" <rjw@rjwysocki.net> writes: > > > > > Hi All, > > > > > > This series of patches is based on the observation that after commit > > > c3082a674f46 ("PM: QoS: Get rid of unused flags") the only global PM QoS class > > > in use is PM_QOS_CPU_DMA_LATENCY, but there is still a significant amount of > > > code dedicated to the handling of global PM QoS classes in general. That code > > > takes up space and adds overhead in vain, so it is better to get rid of it. > > > > > > Moreover, with that unuseful code removed, the interface for adding QoS > > > requests for CPU latency becomes inelegant and confusing, so it is better to > > > clean it up. > > > > > > Patches [01/28-12/28] do the first part described above, which also includes > > > some assorted cleanups of the core PM QoS code that doesn't go away. > > > > > > Patches [13/28-25/28] rework the CPU latency QoS interface (in the classic > > > "define stubs, migrate users, change the API proper" manner), patches > > > [26-27/28] update the general comments and documentation to match the code > > > after the previous changes and the last one makes the CPU latency QoS depend > > > on CPU_IDLE (because cpuidle is the only user of its target value today). > > > > > > The majority of the patches in this series don't change the functionality of > > > the code at all (at least not intentionally). > > > > > > Please refer to the changelogs of individual patches for details. > > > > > > Thanks! > > > > Hi Rafael, > > > > I believe some of the interfaces removed here could be useful in the > > near future. > > I disagree. > > > It goes back to the energy efficiency- (and IGP graphics > > performance-)improving series I submitted a while ago [1]. It relies on > > some mechanism for the graphics driver to report an I/O bottleneck to > > CPUFREQ, allowing it to make a more conservative trade-off between > > energy efficiency and latency, which can greatly reduce the CPU package > > energy usage of IO-bound applications (in some graphics benchmarks I've > > seen it reduced by over 40% on my ICL laptop), and therefore also allows > > TDP-bound applications to obtain a reciprocal improvement in throughput. > > > > I'm not particularly fond of the global PM QoS interfaces TBH, it seems > > like an excessively blunt hammer to me, so I can very much relate to the > > purpose of this series. However the finer-grained solution I've > > implemented has seen some push-back from i915 and CPUFREQ devs due to > > its complexity, since it relies on task scheduler changes in order to > > track IO bottlenecks per-process (roughly as suggested by Peter Zijlstra > > during our previous discussions), pretty much in the spirit of PELT but > > applied to IO utilization. > > > > With that in mind I was hoping we could take advantage of PM QoS as a > > temporary solution [2], by introducing a global PM QoS class similar but > > with roughly converse semantics to PM_QOS_CPU_DMA_LATENCY, allowing > > device drivers to report a *lower* bound on CPU latency beyond which PM > > shall not bother to reduce latency if doing so would have negative > > consequences on the energy efficiency and/or parallelism of the system. > > So I really don't quite see how that could be responded to, by cpuidle > say. What exactly do you mean by "reducing latency" in particular? 
> > > Of course one would expect the current PM_QOS_CPU_DMA_LATENCY upper > > bound to take precedence over the new lower bound in cases where the > > former is in conflict with the latter. > > So that needs to be done on top of this series. > > > I can think of several alternatives to that which don't involve > > temporarily holding off your clean-up, > > The cleanup goes in. Please work on top of it. > > > but none of them sound particularly exciting: > > > > 1/ Use an interface specific to CPUFREQ, pretty much like the one > > introduced in my original submission [1]. > > It uses frequency QoS already today, do you really need something else? > > > 2/ Use per-CPU PM QoS, which AFAICT would require the graphics driver > > to either place a request on every CPU of the system (which would > > cause a frequent operation to have O(N) complexity on the number of > > CPUs on the system), or play a cat-and-mouse game with the task > > scheduler. > > That's in place already too in the form of device PM QoS; see > drivers/base/power/qos.c. > > > 3/ Add a new global PM QoS mechanism roughly duplicating the > > cpu_latency_qos_* interfaces introduced in this series. Drop your > > change making this available to CPU IDLE only. > > It sounds like you really want performance for energy efficiency and > CPU latency has a little to do with that. > > > 3/ Go straight to a scheduling-based approach, which is likely to > > greatly increase the review effort required to upstream this > > feature. (Peter might disagree though?) > > Are you familiar with the utilization clamps mechanism? And BTW, posting patches as RFC is fine even if they have not been tested. At least you let people know that you work on something this way, so if they work on changes in the same area, they may take that into consideration. Also if there are objections to your proposal, you may save quite a bit of time by sending it early. It is unfortunate that this series has clashed with the changes that you were about to propose, but in this particular case in my view it is better to clean up things and start over. Thanks!
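The "frequency QoS" referred to in the exchange above is the freq_qos_*() family (include/linux/pm_qos.h, kernel/power/qos.c), which constrains the minimum and maximum frequency cpufreq may pick for a policy. A hedged sketch of how a driver could use it to raise the frequency floor of the CPUs covered by one policy; the example functions are made up and error handling is trimmed:

#include <linux/cpufreq.h>
#include <linux/pm_qos.h>

static struct freq_qos_request min_freq_req;

static int example_clamp_min_freq(unsigned int cpu, s32 min_khz)
{
	struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
	int ret;

	if (!policy)
		return -ENODEV;

	/* Ask cpufreq not to run the CPUs of this policy below min_khz. */
	ret = freq_qos_add_request(&policy->constraints, &min_freq_req,
				   FREQ_QOS_MIN, min_khz);

	cpufreq_cpu_put(policy);
	return ret;
}

static void example_release_min_freq(void)
{
	freq_qos_remove_request(&min_freq_req);
}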
On Wed, Feb 12, 2020 at 5:09 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > Hi All, > > This series of patches is based on the observation that after commit > c3082a674f46 ("PM: QoS: Get rid of unused flags") the only global PM QoS class > in use is PM_QOS_CPU_DMA_LATENCY, but there is still a significant amount of > code dedicated to the handling of global PM QoS classes in general. That code > takes up space and adds overhead in vain, so it is better to get rid of it. > > Moreover, with that unuseful code removed, the interface for adding QoS > requests for CPU latency becomes inelegant and confusing, so it is better to > clean it up. > > Patches [01/28-12/28] do the first part described above, which also includes > some assorted cleanups of the core PM QoS code that doesn't go away. > > Patches [13/28-25/28] rework the CPU latency QoS interface (in the classic > "define stubs, migrate users, change the API proper" manner), patches > [26-27/28] update the general comments and documentation to match the code > after the previous changes and the last one makes the CPU latency QoS depend > on CPU_IDLE (because cpuidle is the only user of its target value today). > > The majority of the patches in this series don't change the functionality of > the code at all (at least not intentionally). > > Please refer to the changelogs of individual patches for details. Hi Rafael, Nice cleanup to the code and docs. I've reviewed the series, and briefly tested it by setting latencies from userspace. Can we not remove the debugfs interface? It is a quick way to check the global cpu latency clamp on the system from userspace without setting up tracepoints or writing a program to read /dev/cpu_dma_latency. Except for patch 01/28 removing the debugfs interface, please feel to add my Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org> Tested-by: Amit Kucheria <amit.kucheria@linaro.org> Regards, Amit
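For reference, the /dev/cpu_dma_latency interface mentioned above works by writing a 32-bit microsecond value and keeping the file descriptor open for as long as the constraint should apply; the request is dropped on close. A minimal user-space sketch (the 10 us value is arbitrary):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int32_t latency_us = 10;	/* arbitrary example value */
	int fd = open("/dev/cpu_dma_latency", O_RDWR);

	if (fd < 0 || write(fd, &latency_us, sizeof(latency_us)) < 0) {
		perror("cpu_dma_latency");
		return 1;
	}

	puts("Holding a 10 us CPU latency request; press Enter to release");
	getchar();

	close(fd);	/* releases the request */
	return 0;
}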
"Rafael J. Wysocki" <rafael@kernel.org> writes: > On Thu, Feb 13, 2020 at 12:31 AM Francisco Jerez <currojerez@riseup.net> wrote: >> >> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes: >> >> > Hi All, >> > >> > This series of patches is based on the observation that after commit >> > c3082a674f46 ("PM: QoS: Get rid of unused flags") the only global PM QoS class >> > in use is PM_QOS_CPU_DMA_LATENCY, but there is still a significant amount of >> > code dedicated to the handling of global PM QoS classes in general. That code >> > takes up space and adds overhead in vain, so it is better to get rid of it. >> > >> > Moreover, with that unuseful code removed, the interface for adding QoS >> > requests for CPU latency becomes inelegant and confusing, so it is better to >> > clean it up. >> > >> > Patches [01/28-12/28] do the first part described above, which also includes >> > some assorted cleanups of the core PM QoS code that doesn't go away. >> > >> > Patches [13/28-25/28] rework the CPU latency QoS interface (in the classic >> > "define stubs, migrate users, change the API proper" manner), patches >> > [26-27/28] update the general comments and documentation to match the code >> > after the previous changes and the last one makes the CPU latency QoS depend >> > on CPU_IDLE (because cpuidle is the only user of its target value today). >> > >> > The majority of the patches in this series don't change the functionality of >> > the code at all (at least not intentionally). >> > >> > Please refer to the changelogs of individual patches for details. >> > >> > Thanks! >> >> Hi Rafael, >> >> I believe some of the interfaces removed here could be useful in the >> near future. > > I disagree. > >> It goes back to the energy efficiency- (and IGP graphics >> performance-)improving series I submitted a while ago [1]. It relies on >> some mechanism for the graphics driver to report an I/O bottleneck to >> CPUFREQ, allowing it to make a more conservative trade-off between >> energy efficiency and latency, which can greatly reduce the CPU package >> energy usage of IO-bound applications (in some graphics benchmarks I've >> seen it reduced by over 40% on my ICL laptop), and therefore also allows >> TDP-bound applications to obtain a reciprocal improvement in throughput. >> >> I'm not particularly fond of the global PM QoS interfaces TBH, it seems >> like an excessively blunt hammer to me, so I can very much relate to the >> purpose of this series. However the finer-grained solution I've >> implemented has seen some push-back from i915 and CPUFREQ devs due to >> its complexity, since it relies on task scheduler changes in order to >> track IO bottlenecks per-process (roughly as suggested by Peter Zijlstra >> during our previous discussions), pretty much in the spirit of PELT but >> applied to IO utilization. >> >> With that in mind I was hoping we could take advantage of PM QoS as a >> temporary solution [2], by introducing a global PM QoS class similar but >> with roughly converse semantics to PM_QOS_CPU_DMA_LATENCY, allowing >> device drivers to report a *lower* bound on CPU latency beyond which PM >> shall not bother to reduce latency if doing so would have negative >> consequences on the energy efficiency and/or parallelism of the system. > > So I really don't quite see how that could be responded to, by cpuidle > say. What exactly do you mean by "reducing latency" in particular? 
> cpuidle wouldn't necessarily have to do anything about it since it would be intended merely as a hint that a device in the system other than the CPU has a bottleneck. It could provide a lower bound for the wake-up latency of the idle states that may be considered by cpuidle. It seems to me like it could be useful when a program can tell from the characteristics of the workload that a latency reduction below a certain time bound wouldn't materially affect the performance of the system (e.g. if you have 20 ms to render a GPU-bound frame, you may not care at all about the CPU taking a fraction of a millisecond more to wake up a few times each frame). For cpufreq I was planning to have it influence a time parameter of the utilization averaging done by the governor, which would allow it to have a more optimal response in the long term (in the sense of lowering the energy cost of performing the same work in the specified timeframe), even if such a large time parameter wouldn't normally be considered appropriate for utilization averaging due to latency concerns. >> Of course one would expect the current PM_QOS_CPU_DMA_LATENCY upper >> bound to take precedence over the new lower bound in cases where the >> former is in conflict with the latter. > > So that needs to be done on top of this series. > >> I can think of several alternatives to that which don't involve >> temporarily holding off your clean-up, > > The cleanup goes in. Please work on top of it. > Hopefully we can come up with an alternative in that case. TBH I'd love to see your clean-up go in too, but global PM QoS seemed fairly appealing as a way to split up my work so it could be reviewed incrementally, even though I'm aiming for a finer-grained solution than that. >> but none of them sound particularly exciting: >> >> 1/ Use an interface specific to CPUFREQ, pretty much like the one >> introduced in my original submission [1]. > > It uses frequency QoS already today, do you really need something else? > Yes. I don't see how frequency QoS could be useful for this as-is, unless we're willing to introduce code in every device driver that takes advantage of this and have them monitor the utilization of every CPU in the system, so they can calculate an appropriate max frequency constraint -- One which we can be reasonably certain won't hurt the long-term performance of the CPU cores these constraints are being placed on. >> 2/ Use per-CPU PM QoS, which AFAICT would require the graphics driver >> to either place a request on every CPU of the system (which would >> cause a frequent operation to have O(N) complexity on the number of >> CPUs on the system), or play a cat-and-mouse game with the task >> scheduler. > > That's in place already too in the form of device PM QoS; see > drivers/base/power/qos.c. But wouldn't that have the drawbacks I was talking about above when trying to use it in order to set this kind of constraints on CPU power management? > >> 3/ Add a new global PM QoS mechanism roughly duplicating the >> cpu_latency_qos_* interfaces introduced in this series. Drop your >> change making this available to CPU IDLE only. > > It sounds like you really want performance for energy efficiency and > CPU latency has a little to do with that. > The mechanism I've been working on isn't intended to sacrifice long-term performance of the CPU (e.g. 
if a CPU core is 100% utilized in the steady state by the same or an unrelated application the CPUFREQ governor should still request the maximum turbo frequency for it), it's only meant to affect the trade-off between energy efficiency and latency (e.g. the time it takes for the CPUFREQ governor to respond to an oscillation of the workload that chooses to opt in). >> 3/ Go straight to a scheduling-based approach, which is likely to >> greatly increase the review effort required to upstream this >> feature. (Peter might disagree though?) > > Are you familiar with the utilization clamps mechanism? > Sure, that would be a possibility as alternative to PM QoS, but it would most likely involve task scheduler changes to get it working effectively, which Srinivas and Rodrigo have asked me to leave out from my next RFC submission in the interest of reviewability. I wouldn't mind plumbing comparable information through utilization clamps instead or as follow-up if you think that's the way forward. > Thanks!
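Purely to make the proposal above concrete: a hypothetical sketch of a converse, "CPU response time tolerance" style interface mirroring the cpu_latency_qos_*() helpers from this series. None of these names exist in the kernel; they are placeholders for the semantics being discussed, i.e. a driver advertising the largest wake-up/response time it can tolerate so that PM need not spend energy reducing latency further:

#include <linux/pm_qos.h>

/*
 * HYPOTHETICAL declarations -- not present in the kernel.  They mirror
 * the cpu_latency_qos_*() helpers, but with converse semantics: the
 * caller reports a lower bound below which further CPU latency
 * reductions bring it no benefit.
 */
void cpu_response_qos_add_request(struct pm_qos_request *req, s32 tolerance_us);
void cpu_response_qos_remove_request(struct pm_qos_request *req);

static struct pm_qos_request gpu_req;

static void example_gpu_enter_io_bound_mode(void)
{
	/* "Up to ~500 us of extra CPU response time will not hurt me." */
	cpu_response_qos_add_request(&gpu_req, 500);
}

static void example_gpu_leave_io_bound_mode(void)
{
	cpu_response_qos_remove_request(&gpu_req);
}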
"Rafael J. Wysocki" <rafael@kernel.org> writes: > On Thu, Feb 13, 2020 at 1:16 AM Rafael J. Wysocki <rafael@kernel.org> wrote: >> >> On Thu, Feb 13, 2020 at 12:31 AM Francisco Jerez <currojerez@riseup.net> wrote: >> > >> > "Rafael J. Wysocki" <rjw@rjwysocki.net> writes: >> > >> > > Hi All, >> > > >> > > This series of patches is based on the observation that after commit >> > > c3082a674f46 ("PM: QoS: Get rid of unused flags") the only global PM QoS class >> > > in use is PM_QOS_CPU_DMA_LATENCY, but there is still a significant amount of >> > > code dedicated to the handling of global PM QoS classes in general. That code >> > > takes up space and adds overhead in vain, so it is better to get rid of it. >> > > >> > > Moreover, with that unuseful code removed, the interface for adding QoS >> > > requests for CPU latency becomes inelegant and confusing, so it is better to >> > > clean it up. >> > > >> > > Patches [01/28-12/28] do the first part described above, which also includes >> > > some assorted cleanups of the core PM QoS code that doesn't go away. >> > > >> > > Patches [13/28-25/28] rework the CPU latency QoS interface (in the classic >> > > "define stubs, migrate users, change the API proper" manner), patches >> > > [26-27/28] update the general comments and documentation to match the code >> > > after the previous changes and the last one makes the CPU latency QoS depend >> > > on CPU_IDLE (because cpuidle is the only user of its target value today). >> > > >> > > The majority of the patches in this series don't change the functionality of >> > > the code at all (at least not intentionally). >> > > >> > > Please refer to the changelogs of individual patches for details. >> > > >> > > Thanks! >> > >> > Hi Rafael, >> > >> > I believe some of the interfaces removed here could be useful in the >> > near future. >> >> I disagree. >> >> > It goes back to the energy efficiency- (and IGP graphics >> > performance-)improving series I submitted a while ago [1]. It relies on >> > some mechanism for the graphics driver to report an I/O bottleneck to >> > CPUFREQ, allowing it to make a more conservative trade-off between >> > energy efficiency and latency, which can greatly reduce the CPU package >> > energy usage of IO-bound applications (in some graphics benchmarks I've >> > seen it reduced by over 40% on my ICL laptop), and therefore also allows >> > TDP-bound applications to obtain a reciprocal improvement in throughput. >> > >> > I'm not particularly fond of the global PM QoS interfaces TBH, it seems >> > like an excessively blunt hammer to me, so I can very much relate to the >> > purpose of this series. However the finer-grained solution I've >> > implemented has seen some push-back from i915 and CPUFREQ devs due to >> > its complexity, since it relies on task scheduler changes in order to >> > track IO bottlenecks per-process (roughly as suggested by Peter Zijlstra >> > during our previous discussions), pretty much in the spirit of PELT but >> > applied to IO utilization. >> > >> > With that in mind I was hoping we could take advantage of PM QoS as a >> > temporary solution [2], by introducing a global PM QoS class similar but >> > with roughly converse semantics to PM_QOS_CPU_DMA_LATENCY, allowing >> > device drivers to report a *lower* bound on CPU latency beyond which PM >> > shall not bother to reduce latency if doing so would have negative >> > consequences on the energy efficiency and/or parallelism of the system. 
>> >> So I really don't quite see how that could be responded to, by cpuidle >> say. What exactly do you mean by "reducing latency" in particular? >> >> > Of course one would expect the current PM_QOS_CPU_DMA_LATENCY upper >> > bound to take precedence over the new lower bound in cases where the >> > former is in conflict with the latter. >> >> So that needs to be done on top of this series. >> >> > I can think of several alternatives to that which don't involve >> > temporarily holding off your clean-up, >> >> The cleanup goes in. Please work on top of it. >> >> > but none of them sound particularly exciting: >> > >> > 1/ Use an interface specific to CPUFREQ, pretty much like the one >> > introduced in my original submission [1]. >> >> It uses frequency QoS already today, do you really need something else? >> >> > 2/ Use per-CPU PM QoS, which AFAICT would require the graphics driver >> > to either place a request on every CPU of the system (which would >> > cause a frequent operation to have O(N) complexity on the number of >> > CPUs on the system), or play a cat-and-mouse game with the task >> > scheduler. >> >> That's in place already too in the form of device PM QoS; see >> drivers/base/power/qos.c. >> >> > 3/ Add a new global PM QoS mechanism roughly duplicating the >> > cpu_latency_qos_* interfaces introduced in this series. Drop your >> > change making this available to CPU IDLE only. >> >> It sounds like you really want performance for energy efficiency and >> CPU latency has a little to do with that. >> >> > 3/ Go straight to a scheduling-based approach, which is likely to >> > greatly increase the review effort required to upstream this >> > feature. (Peter might disagree though?) >> >> Are you familiar with the utilization clamps mechanism? > > And BTW, posting patches as RFC is fine even if they have not been > tested. At least you let people know that you work on something this > way, so if they work on changes in the same area, they may take that > into consideration. > Sure, that was going to be the first RFC. > Also if there are objections to your proposal, you may save quite a > bit of time by sending it early. > > It is unfortunate that this series has clashed with the changes that > you were about to propose, but in this particular case in my view it > is better to clean up things and start over. > Luckily it doesn't clash with the second RFC I was meaning to send, maybe we should just skip the first? Or maybe it's valuable as a curiosity anyway? > Thanks!
On Thu, Feb 13, 2020 at 8:10 AM Amit Kucheria <amit.kucheria@linaro.org> wrote: > > On Wed, Feb 12, 2020 at 5:09 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > > > Hi All, > > > > This series of patches is based on the observation that after commit > > c3082a674f46 ("PM: QoS: Get rid of unused flags") the only global PM QoS class > > in use is PM_QOS_CPU_DMA_LATENCY, but there is still a significant amount of > > code dedicated to the handling of global PM QoS classes in general. That code > > takes up space and adds overhead in vain, so it is better to get rid of it. > > > > Moreover, with that unuseful code removed, the interface for adding QoS > > requests for CPU latency becomes inelegant and confusing, so it is better to > > clean it up. > > > > Patches [01/28-12/28] do the first part described above, which also includes > > some assorted cleanups of the core PM QoS code that doesn't go away. > > > > Patches [13/28-25/28] rework the CPU latency QoS interface (in the classic > > "define stubs, migrate users, change the API proper" manner), patches > > [26-27/28] update the general comments and documentation to match the code > > after the previous changes and the last one makes the CPU latency QoS depend > > on CPU_IDLE (because cpuidle is the only user of its target value today). > > > > The majority of the patches in this series don't change the functionality of > > the code at all (at least not intentionally). > > > > Please refer to the changelogs of individual patches for details. > > Hi Rafael, > > Nice cleanup to the code and docs. > > I've reviewed the series, and briefly tested it by setting latencies > from userspace. Can we not remove the debugfs interface? It is a quick > way to check the global cpu latency clamp on the system from userspace > without setting up tracepoints or writing a program to read > /dev/cpu_dma_latency. Come on. What about in Python? #!/usr/bin/env python import numpy as np if __name__ == '__main__': f = open("/dev/cpu_dma_latency", "r") print(np.fromfile(f, dtype=np.int32, count=1)) f.close() And probably you can do it in at least 20 different ways. :-) Also note that "echo the_debugfs_thing" does the equivalent, but the conversion takes place in the kernel. Is it really a good idea to carry the whole debugfs interface because of that one conversion? > Except for patch 01/28 removing the debugfs interface, please feel to add my > > Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org> > Tested-by: Amit Kucheria <amit.kucheria@linaro.org> Thanks!
On Thu, Feb 13, 2020 at 11:17 AM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Thu, Feb 13, 2020 at 8:10 AM Amit Kucheria <amit.kucheria@linaro.org> wrote: > > > > On Wed, Feb 12, 2020 at 5:09 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > > > > > Hi All, > > > > > > This series of patches is based on the observation that after commit > > > c3082a674f46 ("PM: QoS: Get rid of unused flags") the only global PM QoS class > > > in use is PM_QOS_CPU_DMA_LATENCY, but there is still a significant amount of > > > code dedicated to the handling of global PM QoS classes in general. That code > > > takes up space and adds overhead in vain, so it is better to get rid of it. > > > > > > Moreover, with that unuseful code removed, the interface for adding QoS > > > requests for CPU latency becomes inelegant and confusing, so it is better to > > > clean it up. > > > > > > Patches [01/28-12/28] do the first part described above, which also includes > > > some assorted cleanups of the core PM QoS code that doesn't go away. > > > > > > Patches [13/28-25/28] rework the CPU latency QoS interface (in the classic > > > "define stubs, migrate users, change the API proper" manner), patches > > > [26-27/28] update the general comments and documentation to match the code > > > after the previous changes and the last one makes the CPU latency QoS depend > > > on CPU_IDLE (because cpuidle is the only user of its target value today). > > > > > > The majority of the patches in this series don't change the functionality of > > > the code at all (at least not intentionally). > > > > > > Please refer to the changelogs of individual patches for details. > > > > Hi Rafael, > > > > Nice cleanup to the code and docs. > > > > I've reviewed the series, and briefly tested it by setting latencies > > from userspace. Can we not remove the debugfs interface? It is a quick > > way to check the global cpu latency clamp on the system from userspace > > without setting up tracepoints or writing a program to read > > /dev/cpu_dma_latency. > > Come on. > > What about in Python? > > #!/usr/bin/env python > import numpy as np > > if __name__ == '__main__': > f = open("/dev/cpu_dma_latency", "r") > print(np.fromfile(f, dtype=np.int32, count=1)) > f.close() > > And probably you can do it in at least 20 different ways. :-) > > Also note that "echo the_debugfs_thing" does the equivalent, but the I obviously meant "cat the_debugfs_thing" here, sorry.
On Thu, Feb 13, 2020 at 3:47 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Thu, Feb 13, 2020 at 8:10 AM Amit Kucheria <amit.kucheria@linaro.org> wrote: > > > > On Wed, Feb 12, 2020 at 5:09 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > > > > > Hi All, > > > > > > This series of patches is based on the observation that after commit > > > c3082a674f46 ("PM: QoS: Get rid of unused flags") the only global PM QoS class > > > in use is PM_QOS_CPU_DMA_LATENCY, but there is still a significant amount of > > > code dedicated to the handling of global PM QoS classes in general. That code > > > takes up space and adds overhead in vain, so it is better to get rid of it. > > > > > > Moreover, with that unuseful code removed, the interface for adding QoS > > > requests for CPU latency becomes inelegant and confusing, so it is better to > > > clean it up. > > > > > > Patches [01/28-12/28] do the first part described above, which also includes > > > some assorted cleanups of the core PM QoS code that doesn't go away. > > > > > > Patches [13/28-25/28] rework the CPU latency QoS interface (in the classic > > > "define stubs, migrate users, change the API proper" manner), patches > > > [26-27/28] update the general comments and documentation to match the code > > > after the previous changes and the last one makes the CPU latency QoS depend > > > on CPU_IDLE (because cpuidle is the only user of its target value today). > > > > > > The majority of the patches in this series don't change the functionality of > > > the code at all (at least not intentionally). > > > > > > Please refer to the changelogs of individual patches for details. > > > > Hi Rafael, > > > > Nice cleanup to the code and docs. > > > > I've reviewed the series, and briefly tested it by setting latencies > > from userspace. Can we not remove the debugfs interface? It is a quick > > way to check the global cpu latency clamp on the system from userspace > > without setting up tracepoints or writing a program to read > > /dev/cpu_dma_latency. > > Come on. > > What about in Python? > > #!/usr/bin/env python > import numpy as np > > if __name__ == '__main__': > f = open("/dev/cpu_dma_latency", "r") > print(np.fromfile(f, dtype=np.int32, count=1)) > f.close() > > And probably you can do it in at least 20 different ways. :-) Indeed, I can, just not as straightforward as "cat /debugfs/filename" when you don't have python or perl in your buildroot initramfs. Some hexdump/od acrobatics will yield the value, I guess. > Also note that "echo the_debugfs_thing" does the equivalent, but the > conversion takes place in the kernel. Is it really a good idea to > carry the whole debugfs interface because of that one conversion? > > > Except for patch 01/28 removing the debugfs interface, please feel to add my > > > > Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org> > > Tested-by: Amit Kucheria <amit.kucheria@linaro.org> > > Thanks!
On Thu, Feb 13, 2020 at 9:07 AM Francisco Jerez <currojerez@riseup.net> wrote: > > "Rafael J. Wysocki" <rafael@kernel.org> writes: > > > On Thu, Feb 13, 2020 at 12:31 AM Francisco Jerez <currojerez@riseup.net> wrote: > >> > >> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes: > >> > >> > Hi All, > >> > > >> > This series of patches is based on the observation that after commit > >> > c3082a674f46 ("PM: QoS: Get rid of unused flags") the only global PM QoS class > >> > in use is PM_QOS_CPU_DMA_LATENCY, but there is still a significant amount of > >> > code dedicated to the handling of global PM QoS classes in general. That code > >> > takes up space and adds overhead in vain, so it is better to get rid of it. > >> > > >> > Moreover, with that unuseful code removed, the interface for adding QoS > >> > requests for CPU latency becomes inelegant and confusing, so it is better to > >> > clean it up. > >> > > >> > Patches [01/28-12/28] do the first part described above, which also includes > >> > some assorted cleanups of the core PM QoS code that doesn't go away. > >> > > >> > Patches [13/28-25/28] rework the CPU latency QoS interface (in the classic > >> > "define stubs, migrate users, change the API proper" manner), patches > >> > [26-27/28] update the general comments and documentation to match the code > >> > after the previous changes and the last one makes the CPU latency QoS depend > >> > on CPU_IDLE (because cpuidle is the only user of its target value today). > >> > > >> > The majority of the patches in this series don't change the functionality of > >> > the code at all (at least not intentionally). > >> > > >> > Please refer to the changelogs of individual patches for details. > >> > > >> > Thanks! > >> > >> Hi Rafael, > >> > >> I believe some of the interfaces removed here could be useful in the > >> near future. > > > > I disagree. > > > >> It goes back to the energy efficiency- (and IGP graphics > >> performance-)improving series I submitted a while ago [1]. It relies on > >> some mechanism for the graphics driver to report an I/O bottleneck to > >> CPUFREQ, allowing it to make a more conservative trade-off between > >> energy efficiency and latency, which can greatly reduce the CPU package > >> energy usage of IO-bound applications (in some graphics benchmarks I've > >> seen it reduced by over 40% on my ICL laptop), and therefore also allows > >> TDP-bound applications to obtain a reciprocal improvement in throughput. > >> > >> I'm not particularly fond of the global PM QoS interfaces TBH, it seems > >> like an excessively blunt hammer to me, so I can very much relate to the > >> purpose of this series. However the finer-grained solution I've > >> implemented has seen some push-back from i915 and CPUFREQ devs due to > >> its complexity, since it relies on task scheduler changes in order to > >> track IO bottlenecks per-process (roughly as suggested by Peter Zijlstra > >> during our previous discussions), pretty much in the spirit of PELT but > >> applied to IO utilization. > >> > >> With that in mind I was hoping we could take advantage of PM QoS as a > >> temporary solution [2], by introducing a global PM QoS class similar but > >> with roughly converse semantics to PM_QOS_CPU_DMA_LATENCY, allowing > >> device drivers to report a *lower* bound on CPU latency beyond which PM > >> shall not bother to reduce latency if doing so would have negative > >> consequences on the energy efficiency and/or parallelism of the system. 
> > > > So I really don't quite see how that could be responded to, by cpuidle > > say. What exactly do you mean by "reducing latency" in particular? > > > > cpuidle wouldn't necessarily have to do anything about it since it would > be intended merely as a hint that a device in the system other than the > CPU has a bottleneck. It could provide a lower bound for the wake-up > latency of the idle states that may be considered by cpuidle. It seems > to me like it could be useful when a program can tell from the > characteristics of the workload that a latency reduction below a certain > time bound wouldn't materially affect the performance of the system > (e.g. if you have 20 ms to render a GPU-bound frame, you may not care at > all about the CPU taking a fraction of a millisecond more to wake up a > few times each frame). Well, this is not how cpuidle works. What it does is to try to find the deepest idle state that makes sense to let the CPU go into given all of the constraints etc. IOW it never tries to reduce the latency, it looks how far it can go with possible energy savings given a specific latency limit (or no limit at all). > For cpufreq I was planning to have it influence a time parameter of the > utilization averaging done by the governor, which would allow it to have > a more optimal response in the long term (in the sense of lowering the > energy cost of performing the same work in the specified timeframe), > even if such a large time parameter wouldn't normally be considered > appropriate for utilization averaging due to latency concerns. So this is fine in the schedutil case in principle, it but would not work with HWP, because that doesn't take the scheduler's utilization metrics into account. To cover the HWP case you need to influence the min and max frequency limits, realistically. > >> Of course one would expect the current PM_QOS_CPU_DMA_LATENCY upper > >> bound to take precedence over the new lower bound in cases where the > >> former is in conflict with the latter. > > > > So that needs to be done on top of this series. > > > >> I can think of several alternatives to that which don't involve > >> temporarily holding off your clean-up, > > > > The cleanup goes in. Please work on top of it. > > > > Hopefully we can come up with an alternative in that case. TBH I'd love > to see your clean-up go in too, but global PM QoS seemed fairly > appealing as a way to split up my work so it could be reviewed > incrementally, even though I'm aiming for a finer-grained solution than > that. Well, so "global PM QoS" really means a struct struct pm_qos_constraints object with a global reader of its target_value. Of course, pm_qos_update_target() is not particularly convenient to use, so you'd need to wrap it into an _add/update/remove_request() family of functions along the lines of the cpu_latency_qos_*() ones I suppose and you won't need the _apply() thing. > >> but none of them sound particularly exciting: > >> > >> 1/ Use an interface specific to CPUFREQ, pretty much like the one > >> introduced in my original submission [1]. > > > > It uses frequency QoS already today, do you really need something else? > > > > Yes. 
I don't see how frequency QoS could be useful for this as-is, > unless we're willing to introduce code in every device driver that takes > advantage of this and have them monitor the utilization of every CPU in > the system, so they can calculate an appropriate max frequency > constraint -- One which we can be reasonably certain won't hurt the > long-term performance of the CPU cores these constraints are being > placed on. I'm not really sure if I understand you correctly. The frequency QoS in cpufreq is a way to influence the min and max freq limits used by it for each CPU. That is done in a couple of places like store_max/min_perf_pct() in intel_pstate or processor_set_cur_state() (I guess the latter would be close to what you think about, but the other way around - you seem to want to influence the min and not the max). Now, the question what request value(s) to put in there and how to compute them is kind of a different one. > >> 2/ Use per-CPU PM QoS, which AFAICT would require the graphics driver > >> to either place a request on every CPU of the system (which would > >> cause a frequent operation to have O(N) complexity on the number of > >> CPUs on the system), or play a cat-and-mouse game with the task > >> scheduler. > > > > That's in place already too in the form of device PM QoS; see > > drivers/base/power/qos.c. > > But wouldn't that have the drawbacks I was talking about above when > trying to use it in order to set this kind of constraints on CPU power > management? I guess so, but the alternatives have drawbacks too. > > > >> 3/ Add a new global PM QoS mechanism roughly duplicating the > >> cpu_latency_qos_* interfaces introduced in this series. Drop your > >> change making this available to CPU IDLE only. > > > > It sounds like you really want performance for energy efficiency and > > CPU latency has a little to do with that. > > > > The mechanism I've been working on isn't intended to sacrifice long-term > performance of the CPU (e.g. if a CPU core is 100% utilized in the > steady state by the same or an unrelated application the CPUFREQ > governor should still request the maximum turbo frequency for it), it's > only meant to affect the trade-off between energy efficiency and latency > (e.g. the time it takes for the CPUFREQ governor to respond to an > oscillation of the workload that chooses to opt in). So the meaning of "latency" here is really different from the meaning of "latency" in the cpuidle context and in RT. I guess it would be better to call it "response time" in this case to avoid confusion. Let me ask if I understand you correctly: the problem is that for some workloads the time it takes to ramp up the frequency to an acceptable (or desirable, more generally) level is too high, so the approach under consideration is to clamp the min frequency, either effectively or directly, so as to reduce that time? > >> 3/ Go straight to a scheduling-based approach, which is likely to > >> greatly increase the review effort required to upstream this > >> feature. (Peter might disagree though?) > > > > Are you familiar with the utilization clamps mechanism? > > > > Sure, that would be a possibility as alternative to PM QoS, but it would > most likely involve task scheduler changes to get it working > effectively, which Srinivas and Rodrigo have asked me to leave out from > my next RFC submission in the interest of reviewability. I wouldn't > mind plumbing comparable information through utilization clamps instead > or as follow-up if you think that's the way forward. 
Well, like I said somewhere above (or previously), the main problem with utilization clamps is that they have no effect on HWP at the moment. Currently, there is a little connection between the scheduler and the HWP algorithm running on the processor. However, I would like to investigate that path, because the utilization clamps provide a good interface for applications to request a certain level of service from the scheduler (they can really be regarded as a QoS mechanism too) and connecting them to the HWP min and max limits somehow might work. Thanks!
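For readers unfamiliar with the utilization clamps mechanism referred to above: it lets a task (or cgroup) ask the scheduler to treat its utilization as no lower/higher than given bounds on a 0..1024 scale, which in turn feeds frequency selection. A minimal user-space sketch using the raw sched_setattr() syscall, for which there is no glibc wrapper; the clamp values chosen here are arbitrary:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* No glibc definition of struct sched_attr, so declare it locally. */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

#define SCHED_FLAG_KEEP_POLICY		0x08
#define SCHED_FLAG_KEEP_PARAMS		0x10
#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX	0x40

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	/* Keep the current policy/params, only update the clamps. */
	attr.sched_flags = SCHED_FLAG_KEEP_POLICY | SCHED_FLAG_KEEP_PARAMS |
			   SCHED_FLAG_UTIL_CLAMP_MIN | SCHED_FLAG_UTIL_CLAMP_MAX;
	attr.sched_util_min = 256;	/* ~25% of max capacity (arbitrary) */
	attr.sched_util_max = 1024;	/* no upper clamp */

	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}
	return 0;
}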
On Thu, Feb 13, 2020 at 11:50 AM Amit Kucheria <amit.kucheria@linaro.org> wrote: > > On Thu, Feb 13, 2020 at 3:47 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > On Thu, Feb 13, 2020 at 8:10 AM Amit Kucheria <amit.kucheria@linaro.org> wrote: > > > > > > On Wed, Feb 12, 2020 at 5:09 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote: > > > > > > > > Hi All, > > > > > > > > This series of patches is based on the observation that after commit > > > > c3082a674f46 ("PM: QoS: Get rid of unused flags") the only global PM QoS class > > > > in use is PM_QOS_CPU_DMA_LATENCY, but there is still a significant amount of > > > > code dedicated to the handling of global PM QoS classes in general. That code > > > > takes up space and adds overhead in vain, so it is better to get rid of it. > > > > > > > > Moreover, with that unuseful code removed, the interface for adding QoS > > > > requests for CPU latency becomes inelegant and confusing, so it is better to > > > > clean it up. > > > > > > > > Patches [01/28-12/28] do the first part described above, which also includes > > > > some assorted cleanups of the core PM QoS code that doesn't go away. > > > > > > > > Patches [13/28-25/28] rework the CPU latency QoS interface (in the classic > > > > "define stubs, migrate users, change the API proper" manner), patches > > > > [26-27/28] update the general comments and documentation to match the code > > > > after the previous changes and the last one makes the CPU latency QoS depend > > > > on CPU_IDLE (because cpuidle is the only user of its target value today). > > > > > > > > The majority of the patches in this series don't change the functionality of > > > > the code at all (at least not intentionally). > > > > > > > > Please refer to the changelogs of individual patches for details. > > > > > > Hi Rafael, > > > > > > Nice cleanup to the code and docs. > > > > > > I've reviewed the series, and briefly tested it by setting latencies > > > from userspace. Can we not remove the debugfs interface? It is a quick > > > way to check the global cpu latency clamp on the system from userspace > > > without setting up tracepoints or writing a program to read > > > /dev/cpu_dma_latency. > > > > Come on. > > > > What about in Python? > > > > #!/usr/bin/env python > > import numpy as np > > > > if __name__ == '__main__': > > f = open("/dev/cpu_dma_latency", "r") > > print(np.fromfile(f, dtype=np.int32, count=1)) > > f.close() > > > > And probably you can do it in at least 20 different ways. :-) > > Indeed, I can, just not as straightforward as "cat /debugfs/filename" > when you don't have python or perl in your buildroot initramfs. > > Some hexdump/od acrobatics will yield the value, I guess. Right, # hexdump --format '"%d\n"' /dev/cpu_dma_latency works just fine, actually. Thanks!
On Thu, Feb 13, 2020 at 9:09 AM Francisco Jerez <currojerez@riseup.net> wrote: > > "Rafael J. Wysocki" <rafael@kernel.org> writes: > > > On Thu, Feb 13, 2020 at 1:16 AM Rafael J. Wysocki <rafael@kernel.org> wrote: > >> > >> On Thu, Feb 13, 2020 at 12:31 AM Francisco Jerez <currojerez@riseup.net> wrote: > >> > [cut] > > > > And BTW, posting patches as RFC is fine even if they have not been > > tested. At least you let people know that you work on something this > > way, so if they work on changes in the same area, they may take that > > into consideration. > > > > Sure, that was going to be the first RFC. > > > Also if there are objections to your proposal, you may save quite a > > bit of time by sending it early. > > > > It is unfortunate that this series has clashed with the changes that > > you were about to propose, but in this particular case in my view it > > is better to clean up things and start over. > > > > Luckily it doesn't clash with the second RFC I was meaning to send, > maybe we should just skip the first? Yes, please. > Or maybe it's valuable as a curiosity anyway? No, let's just focus on the latest one. Thanks!
On Thu, Feb 13, 2020 at 12:34 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Thu, Feb 13, 2020 at 9:07 AM Francisco Jerez <currojerez@riseup.net> wrote: > > > > "Rafael J. Wysocki" <rafael@kernel.org> writes: > > > > > On Thu, Feb 13, 2020 at 12:31 AM Francisco Jerez <currojerez@riseup.net> wrote: > > >> > > >> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes: > > >> > > >> > Hi All, > > >> > > > >> > This series of patches is based on the observation that after commit > > >> > c3082a674f46 ("PM: QoS: Get rid of unused flags") the only global PM QoS class > > >> > in use is PM_QOS_CPU_DMA_LATENCY, but there is still a significant amount of > > >> > code dedicated to the handling of global PM QoS classes in general. That code > > >> > takes up space and adds overhead in vain, so it is better to get rid of it. > > >> > > > >> > Moreover, with that unuseful code removed, the interface for adding QoS > > >> > requests for CPU latency becomes inelegant and confusing, so it is better to > > >> > clean it up. > > >> > > > >> > Patches [01/28-12/28] do the first part described above, which also includes > > >> > some assorted cleanups of the core PM QoS code that doesn't go away. > > >> > > > >> > Patches [13/28-25/28] rework the CPU latency QoS interface (in the classic > > >> > "define stubs, migrate users, change the API proper" manner), patches > > >> > [26-27/28] update the general comments and documentation to match the code > > >> > after the previous changes and the last one makes the CPU latency QoS depend > > >> > on CPU_IDLE (because cpuidle is the only user of its target value today). > > >> > > > >> > The majority of the patches in this series don't change the functionality of > > >> > the code at all (at least not intentionally). > > >> > > > >> > Please refer to the changelogs of individual patches for details. > > >> > > > >> > Thanks! > > >> > > >> Hi Rafael, > > >> > > >> I believe some of the interfaces removed here could be useful in the > > >> near future. > > > > > > I disagree. > > > > > >> It goes back to the energy efficiency- (and IGP graphics > > >> performance-)improving series I submitted a while ago [1]. It relies on > > >> some mechanism for the graphics driver to report an I/O bottleneck to > > >> CPUFREQ, allowing it to make a more conservative trade-off between > > >> energy efficiency and latency, which can greatly reduce the CPU package > > >> energy usage of IO-bound applications (in some graphics benchmarks I've > > >> seen it reduced by over 40% on my ICL laptop), and therefore also allows > > >> TDP-bound applications to obtain a reciprocal improvement in throughput. > > >> > > >> I'm not particularly fond of the global PM QoS interfaces TBH, it seems > > >> like an excessively blunt hammer to me, so I can very much relate to the > > >> purpose of this series. However the finer-grained solution I've > > >> implemented has seen some push-back from i915 and CPUFREQ devs due to > > >> its complexity, since it relies on task scheduler changes in order to > > >> track IO bottlenecks per-process (roughly as suggested by Peter Zijlstra > > >> during our previous discussions), pretty much in the spirit of PELT but > > >> applied to IO utilization. 
> > >> > > >> With that in mind I was hoping we could take advantage of PM QoS as a > > >> temporary solution [2], by introducing a global PM QoS class similar but > > >> with roughly converse semantics to PM_QOS_CPU_DMA_LATENCY, allowing > > >> device drivers to report a *lower* bound on CPU latency beyond which PM > > >> shall not bother to reduce latency if doing so would have negative > > >> consequences on the energy efficiency and/or parallelism of the system. > > > > > > So I really don't quite see how that could be responded to, by cpuidle > > > say. What exactly do you mean by "reducing latency" in particular? > > > > > > > cpuidle wouldn't necessarily have to do anything about it since it would > > be intended merely as a hint that a device in the system other than the > > CPU has a bottleneck. It could provide a lower bound for the wake-up > > latency of the idle states that may be considered by cpuidle. It seems > > to me like it could be useful when a program can tell from the > > characteristics of the workload that a latency reduction below a certain > > time bound wouldn't materially affect the performance of the system > > (e.g. if you have 20 ms to render a GPU-bound frame, you may not care at > > all about the CPU taking a fraction of a millisecond more to wake up a > > few times each frame). > > Well, this is not how cpuidle works. > > What it does is to try to find the deepest idle state that makes sense > to let the CPU go into given all of the constraints etc. IOW it never > tries to reduce the latency, it looks how far it can go with possible > energy savings given a specific latency limit (or no limit at all). > > > For cpufreq I was planning to have it influence a time parameter of the > > utilization averaging done by the governor, which would allow it to have > > a more optimal response in the long term (in the sense of lowering the > > energy cost of performing the same work in the specified timeframe), > > even if such a large time parameter wouldn't normally be considered > > appropriate for utilization averaging due to latency concerns. > > So this is fine in the schedutil case in principle, it but would not > work with HWP, because that doesn't take the scheduler's utilization > metrics into account. > > To cover the HWP case you need to influence the min and max frequency > limits, realistically. > > > >> Of course one would expect the current PM_QOS_CPU_DMA_LATENCY upper > > >> bound to take precedence over the new lower bound in cases where the > > >> former is in conflict with the latter. > > > > > > So that needs to be done on top of this series. > > > > > >> I can think of several alternatives to that which don't involve > > >> temporarily holding off your clean-up, > > > > > > The cleanup goes in. Please work on top of it. > > > > > > > Hopefully we can come up with an alternative in that case. TBH I'd love > > to see your clean-up go in too, but global PM QoS seemed fairly > > appealing as a way to split up my work so it could be reviewed > > incrementally, even though I'm aiming for a finer-grained solution than > > that. > > Well, so "global PM QoS" really means a struct struct > pm_qos_constraints object with a global reader of its target_value. > > Of course, pm_qos_update_target() is not particularly convenient to > use, so you'd need to wrap it into an _add/update/remove_request() > family of functions along the lines of the cpu_latency_qos_*() ones I > suppose and you won't need the _apply() thing. 
> > > >> but none of them sound particularly exciting: > > >> > > >> 1/ Use an interface specific to CPUFREQ, pretty much like the one > > >> introduced in my original submission [1]. > > > > > > It uses frequency QoS already today, do you really need something else? > > > > > > > Yes. I don't see how frequency QoS could be useful for this as-is, > > unless we're willing to introduce code in every device driver that takes > > advantage of this and have them monitor the utilization of every CPU in > > the system, so they can calculate an appropriate max frequency > > constraint -- One which we can be reasonably certain won't hurt the > > long-term performance of the CPU cores these constraints are being > > placed on. > > I'm not really sure if I understand you correctly. > > The frequency QoS in cpufreq is a way to influence the min and max > freq limits used by it for each CPU. That is done in a couple of > places like store_max/min_perf_pct() in intel_pstate or > processor_set_cur_state() (I guess the latter would be close to what > you think about, but the other way around - you seem to want to > influence the min and not the max). It looks like *I* got this part the other way around. :-/ I think that your use case is almost equivalent to the thermal pressure one, so you'd want to limit the max and so that would be something similar to store_max_perf_pct() with its input side hooked up to a QoS list. But it looks like that QoS list would rather be of a "reservation" type, so a request added to it would mean something like "leave this fraction of power that appears to be available to the CPU subsystem unused, because I need it for a different purpose". And in principle there might be multiple requests in there at the same time and those "reservations" would add up. So that would be a kind of "limited sum" QoS type which wasn't even there before my changes. A user of that QoS list might then do something like ret = cpu_power_reserve_add(1, 4); meaning that it wants 25% of the "potential" CPU power to be not utilized by CPU performance scaling and that could affect the scheduler through load modifications (kind of along the thermal pressure patchset discussed some time ago) and HWP (as well as the non-HWP intel_pstate by preventing turbo frequencies from being used etc).
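To make the wrapping suggestion above a bit more concrete, here is a rough sketch of what such a family of helpers might look like on top of the reworked core. All of the cpu_response_qos_*() names are hypothetical, the struct pm_qos_request / pm_qos_update_target() details are recalled from the post-cleanup code rather than taken from an actual patch, and the PM_QOS_MIN aggregation is an assumption (the smallest tolerated response time across requests would win):

/* Sketch only: every name except the core pm_qos_* symbols is hypothetical. */
static struct pm_qos_constraints cpu_response_constraints = {
	.list = PLIST_HEAD_INIT(cpu_response_constraints.list),
	.target_value = S32_MAX,		/* no response-time requirement */
	.default_value = S32_MAX,
	.no_constraint_value = S32_MAX,
	.type = PM_QOS_MIN,			/* most stringent request wins */
};

/* Global reader of the target value, analogous to cpu_latency_qos_limit(). */
s32 cpu_response_qos_limit(void)
{
	return READ_ONCE(cpu_response_constraints.target_value);
}

void cpu_response_qos_add_request(struct pm_qos_request *req, s32 value)
{
	req->qos = &cpu_response_constraints;
	pm_qos_update_target(req->qos, &req->node, PM_QOS_ADD_REQ, value);
}

void cpu_response_qos_update_request(struct pm_qos_request *req, s32 value)
{
	pm_qos_update_target(req->qos, &req->node, PM_QOS_UPDATE_REQ, value);
}

void cpu_response_qos_remove_request(struct pm_qos_request *req)
{
	pm_qos_update_target(req->qos, &req->node, PM_QOS_REMOVE_REQ, PM_QOS_DEFAULT_VALUE);
	memset(req, 0, sizeof(*req));
}

A device driver could then add a request while it knows it is the bottleneck and drop it afterwards, and any cpufreq/cpuidle code interested in the hint would only need the single reader above.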
"Rafael J. Wysocki" <rafael@kernel.org> writes: > On Thu, Feb 13, 2020 at 9:07 AM Francisco Jerez <currojerez@riseup.net> wrote: >> >> "Rafael J. Wysocki" <rafael@kernel.org> writes: >> >> > On Thu, Feb 13, 2020 at 12:31 AM Francisco Jerez <currojerez@riseup.net> wrote: >> >> >> >> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes: >> >> >> >> > Hi All, >> >> > >> >> > This series of patches is based on the observation that after commit >> >> > c3082a674f46 ("PM: QoS: Get rid of unused flags") the only global PM QoS class >> >> > in use is PM_QOS_CPU_DMA_LATENCY, but there is still a significant amount of >> >> > code dedicated to the handling of global PM QoS classes in general. That code >> >> > takes up space and adds overhead in vain, so it is better to get rid of it. >> >> > >> >> > Moreover, with that unuseful code removed, the interface for adding QoS >> >> > requests for CPU latency becomes inelegant and confusing, so it is better to >> >> > clean it up. >> >> > >> >> > Patches [01/28-12/28] do the first part described above, which also includes >> >> > some assorted cleanups of the core PM QoS code that doesn't go away. >> >> > >> >> > Patches [13/28-25/28] rework the CPU latency QoS interface (in the classic >> >> > "define stubs, migrate users, change the API proper" manner), patches >> >> > [26-27/28] update the general comments and documentation to match the code >> >> > after the previous changes and the last one makes the CPU latency QoS depend >> >> > on CPU_IDLE (because cpuidle is the only user of its target value today). >> >> > >> >> > The majority of the patches in this series don't change the functionality of >> >> > the code at all (at least not intentionally). >> >> > >> >> > Please refer to the changelogs of individual patches for details. >> >> > >> >> > Thanks! >> >> >> >> Hi Rafael, >> >> >> >> I believe some of the interfaces removed here could be useful in the >> >> near future. >> > >> > I disagree. >> > >> >> It goes back to the energy efficiency- (and IGP graphics >> >> performance-)improving series I submitted a while ago [1]. It relies on >> >> some mechanism for the graphics driver to report an I/O bottleneck to >> >> CPUFREQ, allowing it to make a more conservative trade-off between >> >> energy efficiency and latency, which can greatly reduce the CPU package >> >> energy usage of IO-bound applications (in some graphics benchmarks I've >> >> seen it reduced by over 40% on my ICL laptop), and therefore also allows >> >> TDP-bound applications to obtain a reciprocal improvement in throughput. >> >> >> >> I'm not particularly fond of the global PM QoS interfaces TBH, it seems >> >> like an excessively blunt hammer to me, so I can very much relate to the >> >> purpose of this series. However the finer-grained solution I've >> >> implemented has seen some push-back from i915 and CPUFREQ devs due to >> >> its complexity, since it relies on task scheduler changes in order to >> >> track IO bottlenecks per-process (roughly as suggested by Peter Zijlstra >> >> during our previous discussions), pretty much in the spirit of PELT but >> >> applied to IO utilization. 
>> >> >> >> With that in mind I was hoping we could take advantage of PM QoS as a >> >> temporary solution [2], by introducing a global PM QoS class similar but >> >> with roughly converse semantics to PM_QOS_CPU_DMA_LATENCY, allowing >> >> device drivers to report a *lower* bound on CPU latency beyond which PM >> >> shall not bother to reduce latency if doing so would have negative >> >> consequences on the energy efficiency and/or parallelism of the system. >> > >> > So I really don't quite see how that could be responded to, by cpuidle >> > say. What exactly do you mean by "reducing latency" in particular? >> > >> >> cpuidle wouldn't necessarily have to do anything about it since it would >> be intended merely as a hint that a device in the system other than the >> CPU has a bottleneck. It could provide a lower bound for the wake-up >> latency of the idle states that may be considered by cpuidle. It seems >> to me like it could be useful when a program can tell from the >> characteristics of the workload that a latency reduction below a certain >> time bound wouldn't materially affect the performance of the system >> (e.g. if you have 20 ms to render a GPU-bound frame, you may not care at >> all about the CPU taking a fraction of a millisecond more to wake up a >> few times each frame). > > Well, this is not how cpuidle works. > > What it does is to try to find the deepest idle state that makes sense > to let the CPU go into given all of the constraints etc. IOW it never > tries to reduce the latency, it looks how far it can go with possible > energy savings given a specific latency limit (or no limit at all). > I didn't mean to say that cpuidle reduces latency except in relative terms: If a sleep state is available but has too high exit latency to be used under the current load conditions, an explicit hint from the application or device driver saying "I'm okay with a wake-up+ramp-up latency of the order of X nanoseconds" might allow it to do a better job than any heuristic decision implemented in the idle governor. >> For cpufreq I was planning to have it influence a time parameter of the >> utilization averaging done by the governor, which would allow it to have >> a more optimal response in the long term (in the sense of lowering the >> energy cost of performing the same work in the specified timeframe), >> even if such a large time parameter wouldn't normally be considered >> appropriate for utilization averaging due to latency concerns. > > So this is fine in the schedutil case in principle, it but would not > work with HWP, because that doesn't take the scheduler's utilization > metrics into account. > The code I've been working on lately targets HWP platforms specifically, but I've gotten it to work on non-HWP too with some minor changes in the governor. The same kernel interfaces should work whether the CPUFREQ governor is delegating frequency selection to HWP or doing it directly. > To cover the HWP case you need to influence the min and max frequency > limits, realistically. > Indeed, the constraint I was planning to introduce eventually influences the calculation of the HWP min/max frequencies in order to make sure that the P-code ends up selecting a reasonably optimal frequency, without fully removing it out of the picture, it's simply meant to assist its decisions whenever the applications running on that CPU core have a non-CPU bottleneck known to the kernel. 
>> >> Of course one would expect the current PM_QOS_CPU_DMA_LATENCY upper >> >> bound to take precedence over the new lower bound in cases where the >> >> former is in conflict with the latter. >> > >> > So that needs to be done on top of this series. >> > >> >> I can think of several alternatives to that which don't involve >> >> temporarily holding off your clean-up, >> > >> > The cleanup goes in. Please work on top of it. >> > >> >> Hopefully we can come up with an alternative in that case. TBH I'd love >> to see your clean-up go in too, but global PM QoS seemed fairly >> appealing as a way to split up my work so it could be reviewed >> incrementally, even though I'm aiming for a finer-grained solution than >> that. > > Well, so "global PM QoS" really means a struct struct > pm_qos_constraints object with a global reader of its target_value. > > Of course, pm_qos_update_target() is not particularly convenient to > use, so you'd need to wrap it into an _add/update/remove_request() > family of functions along the lines of the cpu_latency_qos_*() ones I > suppose and you won't need the _apply() thing. > Yeah, sounds about right. >> >> but none of them sound particularly exciting: >> >> >> >> 1/ Use an interface specific to CPUFREQ, pretty much like the one >> >> introduced in my original submission [1]. >> > >> > It uses frequency QoS already today, do you really need something else? >> > >> >> Yes. I don't see how frequency QoS could be useful for this as-is, >> unless we're willing to introduce code in every device driver that takes >> advantage of this and have them monitor the utilization of every CPU in >> the system, so they can calculate an appropriate max frequency >> constraint -- One which we can be reasonably certain won't hurt the >> long-term performance of the CPU cores these constraints are being >> placed on. > > I'm not really sure if I understand you correctly. > > The frequency QoS in cpufreq is a way to influence the min and max > freq limits used by it for each CPU. That is done in a couple of > places like store_max/min_perf_pct() in intel_pstate or > processor_set_cur_state() (I guess the latter would be close to what > you think about, but the other way around - you seem to want to > influence the min and not the max). > I do want to influence the max frequency primarily. > Now, the question what request value(s) to put in there and how to > compute them is kind of a different one. > And the question of what frequency request to put in there is the really tricky one IMO, because it requires every user of this interface to monitor CPU performance counters in order to guess what an appropriate frequency constraint is (i.e. one which won't interfere with the work of other applications and that won't cause the bottleneck of the same application to shift from the IO device to the CPU). That's why shifting from a frequency constraint to a response latency constraint seems valuable to me: Even though the optimal CPU frequency constraint is highly variable in time (based on the instantaneous balance between CPU and IO load), the optimal latency constraint is approximately constant for any given workload as long as it continues to be IO-bound (since the greatest acceptable latency constraint might be a simple function of the monitor refresh rate, network protocol constraints, IO device latency, etc.). 
>> >> 2/ Use per-CPU PM QoS, which AFAICT would require the graphics driver >> >> to either place a request on every CPU of the system (which would >> >> cause a frequent operation to have O(N) complexity on the number of >> >> CPUs on the system), or play a cat-and-mouse game with the task >> >> scheduler. >> > >> > That's in place already too in the form of device PM QoS; see >> > drivers/base/power/qos.c. >> >> But wouldn't that have the drawbacks I was talking about above when >> trying to use it in order to set this kind of constraints on CPU power >> management? > > I guess so, but the alternatives have drawbacks too. > >> > >> >> 3/ Add a new global PM QoS mechanism roughly duplicating the >> >> cpu_latency_qos_* interfaces introduced in this series. Drop your >> >> change making this available to CPU IDLE only. >> > >> > It sounds like you really want performance for energy efficiency and >> > CPU latency has a little to do with that. >> > >> >> The mechanism I've been working on isn't intended to sacrifice long-term >> performance of the CPU (e.g. if a CPU core is 100% utilized in the >> steady state by the same or an unrelated application the CPUFREQ >> governor should still request the maximum turbo frequency for it), it's >> only meant to affect the trade-off between energy efficiency and latency >> (e.g. the time it takes for the CPUFREQ governor to respond to an >> oscillation of the workload that chooses to opt in). > > So the meaning of "latency" here is really different from the meaning > of "latency" in the cpuidle context and in RT. > > I guess it would be better to call it "response time" in this case to > avoid confusion. I'm fine with calling this time parameter response time instead. What is going on under the hood is indeed somewhat different from the cpuidle case, but the interpretation is closely related: Latency it takes for the CPU to reach some nominal (e.g. maximum) level of performance after wake-up in response to a step-function utilization. > > Let me ask if I understand you correctly: the problem is that for some > workloads the time it takes to ramp up the frequency to an acceptable > (or desirable, more generally) level is too high, so the approach > under consideration is to clamp the min frequency, either effectively > or directly, so as to reduce that time? > Nope, the problem is precisely the opposite: PM is responding too quickly to transient oscillations of the CPU load, even though the actual latency requirements of the workload are far less stringent, leading to energy-inefficient behavior which severely reduces the throughput of the system under TDP-bound conditions. >> >> 3/ Go straight to a scheduling-based approach, which is likely to >> >> greatly increase the review effort required to upstream this >> >> feature. (Peter might disagree though?) >> > >> > Are you familiar with the utilization clamps mechanism? >> > >> >> Sure, that would be a possibility as alternative to PM QoS, but it would >> most likely involve task scheduler changes to get it working >> effectively, which Srinivas and Rodrigo have asked me to leave out from >> my next RFC submission in the interest of reviewability. I wouldn't >> mind plumbing comparable information through utilization clamps instead >> or as follow-up if you think that's the way forward. > > Well, like I said somewhere above (or previously), the main problem > with utilization clamps is that they have no effect on HWP at the > moment. 
Currently, there is a little connection between the scheduler > and the HWP algorithm running on the processor. However, I would like > to investigate that path, because the utilization clamps provide a > good interface for applications to request a certain level of service > from the scheduler (they can really be regarded as a QoS mechanism > too) and connecting them to the HWP min and max limits somehow might > work. > Yeah, it would be really nice to have the utilization clamps influence the HWP P-state range. That said I think the most straightforward way to achieve this via utilization clamps would be to add a third "response latency" clamp which defaults to infinity (if the application doesn't care to set a latency requirement) and is aggregated across tasks queued to the same RQ by taking the minimum value (so the most stringent latency request is honored). It may be technically possible to implement this based on the MAX clamp alone, but it would have similar or worse drawbacks than the per-CPU frequency QoS alternative we were discussing earlier: In order to avoid hurting the performance of the application, each bottlenecking device driver would have to periodically monitor the CPU utilization of every thread of every process talking to the device, and periodically adjust their MAX utilization clamps in order to adapt to fluctuations of the balance between CPU and IO load. That's O(n*f) run-time overhead on the number of threads and utilization sampling frequency. In comparison a latency constraint would be pretty much a fire-and-forget. Or it might be possible but likely as controversial to put all processes talking to the same device under a single cgroup in order to manage them with a single clamp -- Except I don't think that would easily scale to multiple devices. > Thanks! Thanks for your feedback!
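For illustration only, the aggregation rule proposed above could be written as follows; none of these structures exist in the kernel, they just encode "per-task response-latency tolerance, defaulting to unlimited, reduced to the per-runqueue minimum so the most latency-sensitive task wins":

#include <stdint.h>

#define RESPONSE_LATENCY_UNLIMITED INT32_MAX	/* default: no requirement */

struct demo_task {
	int32_t response_latency_us;	/* hypothetical opt-in per-task clamp */
};

/* The value a frequency governor would see for one runqueue. */
static int32_t demo_rq_response_latency(const struct demo_task *tasks, int nr)
{
	int32_t min = RESPONSE_LATENCY_UNLIMITED;

	for (int i = 0; i < nr; i++)
		if (tasks[i].response_latency_us < min)
			min = tasks[i].response_latency_us;

	return min;
}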
"Rafael J. Wysocki" <rafael@kernel.org> writes: > On Thu, Feb 13, 2020 at 12:34 PM Rafael J. Wysocki <rafael@kernel.org> wrote: >> >> On Thu, Feb 13, 2020 at 9:07 AM Francisco Jerez <currojerez@riseup.net> wrote: >> > >> > "Rafael J. Wysocki" <rafael@kernel.org> writes: >> > >> > > On Thu, Feb 13, 2020 at 12:31 AM Francisco Jerez <currojerez@riseup.net> wrote: >> > >> >> > >> "Rafael J. Wysocki" <rjw@rjwysocki.net> writes: >> > >> >> > >> > Hi All, >> > >> > >> > >> > This series of patches is based on the observation that after commit >> > >> > c3082a674f46 ("PM: QoS: Get rid of unused flags") the only global PM QoS class >> > >> > in use is PM_QOS_CPU_DMA_LATENCY, but there is still a significant amount of >> > >> > code dedicated to the handling of global PM QoS classes in general. That code >> > >> > takes up space and adds overhead in vain, so it is better to get rid of it. >> > >> > >> > >> > Moreover, with that unuseful code removed, the interface for adding QoS >> > >> > requests for CPU latency becomes inelegant and confusing, so it is better to >> > >> > clean it up. >> > >> > >> > >> > Patches [01/28-12/28] do the first part described above, which also includes >> > >> > some assorted cleanups of the core PM QoS code that doesn't go away. >> > >> > >> > >> > Patches [13/28-25/28] rework the CPU latency QoS interface (in the classic >> > >> > "define stubs, migrate users, change the API proper" manner), patches >> > >> > [26-27/28] update the general comments and documentation to match the code >> > >> > after the previous changes and the last one makes the CPU latency QoS depend >> > >> > on CPU_IDLE (because cpuidle is the only user of its target value today). >> > >> > >> > >> > The majority of the patches in this series don't change the functionality of >> > >> > the code at all (at least not intentionally). >> > >> > >> > >> > Please refer to the changelogs of individual patches for details. >> > >> > >> > >> > Thanks! >> > >> >> > >> Hi Rafael, >> > >> >> > >> I believe some of the interfaces removed here could be useful in the >> > >> near future. >> > > >> > > I disagree. >> > > >> > >> It goes back to the energy efficiency- (and IGP graphics >> > >> performance-)improving series I submitted a while ago [1]. It relies on >> > >> some mechanism for the graphics driver to report an I/O bottleneck to >> > >> CPUFREQ, allowing it to make a more conservative trade-off between >> > >> energy efficiency and latency, which can greatly reduce the CPU package >> > >> energy usage of IO-bound applications (in some graphics benchmarks I've >> > >> seen it reduced by over 40% on my ICL laptop), and therefore also allows >> > >> TDP-bound applications to obtain a reciprocal improvement in throughput. >> > >> >> > >> I'm not particularly fond of the global PM QoS interfaces TBH, it seems >> > >> like an excessively blunt hammer to me, so I can very much relate to the >> > >> purpose of this series. However the finer-grained solution I've >> > >> implemented has seen some push-back from i915 and CPUFREQ devs due to >> > >> its complexity, since it relies on task scheduler changes in order to >> > >> track IO bottlenecks per-process (roughly as suggested by Peter Zijlstra >> > >> during our previous discussions), pretty much in the spirit of PELT but >> > >> applied to IO utilization. 
>> > >> >> > >> With that in mind I was hoping we could take advantage of PM QoS as a >> > >> temporary solution [2], by introducing a global PM QoS class similar but >> > >> with roughly converse semantics to PM_QOS_CPU_DMA_LATENCY, allowing >> > >> device drivers to report a *lower* bound on CPU latency beyond which PM >> > >> shall not bother to reduce latency if doing so would have negative >> > >> consequences on the energy efficiency and/or parallelism of the system. >> > > >> > > So I really don't quite see how that could be responded to, by cpuidle >> > > say. What exactly do you mean by "reducing latency" in particular? >> > > >> > >> > cpuidle wouldn't necessarily have to do anything about it since it would >> > be intended merely as a hint that a device in the system other than the >> > CPU has a bottleneck. It could provide a lower bound for the wake-up >> > latency of the idle states that may be considered by cpuidle. It seems >> > to me like it could be useful when a program can tell from the >> > characteristics of the workload that a latency reduction below a certain >> > time bound wouldn't materially affect the performance of the system >> > (e.g. if you have 20 ms to render a GPU-bound frame, you may not care at >> > all about the CPU taking a fraction of a millisecond more to wake up a >> > few times each frame). >> >> Well, this is not how cpuidle works. >> >> What it does is to try to find the deepest idle state that makes sense >> to let the CPU go into given all of the constraints etc. IOW it never >> tries to reduce the latency, it looks how far it can go with possible >> energy savings given a specific latency limit (or no limit at all). >> >> > For cpufreq I was planning to have it influence a time parameter of the >> > utilization averaging done by the governor, which would allow it to have >> > a more optimal response in the long term (in the sense of lowering the >> > energy cost of performing the same work in the specified timeframe), >> > even if such a large time parameter wouldn't normally be considered >> > appropriate for utilization averaging due to latency concerns. >> >> So this is fine in the schedutil case in principle, it but would not >> work with HWP, because that doesn't take the scheduler's utilization >> metrics into account. >> >> To cover the HWP case you need to influence the min and max frequency >> limits, realistically. >> >> > >> Of course one would expect the current PM_QOS_CPU_DMA_LATENCY upper >> > >> bound to take precedence over the new lower bound in cases where the >> > >> former is in conflict with the latter. >> > > >> > > So that needs to be done on top of this series. >> > > >> > >> I can think of several alternatives to that which don't involve >> > >> temporarily holding off your clean-up, >> > > >> > > The cleanup goes in. Please work on top of it. >> > > >> > >> > Hopefully we can come up with an alternative in that case. TBH I'd love >> > to see your clean-up go in too, but global PM QoS seemed fairly >> > appealing as a way to split up my work so it could be reviewed >> > incrementally, even though I'm aiming for a finer-grained solution than >> > that. >> >> Well, so "global PM QoS" really means a struct struct >> pm_qos_constraints object with a global reader of its target_value. 
>> >> Of course, pm_qos_update_target() is not particularly convenient to >> use, so you'd need to wrap it into an _add/update/remove_request() >> family of functions along the lines of the cpu_latency_qos_*() ones I >> suppose and you won't need the _apply() thing. >> >> > >> but none of them sound particularly exciting: >> > >> >> > >> 1/ Use an interface specific to CPUFREQ, pretty much like the one >> > >> introduced in my original submission [1]. >> > > >> > > It uses frequency QoS already today, do you really need something else? >> > > >> > >> > Yes. I don't see how frequency QoS could be useful for this as-is, >> > unless we're willing to introduce code in every device driver that takes >> > advantage of this and have them monitor the utilization of every CPU in >> > the system, so they can calculate an appropriate max frequency >> > constraint -- One which we can be reasonably certain won't hurt the >> > long-term performance of the CPU cores these constraints are being >> > placed on. >> >> I'm not really sure if I understand you correctly. >> >> The frequency QoS in cpufreq is a way to influence the min and max >> freq limits used by it for each CPU. That is done in a couple of >> places like store_max/min_perf_pct() in intel_pstate or >> processor_set_cur_state() (I guess the latter would be close to what >> you think about, but the other way around - you seem to want to >> influence the min and not the max). > > It looks like *I* got this part the other way around. :-/ > > I think that your use case is almost equivalent to the thermal > pressure one, so you'd want to limit the max and so that would be > something similar to store_max_perf_pct() with its input side hooked > up to a QoS list. > > But it looks like that QoS list would rather be of a "reservation" > type, so a request added to it would mean something like "leave this > fraction of power that appears to be available to the CPU subsystem > unused, because I need it for a different purpose". And in principle > there might be multiple requests in there at the same time and those > "reservations" would add up. So that would be a kind of "limited sum" > QoS type which wasn't even there before my changes. > > A user of that QoS list might then do something like > > ret = cpu_power_reserve_add(1, 4); > > meaning that it wants 25% of the "potential" CPU power to be not > utilized by CPU performance scaling and that could affect the > scheduler through load modifications (kind of along the thermal > pressure patchset discussed some time ago) and HWP (as well as the > non-HWP intel_pstate by preventing turbo frequencies from being used > etc). The problems with this are the same as with the per-CPU frequency QoS approach: How does the device driver know what the appropriate fraction of CPU power is? Depending on the instantaneous behavior of the workload it might take 1% or 95% of the CPU power in order to keep the IO device busy. Each user of this would need to monitor the performance of every CPU in the system and update the constraints on each of them periodically (whether or not they're talking to that IO device, which would possibly negatively impact the latency of unrelated applications running on other CPUs, unless we're willing to race with the task scheduler). A solution based on utilization clamps (with some extensions) sounds more future-proof to me honestly.
On Fri, Feb 14, 2020 at 1:14 AM Francisco Jerez <currojerez@riseup.net> wrote: > > "Rafael J. Wysocki" <rafael@kernel.org> writes: > > > On Thu, Feb 13, 2020 at 12:34 PM Rafael J. Wysocki <rafael@kernel.org> wrote: [cut] > > > > I think that your use case is almost equivalent to the thermal > > pressure one, so you'd want to limit the max and so that would be > > something similar to store_max_perf_pct() with its input side hooked > > up to a QoS list. > > > > But it looks like that QoS list would rather be of a "reservation" > > type, so a request added to it would mean something like "leave this > > fraction of power that appears to be available to the CPU subsystem > > unused, because I need it for a different purpose". And in principle > > there might be multiple requests in there at the same time and those > > "reservations" would add up. So that would be a kind of "limited sum" > > QoS type which wasn't even there before my changes. > > > > A user of that QoS list might then do something like > > > > ret = cpu_power_reserve_add(1, 4); > > > > meaning that it wants 25% of the "potential" CPU power to be not > > utilized by CPU performance scaling and that could affect the > > scheduler through load modifications (kind of along the thermal > > pressure patchset discussed some time ago) and HWP (as well as the > > non-HWP intel_pstate by preventing turbo frequencies from being used > > etc). > > The problems with this are the same as with the per-CPU frequency QoS > approach: How does the device driver know what the appropriate fraction > of CPU power is? Of course it doesn't know and it may never know exactly, but it may guess. Also, it may set up a feedback loop: request an aggressive reservation, run for a while, measure something and refine if there's headroom. Then repeat. > Depending on the instantaneous behavior of the > workload it might take 1% or 95% of the CPU power in order to keep the > IO device busy. Each user of this would need to monitor the performance > of every CPU in the system and update the constraints on each of them > periodically (whether or not they're talking to that IO device, which > would possibly negatively impact the latency of unrelated applications > running on other CPUs, unless we're willing to race with the task > scheduler). No, it just needs to measure a signal representing how much power *it* gets and decide whether or not it can let the CPU subsystem use more power. > A solution based on utilization clamps (with some > extensions) sounds more future-proof to me honestly. Except that it would be rather hard to connect it to something like RAPL, which should be quite straightforward with the approach I'm talking about. The problem with all scheduler-based ways, again, is that there is no direct connection between the scheduler and HWP, or even with whatever the processor does with the P-states in the turbo range. If any P-state in the turbo range is requested, the processor has a license to use whatever P-state it wants, so this pretty much means allowing it to use as much power as it can. So in the first place, if you want to limit the use of power in the CPU subsystem through frequency control alone, you need to prevent it from using turbo P-states at all. However, with RAPL you can just limit power which may still allow some (but not all) turbo P-states to be used.
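As a hedged sketch of the feedback loop described in the message above (not code from any posted series): every gpu_* and cpu_power_reserve_* name below is hypothetical, with cpu_power_reserve_update() standing in for the numerator/denominator style interface suggested earlier, and only the workqueue helpers are real kernel APIs. The driver reserves CPU power aggressively, measures the performance signal it cares about, and refines the reservation periodically:

#define RESERVE_STEP_PCT	5
#define RESERVE_MAX_PCT		75
#define RESERVE_HEADROOM_PCT	5
#define BALANCE_PERIOD_MS	100

static void gpu_power_balance_work(struct work_struct *work)
{
	struct gpu_dev *gpu = container_of(work, struct gpu_dev, balance_work.work);
	unsigned int perf = gpu_measure_busy_pct(gpu);	/* hypothetical signal */

	if (perf < gpu->target_busy_pct && gpu->reserve_pct < RESERVE_MAX_PCT)
		gpu->reserve_pct += RESERVE_STEP_PCT;	/* starved: take power from CPUs */
	else if (perf > gpu->target_busy_pct + RESERVE_HEADROOM_PCT && gpu->reserve_pct > 0)
		gpu->reserve_pct -= RESERVE_STEP_PCT;	/* headroom: give some back */

	/* hypothetical counterpart of the cpu_power_reserve_add(num, den) idea */
	cpu_power_reserve_update(&gpu->reserve_req, gpu->reserve_pct, 100);

	schedule_delayed_work(&gpu->balance_work, msecs_to_jiffies(BALANCE_PERIOD_MS));
}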
"Rafael J. Wysocki" <rafael@kernel.org> writes: > On Fri, Feb 14, 2020 at 1:14 AM Francisco Jerez <currojerez@riseup.net> wrote: >> >> "Rafael J. Wysocki" <rafael@kernel.org> writes: >> >> > On Thu, Feb 13, 2020 at 12:34 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > [cut] > >> > >> > I think that your use case is almost equivalent to the thermal >> > pressure one, so you'd want to limit the max and so that would be >> > something similar to store_max_perf_pct() with its input side hooked >> > up to a QoS list. >> > >> > But it looks like that QoS list would rather be of a "reservation" >> > type, so a request added to it would mean something like "leave this >> > fraction of power that appears to be available to the CPU subsystem >> > unused, because I need it for a different purpose". And in principle >> > there might be multiple requests in there at the same time and those >> > "reservations" would add up. So that would be a kind of "limited sum" >> > QoS type which wasn't even there before my changes. >> > >> > A user of that QoS list might then do something like >> > >> > ret = cpu_power_reserve_add(1, 4); >> > >> > meaning that it wants 25% of the "potential" CPU power to be not >> > utilized by CPU performance scaling and that could affect the >> > scheduler through load modifications (kind of along the thermal >> > pressure patchset discussed some time ago) and HWP (as well as the >> > non-HWP intel_pstate by preventing turbo frequencies from being used >> > etc). >> >> The problems with this are the same as with the per-CPU frequency QoS >> approach: How does the device driver know what the appropriate fraction >> of CPU power is? > > Of course it doesn't know and it may never know exactly, but it may guess. > > Also, it may set up a feedback loop: request an aggressive > reservation, run for a while, measure something and refine if there's > headroom. Then repeat. > Yeah, of course, but that's obviously more computationally intensive and less accurate than computing an approximately optimal constraint in a single iteration (based on knowledge from performance counters and a notion of the latency requirements of the application), since such a feedback loop relies on repeatedly overshooting and undershooting the optimal value (the latter causes an artificial CPU bottleneck, possibly slowing down other applications too) in order to converge to and remain in a neighborhood of the optimal value. Incidentally people tested a power balancing solution with a feedback loop very similar to the one you're describing side by side to the RFC patch series I provided a link to earlier (which targeted Gen9 LP parts), and the energy efficiency improvements they observed were roughly half of the improvement obtained with my series unsurprisingly. Not to speak about generalizing such a feedback loop to bottlenecks on multiple I/O devices. >> Depending on the instantaneous behavior of the >> workload it might take 1% or 95% of the CPU power in order to keep the >> IO device busy. Each user of this would need to monitor the performance >> of every CPU in the system and update the constraints on each of them >> periodically (whether or not they're talking to that IO device, which >> would possibly negatively impact the latency of unrelated applications >> running on other CPUs, unless we're willing to race with the task >> scheduler). > > No, it just needs to measure a signal representing how much power *it* > gets and decide whether or not it can let the CPU subsystem use more > power. 
> Well yes it's technically possible to set frequency constraints based on trial-and-error without sampling utilization information from the CPU cores, but don't we agree that this kind of information can be highly valuable? >> A solution based on utilization clamps (with some >> extensions) sounds more future-proof to me honestly. > > Except that it would be rather hard to connect it to something like > RAPL, which should be quite straightforward with the approach I'm > talking about. > I think using RAPL as additional control variable would be useful, but fully orthogonal to the cap being set by some global mechanism or being derived from the aggregation of a number of per-process power caps based on the scheduler behavior. The latter sounds like the more reasonable fit for a multi-tasking, possibly virtualized environment honestly. Either way RAPL is neither necessary nor sufficient in order to achieve the energy efficiency improvement I'm working on. > The problem with all scheduler-based ways, again, is that there is no > direct connection between the scheduler and HWP, I was planning to introduce such a connection in RFC part 2. I have a prototype for that based on a not particularly pretty custom interface, I wouldn't mind trying to get it to use utilization clamps if you think that's the way forward. > or even with whatever the processor does with the P-states in the > turbo range. If any P-state in the turbo range is requested, the > processor has a license to use whatever P-state it wants, so this > pretty much means allowing it to use as much power as it can. > > So in the first place, if you want to limit the use of power in the > CPU subsystem through frequency control alone, you need to prevent it > from using turbo P-states at all. However, with RAPL you can just > limit power which may still allow some (but not all) turbo P-states to > be used. My goal is not to limit the use of power of the CPU (if it has enough load to utilize 100% of the cycles at turbo frequency so be it), but to get it to use it more efficiently. If you are constrained by a given power budget (e.g. the TDP or the one you want set via RAPL) you can do more with it if you set a stable frequency rather than if you let the CPU bounce back and forth between turbo and idle. This can only be achieved effectively if the frequency governor has a rough idea of the latency requirements of the workload, since it involves a latency/energy-efficiency trade-off.
"Rafael J. Wysocki" <rafael@kernel.org> writes: > On Thu, Feb 13, 2020 at 9:09 AM Francisco Jerez <currojerez@riseup.net> wrote: >> >> "Rafael J. Wysocki" <rafael@kernel.org> writes: >> >> > On Thu, Feb 13, 2020 at 1:16 AM Rafael J. Wysocki <rafael@kernel.org> wrote: >> >> >> >> On Thu, Feb 13, 2020 at 12:31 AM Francisco Jerez <currojerez@riseup.net> wrote: >> >> > > > [cut] > >> > >> > And BTW, posting patches as RFC is fine even if they have not been >> > tested. At least you let people know that you work on something this >> > way, so if they work on changes in the same area, they may take that >> > into consideration. >> > >> >> Sure, that was going to be the first RFC. >> >> > Also if there are objections to your proposal, you may save quite a >> > bit of time by sending it early. >> > >> > It is unfortunate that this series has clashed with the changes that >> > you were about to propose, but in this particular case in my view it >> > is better to clean up things and start over. >> > >> >> Luckily it doesn't clash with the second RFC I was meaning to send, >> maybe we should just skip the first? > > Yes, please. > >> Or maybe it's valuable as a curiosity anyway? > > No, let's just focus on the latest one. > > Thanks! We don't seem to have reached much of an agreement on the general direction of RFC2, so I can't really get started with it. Here is RFC1 for the record: https://github.com/curro/linux/commits/intel_pstate-lp-hwp-v10.8-alt Specifically the following patch conflicts with this series: https://github.com/curro/linux/commit/9a16f35531bbb76d38493da892ece088e31dc2e0 Series improves performance-per-watt of GfxBench gl_4 (AKA Car Chase) by over 15% on my system with the branch above, actual FPS "only" improves about 5.9% on ICL laptop due to it being very lightly TDP-bound with its rather huge TDP. The performance of almost every graphics benchmark I've tried improves significantly with it (a number of SynMark test-cases are improved by around 40% in perf-per-watt, Egypt perf-per-watt improves by about 25%). Hopefully we can come up with some alternative plan of action.
On Fri, Feb 21, 2020 at 11:10 PM Francisco Jerez <currojerez@riseup.net> wrote: > > "Rafael J. Wysocki" <rafael@kernel.org> writes: > > > On Thu, Feb 13, 2020 at 9:09 AM Francisco Jerez <currojerez@riseup.net> wrote: > >> > >> "Rafael J. Wysocki" <rafael@kernel.org> writes: > >> > >> > On Thu, Feb 13, 2020 at 1:16 AM Rafael J. Wysocki <rafael@kernel.org> wrote: > >> >> > >> >> On Thu, Feb 13, 2020 at 12:31 AM Francisco Jerez <currojerez@riseup.net> wrote: > >> >> > > > > > [cut] > > > >> > > >> > And BTW, posting patches as RFC is fine even if they have not been > >> > tested. At least you let people know that you work on something this > >> > way, so if they work on changes in the same area, they may take that > >> > into consideration. > >> > > >> > >> Sure, that was going to be the first RFC. > >> > >> > Also if there are objections to your proposal, you may save quite a > >> > bit of time by sending it early. > >> > > >> > It is unfortunate that this series has clashed with the changes that > >> > you were about to propose, but in this particular case in my view it > >> > is better to clean up things and start over. > >> > > >> > >> Luckily it doesn't clash with the second RFC I was meaning to send, > >> maybe we should just skip the first? > > > > Yes, please. > > > >> Or maybe it's valuable as a curiosity anyway? > > > > No, let's just focus on the latest one. > > > > Thanks! > > We don't seem to have reached much of an agreement on the general > direction of RFC2, so I can't really get started with it. Here is RFC1 > for the record: > > https://github.com/curro/linux/commits/intel_pstate-lp-hwp-v10.8-alt Appreciate the link, but that hasn't been posted to linux-pm yet, so there's not much to discuss. And when you post it, please rebase it on top of linux-next. > Specifically the following patch conflicts with this series: > > https://github.com/curro/linux/commit/9a16f35531bbb76d38493da892ece088e31dc2e0 > > Series improves performance-per-watt of GfxBench gl_4 (AKA Car Chase) by > over 15% on my system with the branch above, actual FPS "only" improves > about 5.9% on ICL laptop due to it being very lightly TDP-bound with its > rather huge TDP. The performance of almost every graphics benchmark > I've tried improves significantly with it (a number of SynMark > test-cases are improved by around 40% in perf-per-watt, Egypt > perf-per-watt improves by about 25%). > > Hopefully we can come up with some alternative plan of action. It is very easy to replace the patch above with an alternative one on top of linux-next that will add CPU_RESPONSE_FREQUENCY QoS along the lines of the CPU latency QoS implementation in there without the need restore to global QoS classes. IOW, you don't really need the code that goes away in linux-next to implement what you need. Thanks!
Sorry for the late response, I was offline for a major part of the previous week. On Fri, Feb 14, 2020 at 9:31 PM Francisco Jerez <currojerez@riseup.net> wrote: > > "Rafael J. Wysocki" <rafael@kernel.org> writes: > > > On Fri, Feb 14, 2020 at 1:14 AM Francisco Jerez <currojerez@riseup.net> wrote: > >> > >> "Rafael J. Wysocki" <rafael@kernel.org> writes: > >> > >> > On Thu, Feb 13, 2020 at 12:34 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > [cut] > > > >> > > >> > I think that your use case is almost equivalent to the thermal > >> > pressure one, so you'd want to limit the max and so that would be > >> > something similar to store_max_perf_pct() with its input side hooked > >> > up to a QoS list. > >> > > >> > But it looks like that QoS list would rather be of a "reservation" > >> > type, so a request added to it would mean something like "leave this > >> > fraction of power that appears to be available to the CPU subsystem > >> > unused, because I need it for a different purpose". And in principle > >> > there might be multiple requests in there at the same time and those > >> > "reservations" would add up. So that would be a kind of "limited sum" > >> > QoS type which wasn't even there before my changes. > >> > > >> > A user of that QoS list might then do something like > >> > > >> > ret = cpu_power_reserve_add(1, 4); > >> > > >> > meaning that it wants 25% of the "potential" CPU power to be not > >> > utilized by CPU performance scaling and that could affect the > >> > scheduler through load modifications (kind of along the thermal > >> > pressure patchset discussed some time ago) and HWP (as well as the > >> > non-HWP intel_pstate by preventing turbo frequencies from being used > >> > etc). > >> > >> The problems with this are the same as with the per-CPU frequency QoS > >> approach: How does the device driver know what the appropriate fraction > >> of CPU power is? > > > > Of course it doesn't know and it may never know exactly, but it may guess. > > > > Also, it may set up a feedback loop: request an aggressive > > reservation, run for a while, measure something and refine if there's > > headroom. Then repeat. > > > > Yeah, of course, but that's obviously more computationally intensive and > less accurate than computing an approximately optimal constraint in a > single iteration (based on knowledge from performance counters and a > notion of the latency requirements of the application), since such a > feedback loop relies on repeatedly overshooting and undershooting the > optimal value (the latter causes an artificial CPU bottleneck, possibly > slowing down other applications too) in order to converge to and remain > in a neighborhood of the optimal value. I'm not saying that feedback loops are the way to go in general, but that in some cases they are applicable and this particular case looks like it may be one of them. > Incidentally people tested a power balancing solution with a feedback > loop very similar to the one you're describing side by side to the RFC > patch series I provided a link to earlier (which targeted Gen9 LP > parts), and the energy efficiency improvements they observed were > roughly half of the improvement obtained with my series unsurprisingly. > > Not to speak about generalizing such a feedback loop to bottlenecks on > multiple I/O devices. The generalizing part I'm totally unconvinced above. > >> Depending on the instantaneous behavior of the > >> workload it might take 1% or 95% of the CPU power in order to keep the > >> IO device busy. 
Each user of this would need to monitor the performance > >> of every CPU in the system and update the constraints on each of them > >> periodically (whether or not they're talking to that IO device, which > >> would possibly negatively impact the latency of unrelated applications > >> running on other CPUs, unless we're willing to race with the task > >> scheduler). > > > > No, it just needs to measure a signal representing how much power *it* > > gets and decide whether or not it can let the CPU subsystem use more > > power. > > > > Well yes it's technically possible to set frequency constraints based on > trial-and-error without sampling utilization information from the CPU > cores, but don't we agree that this kind of information can be highly > valuable? OK, so there are three things, frequency constraints (meaning HWP min and max limits, for example), frequency requests (this is what cpufreq does) and power limits. If the processor has at least some autonomy in driving the frequency, using frequency requests (i.e. cpufreq governors) for limiting power is inefficient in general, because the processor is not required to grant those requests at all. Using frequency limits may be good enough, but it generally limits the processor's ability to respond at short-time scales (for example, setting the max frequency limit will prevent the processor from using frequencies above that limit even temporarily, but that might be the most energy-efficient option in some cases). Using power limits (which is what RAPL does) doesn't bring such shortcomings in. > >> A solution based on utilization clamps (with some > >> extensions) sounds more future-proof to me honestly. > > > > Except that it would be rather hard to connect it to something like > > RAPL, which should be quite straightforward with the approach I'm > > talking about. > > > > I think using RAPL as additional control variable would be useful, but > fully orthogonal to the cap being set by some global mechanism or being > derived from the aggregation of a number of per-process power caps based > on the scheduler behavior. I'm not sure what do you mean by "the cap" here. A maximum frequency limit or something else? > The latter sounds like the more reasonable > fit for a multi-tasking, possibly virtualized environment honestly. > Either way RAPL is neither necessary nor sufficient in order to achieve > the energy efficiency improvement I'm working on. The "not necessary" I can agree with, but I don't see any arguments for the "not sufficient" statement. > > The problem with all scheduler-based ways, again, is that there is no > > direct connection between the scheduler and HWP, > > I was planning to introduce such a connection in RFC part 2. I have a > prototype for that based on a not particularly pretty custom interface, > I wouldn't mind trying to get it to use utilization clamps if you think > that's the way forward. Well, I may think so, but that's just thinking at this point. I have no real numbers to support that theory. > > or even with whatever the processor does with the P-states in the > > turbo range. If any P-state in the turbo range is requested, the > > processor has a license to use whatever P-state it wants, so this > > pretty much means allowing it to use as much power as it can. > > > > So in the first place, if you want to limit the use of power in the > > CPU subsystem through frequency control alone, you need to prevent it > > from using turbo P-states at all. 
However, with RAPL you can just > > limit power which may still allow some (but not all) turbo P-states to > > be used. > > My goal is not to limit the use of power of the CPU (if it has enough > load to utilize 100% of the cycles at turbo frequency so be it), but to > get it to use it more efficiently. If you are constrained by a given > power budget (e.g. the TDP or the one you want set via RAPL) you can do > more with it if you set a stable frequency rather than if you let the > CPU bounce back and forth between turbo and idle. Well, this basically means driving the CPU frequency by hand with the assumption that the processor cannot do the right thing in this respect, while in theory the HWP algorithm should be able to produce the desired result. IOW, your argument seems to go in the "HWP is useless" direction, more or less, and while there are people who will agree with such a statement, others won't. > This can only be > achieved effectively if the frequency governor has a rough idea of the > latency requirements of the workload, since it involves a > latency/energy-efficiency trade-off. Let me state this again (and this will be the last time, because I don't really like to repeat points): the frequency governor can only *request* the processor to do something in general, and the request may or may not be granted, for various reasons. If it is not granted, the whole "control" mechanism fails.
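To make the "request an aggressive reservation, then measure and refine" idea above a bit more concrete, here is a rough, untested sketch of what such a driver-side loop could look like. It is written against the purely hypothetical cpu_power_reserve_*() interface from the example quoted above; the gpu_power_headroom_mw() signal, the 1/16 granularity and the polling period are made-up placeholders as well, not anything that exists in the kernel today.

/*
 * Untested sketch of a driver-side reservation feedback loop.  None of
 * the cpu_power_reserve_*() calls or gpu_power_headroom_mw() exist; they
 * only stand in for the hypothetical "limited sum" QoS interface and for
 * whatever signal the driver uses to tell how much power *it* is getting.
 */
#include <linux/types.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>

int cpu_power_reserve_add(u32 num, u32 den);		/* hypothetical */
int cpu_power_reserve_update(u32 num, u32 den);		/* hypothetical */
int gpu_power_headroom_mw(void);			/* placeholder signal */

#define GPU_POLL_MS		100
#define GPU_HEADROOM_SLACK_MW	500

static u32 gpu_reserve_num;	/* currently reserved fraction, in sixteenths */
static struct delayed_work gpu_balance_work;

static void gpu_balance_fn(struct work_struct *work)
{
	int headroom = gpu_power_headroom_mw();

	if (headroom < 0 && gpu_reserve_num < 15) {
		/* The GPU is power-starved: reserve a bit more. */
		cpu_power_reserve_update(++gpu_reserve_num, 16);
	} else if (headroom > GPU_HEADROOM_SLACK_MW && gpu_reserve_num > 0) {
		/* Plenty of headroom: hand some power back to the CPUs. */
		cpu_power_reserve_update(--gpu_reserve_num, 16);
	}

	schedule_delayed_work(&gpu_balance_work, msecs_to_jiffies(GPU_POLL_MS));
}

static int gpu_power_balance_init(void)
{
	int ret;

	/* Start aggressively, as in the example above: reserve 1/4 (4/16)
	 * of the available CPU power, then let the loop refine it. */
	gpu_reserve_num = 4;
	ret = cpu_power_reserve_add(gpu_reserve_num, 16);
	if (ret)
		return ret;

	INIT_DELAYED_WORK(&gpu_balance_work, gpu_balance_fn);
	schedule_delayed_work(&gpu_balance_work, msecs_to_jiffies(GPU_POLL_MS));
	return 0;
}

The only point of the sketch is the shape of the loop: grow the reservation while the device is power-starved, shrink it when there is slack, and let the "limited sum" aggregation on the QoS side combine the reservations of multiple such users.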
"Rafael J. Wysocki" <rafael@kernel.org> writes: > On Fri, Feb 21, 2020 at 11:10 PM Francisco Jerez <currojerez@riseup.net> wrote: >> >> "Rafael J. Wysocki" <rafael@kernel.org> writes: >> >> > On Thu, Feb 13, 2020 at 9:09 AM Francisco Jerez <currojerez@riseup.net> wrote: >> >> >> >> "Rafael J. Wysocki" <rafael@kernel.org> writes: >> >> >> >> > On Thu, Feb 13, 2020 at 1:16 AM Rafael J. Wysocki <rafael@kernel.org> wrote: >> >> >> >> >> >> On Thu, Feb 13, 2020 at 12:31 AM Francisco Jerez <currojerez@riseup.net> wrote: >> >> >> > >> > >> > [cut] >> > >> >> > >> >> > And BTW, posting patches as RFC is fine even if they have not been >> >> > tested. At least you let people know that you work on something this >> >> > way, so if they work on changes in the same area, they may take that >> >> > into consideration. >> >> > >> >> >> >> Sure, that was going to be the first RFC. >> >> >> >> > Also if there are objections to your proposal, you may save quite a >> >> > bit of time by sending it early. >> >> > >> >> > It is unfortunate that this series has clashed with the changes that >> >> > you were about to propose, but in this particular case in my view it >> >> > is better to clean up things and start over. >> >> > >> >> >> >> Luckily it doesn't clash with the second RFC I was meaning to send, >> >> maybe we should just skip the first? >> > >> > Yes, please. >> > >> >> Or maybe it's valuable as a curiosity anyway? >> > >> > No, let's just focus on the latest one. >> > >> > Thanks! >> >> We don't seem to have reached much of an agreement on the general >> direction of RFC2, so I can't really get started with it. Here is RFC1 >> for the record: >> >> https://github.com/curro/linux/commits/intel_pstate-lp-hwp-v10.8-alt > > Appreciate the link, but that hasn't been posted to linux-pm yet, so > there's not much to discuss. > > And when you post it, please rebase it on top of linux-next. > >> Specifically the following patch conflicts with this series: >> >> https://github.com/curro/linux/commit/9a16f35531bbb76d38493da892ece088e31dc2e0 >> >> Series improves performance-per-watt of GfxBench gl_4 (AKA Car Chase) by >> over 15% on my system with the branch above, actual FPS "only" improves >> about 5.9% on ICL laptop due to it being very lightly TDP-bound with its >> rather huge TDP. The performance of almost every graphics benchmark >> I've tried improves significantly with it (a number of SynMark >> test-cases are improved by around 40% in perf-per-watt, Egypt >> perf-per-watt improves by about 25%). >> >> Hopefully we can come up with some alternative plan of action. > > It is very easy to replace the patch above with an alternative one on > top of linux-next that will add CPU_RESPONSE_FREQUENCY QoS along the > lines of the CPU latency QoS implementation in there without the need > restore to global QoS classes. > > IOW, you don't really need the code that goes away in linux-next to > implement what you need. > > Thanks! Sure, I'll do that.
"Rafael J. Wysocki" <rafael@kernel.org> writes: > Sorry for the late response, I was offline for a major part of the > previous week. > > On Fri, Feb 14, 2020 at 9:31 PM Francisco Jerez <currojerez@riseup.net> wrote: >> >> "Rafael J. Wysocki" <rafael@kernel.org> writes: >> >> > On Fri, Feb 14, 2020 at 1:14 AM Francisco Jerez <currojerez@riseup.net> wrote: >> >> >> >> "Rafael J. Wysocki" <rafael@kernel.org> writes: >> >> >> >> > On Thu, Feb 13, 2020 at 12:34 PM Rafael J. Wysocki <rafael@kernel.org> wrote: >> > >> > [cut] >> > >> >> > >> >> > I think that your use case is almost equivalent to the thermal >> >> > pressure one, so you'd want to limit the max and so that would be >> >> > something similar to store_max_perf_pct() with its input side hooked >> >> > up to a QoS list. >> >> > >> >> > But it looks like that QoS list would rather be of a "reservation" >> >> > type, so a request added to it would mean something like "leave this >> >> > fraction of power that appears to be available to the CPU subsystem >> >> > unused, because I need it for a different purpose". And in principle >> >> > there might be multiple requests in there at the same time and those >> >> > "reservations" would add up. So that would be a kind of "limited sum" >> >> > QoS type which wasn't even there before my changes. >> >> > >> >> > A user of that QoS list might then do something like >> >> > >> >> > ret = cpu_power_reserve_add(1, 4); >> >> > >> >> > meaning that it wants 25% of the "potential" CPU power to be not >> >> > utilized by CPU performance scaling and that could affect the >> >> > scheduler through load modifications (kind of along the thermal >> >> > pressure patchset discussed some time ago) and HWP (as well as the >> >> > non-HWP intel_pstate by preventing turbo frequencies from being used >> >> > etc). >> >> >> >> The problems with this are the same as with the per-CPU frequency QoS >> >> approach: How does the device driver know what the appropriate fraction >> >> of CPU power is? >> > >> > Of course it doesn't know and it may never know exactly, but it may guess. >> > >> > Also, it may set up a feedback loop: request an aggressive >> > reservation, run for a while, measure something and refine if there's >> > headroom. Then repeat. >> > >> >> Yeah, of course, but that's obviously more computationally intensive and >> less accurate than computing an approximately optimal constraint in a >> single iteration (based on knowledge from performance counters and a >> notion of the latency requirements of the application), since such a >> feedback loop relies on repeatedly overshooting and undershooting the >> optimal value (the latter causes an artificial CPU bottleneck, possibly >> slowing down other applications too) in order to converge to and remain >> in a neighborhood of the optimal value. > > I'm not saying that feedback loops are the way to go in general, but > that in some cases they are applicable and this particular case looks > like it may be one of them. > >> Incidentally people tested a power balancing solution with a feedback >> loop very similar to the one you're describing side by side to the RFC >> patch series I provided a link to earlier (which targeted Gen9 LP >> parts), and the energy efficiency improvements they observed were >> roughly half of the improvement obtained with my series unsurprisingly. >> >> Not to speak about generalizing such a feedback loop to bottlenecks on >> multiple I/O devices. > > The generalizing part I'm totally unconvinced above. 
> One of the main problems I see with generalizing a driver-controlled feedback loop to multiple devices is that any one of the drivers has no visibility into the performance of other workloads running on the same CPU core but not tied to the same feedback loop. E.g., consider a GPU-bound application running concurrently with some latency-bound application on the same CPU core: It would be easy (if somewhat inaccurate) for the GPU driver to monitor the utilization of the one device it controls in order to prevent performance loss as a result of its frequency constraints, but how could it tell whether it's having a negative impact on the performance of the other non-GPU-bound application? It doesn't seem possible to avoid that without the driver monitoring the performance counters of each CPU core *and* having some sort of interface in place for other unrelated applications to communicate their latency constraints (which brings us back to the PM QoS discussion). >> >> Depending on the instantaneous behavior of the >> >> workload it might take 1% or 95% of the CPU power in order to keep the >> >> IO device busy. Each user of this would need to monitor the performance >> >> of every CPU in the system and update the constraints on each of them >> >> periodically (whether or not they're talking to that IO device, which >> >> would possibly negatively impact the latency of unrelated applications >> >> running on other CPUs, unless we're willing to race with the task >> >> scheduler). >> > >> > No, it just needs to measure a signal representing how much power *it* >> > gets and decide whether or not it can let the CPU subsystem use more >> > power. >> > >> >> Well yes it's technically possible to set frequency constraints based on >> trial-and-error without sampling utilization information from the CPU >> cores, but don't we agree that this kind of information can be highly >> valuable? > > OK, so there are three things, frequency constraints (meaning HWP min > and max limits, for example), frequency requests (this is what cpufreq > does) and power limits. > > If the processor has at least some autonomy in driving the frequency, > using frequency requests (i.e. cpufreq governors) for limiting power > is inefficient in general, because the processor is not required to > grant those requests at all. > For limiting power, yes, I agree that it would be less effective than a RAPL constraint, but the purpose of my proposal is not to set an upper limit on the power usage of the CPU in absolute terms, but in terms relative to its performance: Given that the energy efficiency of the CPU is steadily decreasing with frequency past the inflection point of the power curve, it's more energy-efficient to set a frequency constraint rather than to set a constraint on its long-term average power consumption while letting the clock frequency swing arbitrarily around the most energy-efficient frequency. Please don't get me wrong: I think that leveraging RAPL constraints as an additional variable is genuinely useful, especially for thermal management, but it's largely complementary to frequency constraints, which provide a more direct way to control the energy efficiency of the CPU. But even if we decide to use RAPL for this, wouldn't the RAPL governor need to make a certain latency trade-off? 
In order to avoid performance degradation, it would be necessary for the governor to respond to changes in the load of the CPU, and some awareness of the latency constraints of the application seems necessary either way in order to do that effectively. IOW, the kind of latency constraint I wanted to propose would be useful to achieve the most energy-efficient outcome whether we use RAPL, frequency constraints, or both. > Using frequency limits may be good enough, but it generally limits the > processor's ability to respond at short-time scales (for example, > setting the max frequency limit will prevent the processor from using > frequencies above that limit even temporarily, but that might be the > most energy-efficient option in some cases). > > Using power limits (which is what RAPL does) doesn't bring such shortcomings in. But preventing a short-term oscillation of the CPU frequency is the desired outcome rather than a shortcoming whenever the time scale of the oscillation is orders of magnitude smaller than the latency requirements known to the application, since it lowers the energy efficiency (and therefore parallelism) of the system without any visible benefit for the workload. The mechanism I'm proposing wouldn't prevent such short-term oscillations where they are needed, except when an application or device driver explicitly requests PM to damp them. > >> >> A solution based on utilization clamps (with some >> >> extensions) sounds more future-proof to me honestly. >> > >> > Except that it would be rather hard to connect it to something like >> > RAPL, which should be quite straightforward with the approach I'm >> > talking about. >> > >> >> I think using RAPL as additional control variable would be useful, but >> fully orthogonal to the cap being set by some global mechanism or being >> derived from the aggregation of a number of per-process power caps based >> on the scheduler behavior. > > I'm not sure what do you mean by "the cap" here. A maximum frequency > limit or something else? > Either a frequency or a power cap. Either way it seems valuable (but not strictly necessary up front) for the cap to be derived from the scheduler's behavior. >> The latter sounds like the more reasonable >> fit for a multi-tasking, possibly virtualized environment honestly. >> Either way RAPL is neither necessary nor sufficient in order to achieve >> the energy efficiency improvement I'm working on. > > The "not necessary" I can agree with, but I don't see any arguments > for the "not sufficient" statement. > Not sufficient, since RAPL doesn't provide as direct a control over the energy efficiency of the system as a frequency constraint would [More on that above]. >> > The problem with all scheduler-based ways, again, is that there is no >> > direct connection between the scheduler and HWP, >> >> I was planning to introduce such a connection in RFC part 2. I have a >> prototype for that based on a not particularly pretty custom interface, >> I wouldn't mind trying to get it to use utilization clamps if you think >> that's the way forward. > > Well, I may think so, but that's just thinking at this point. I have > no real numbers to support that theory. > Right. And the only way to get numbers is to implement it. I wouldn't mind giving that a shot as a follow-up. But a PM QoS-based solution is likely to give most of the benefit in the most common scenarios. >> > or even with whatever the processor does with the P-states in the >> > turbo range. 
If any P-state in the turbo range is requested, the >> > processor has a license to use whatever P-state it wants, so this >> > pretty much means allowing it to use as much power as it can. >> > >> > So in the first place, if you want to limit the use of power in the >> > CPU subsystem through frequency control alone, you need to prevent it >> > from using turbo P-states at all. However, with RAPL you can just >> > limit power which may still allow some (but not all) turbo P-states to >> > be used. >> >> My goal is not to limit the use of power of the CPU (if it has enough >> load to utilize 100% of the cycles at turbo frequency so be it), but to >> get it to use it more efficiently. If you are constrained by a given >> power budget (e.g. the TDP or the one you want set via RAPL) you can do >> more with it if you set a stable frequency rather than if you let the >> CPU bounce back and forth between turbo and idle. > > Well, this basically means driving the CPU frequency by hand with the > assumption that the processor cannot do the right thing in this > respect, while in theory the HWP algorithm should be able to produce > the desired result. > > IOW, your argumentation seems to go into the "HWP is useless" > direction, more or less and while there are people who will agree with > such a statement, others won't. > I don't want to drive the CPU frequency by hand, and I don't think HWP is useless by any means. The purpose of my changes is to get HWP to do a better job by constraining its response to a reasonable range based on information which is largely unavailable to HWP -- E.g.: What are the latency constraints of the application? Does the application have an IO bottleneck? Which CPU core did we schedule the IO-bottlenecking application to? >> This can only be >> achieved effectively if the frequency governor has a rough idea of the >> latency requirements of the workload, since it involves a >> latency/energy-efficiency trade-off. > > Let me state this again (and this will be the last time, because I > don't really like to repeat points): the frequency governor can only > *request* the processor to do something in general and the request may > or may not be granted, for various reasons. If it is not granted, the > whole "control" mechanism fails. And what's wrong with that? The purpose of the latency constraint interface is not to provide a hard limit on the CPU frequency, but to give applications some influence on the latency trade-off made by the governor whenever it isn't in conflict with the constraints set by other applications (possibly as a result of them being part of the same clock domain which may indeed cause the effective frequency to deviate from the range specified by the P-state governor). IOW the CPU frequency momentarily exceeding the optimal value for any specific application wouldn't violate the interface. The result can still be massively more energy-efficient than placing a long-term power constraint, or not placing any constraint at all, even if P-state requests are not guaranteed to succeed in general. Regards, Francisco.
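As a back-of-the-envelope illustration of the energy-efficiency argument made in this exchange (not taken from the thread itself): if, in the upper part of the frequency range, voltage roughly tracks frequency, then dynamic power scales roughly as f^3 and energy per unit of work as f^2, so retiring the same work in turbo bursts costs noticeably more energy than doing it at a steady, sustainable frequency. A trivial sketch of the arithmetic, with made-up example frequencies and with idle and leakage power deliberately ignored (which is exactly what can make race-to-idle attractive in other regimes):

#include <stdio.h>

int main(void)
{
	const double f_stable = 3.0;	/* GHz, sustainable on the power budget */
	const double f_turbo  = 4.5;	/* GHz, used in the bursty case */

	/* Energy per unit of work, up to a common constant factor. */
	double e_stable = f_stable * f_stable;
	double e_turbo  = f_turbo * f_turbo;

	printf("relative energy/work at %.1f GHz: %.2f\n", f_stable, e_stable);
	printf("relative energy/work at %.1f GHz: %.2f\n", f_turbo, e_turbo);
	printf("turbo/stable ratio: %.2f\n", e_turbo / e_stable);

	/*
	 * ~2.25x: to first order, the same power budget retires roughly
	 * 2.25x less work when it is spent in 4.5 GHz bursts than when it
	 * is spent steadily at 3 GHz.
	 */
	return 0;
}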