Message ID | 20230827233203.1315953-1-qyousef@layalina.io (mailing list archive)
---|---
Series | sched: cpufreq: Remove magic margins
Hi Qais, On 8/28/23 00:31, Qais Yousef wrote: > Since the introduction of EAS and schedutil, we had two magic 0.8 and 1.25 > margins applied in fits_capacity() and apply_dvfs_headroom(). > > As reported two years ago in > > https://lore.kernel.org/lkml/1623855954-6970-1-git-send-email-yt.chang@mediatek.com/ > > these values are not good fit for all systems and people do feel the need to > modify them regularly out of tree. That is true, in the Android kernel those are known 'features'. Furthermore, in my game testing it looks like higher margins do help to shrink the number of dropped frames, while on other types of workloads (e.g. those that you have in the link above) the 0% shows better energy. I remember also the results from MTK regarding the PELT HALF_LIFE https://lore.kernel.org/all/0f82011994be68502fd9833e499749866539c3df.camel@mediatek.com/ The numbers for the 8ms half_life were showing a really nice improvement for the 'min fps' metric. I got similar results with a higher margin. IMO we can derive quite important information from those different experiments: More sustainable workloads like "Yahoo browser" don't need a margin. More unpredictable workloads like "Fortnite" (a shooter game with an 'open world') need some decent margin. The problem is that the periodic task can be 'noisy'. The low-pass filter which is our exponentially weighted moving avg PELT will 'smooth' the measured values. It will block sudden 'spikes' since they are high-frequency changes. Those sudden 'spikes' are the task activations where we need to compute a bit longer, e.g. there was an explosion in the game. The 25% margin helps us to be ready for this 'noisy' task - the CPU frequency is higher (and capacity). So if a sudden need for longer computation is seen, then we have enough 'idle time' (~25% idle) to serve this properly and not lose the frame. The margin helps in two ways for 'noisy' workloads: 1. in fits_capacity() it avoids a CPU which couldn't handle the task and prefers CPUs with higher capacity 2. it asks for longer 'idle time' e.g. 25-40% (depends on the margin) to serve a sudden computation need IIUC, your proposal is to: 1. extend the low-pass filter to some higher frequency, so we could see those 'spikes' - that's the PELT HALF_LIFE boot parameter for 8ms 1.1. You are likely to have a 'gift' from the Util_est which picks the max util_avg values and maintains them for a while. That's why the 8ms PELT information can last longer and you can get higher frequency and longer idle time. 2. Plumb in this new idea of dvfs_update_delay as the new 'margin' - this I don't understand For 2., I don't see that the DVFS HW characteristics are the best fit for this problem. We can have really fast DVFS HW, but we still need some decent spare idle time in some workloads; these are two independent issues IMO. You might get the higher idle time thanks to 1.1. but this is a 'side effect'. Could you explain a bit more why this dvfs_update_delay is crucial here? Regards, Lukasz
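For reference, the two hardcoded factors under discussion look roughly like this in current mainline (paraphrased from kernel/sched/fair.c and include/linux/sched/cpufreq.h around v6.5; an illustration of the existing behaviour only, not the code this series replaces it with):

    /* fits_capacity(): util only "fits" a CPU if it leaves ~20% headroom,
     * i.e. util * 1.25 < capacity (the 0.8 margin seen from the other side). */
    #define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)

    /* schedutil's DVFS headroom: request ~25% more than the current util
     * strictly needs, so it has room to grow before the next freq update. */
    static inline unsigned long map_util_perf(unsigned long util)
    {
            return util + (util >> 2);
    }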
Hi Lukasz On 09/06/23 10:18, Lukasz Luba wrote: > Hi Qais, > > On 8/28/23 00:31, Qais Yousef wrote: > > Since the introduction of EAS and schedutil, we had two magic 0.8 and 1.25 > > margins applied in fits_capacity() and apply_dvfs_headroom(). > > > > As reported two years ago in > > > > https://lore.kernel.org/lkml/1623855954-6970-1-git-send-email-yt.chang@mediatek.com/ > > > > these values are not good fit for all systems and people do feel the need to > > modify them regularly out of tree. > > That is true, in Android kernel those are known 'features'. Furthermore, > in my game testing it looks like higher margins do help to shrink > number of dropped frames, while on other types of workloads (e.g. > those that you have in the link above) the 0% shows better energy. Do you keep margins high for all types of CPU? I think the littles are the problematic ones which higher margins helps as this means you move away from them quickly. > > I remember also the results from MTK regarding the PELT HALF_LIFE > > https://lore.kernel.org/all/0f82011994be68502fd9833e499749866539c3df.camel@mediatek.com/ > > The numbers for 8ms half_life where showing really nice improvement > for the 'min fps' metric. I got similar with higher margin. > > IMO we can derive quite important information from those different > experiments: > More sustainable workloads like "Yahoo browser" don't need margin. > More unpredictable workloads like "Fortnite" (shooter game with 'open > world') need some decent margin. Yeah. So the point is that while we should have a sensible default, but there isn't a one size fits all. But the question is how the user/sysadmin should control this? This series is what I propose of course :) I also think the current forced/fixed margin values enforce a policy that is clearly not a good default on many systems. With no alternative in hand but to hack their own solutions. > > The problem is that the periodic task can be 'noisy'. The low-pass Hehe. That's because they're not really periodic ;-) I think the model of a periodic task is not suitable for most workloads. All of them are dynamic and how much they need to do at each wake up can very significantly over 10s of ms. > filter which is our exponentially weighted moving avg PELT will > 'smooth' the measured values. It will block sudden 'spikes' since > they are high-frequency changes. Those sudden 'spikes' are > the task activations where we need to compute a bit longer, e.g. > there was explosion in the game. The 25% margin helps us to > be ready for this 'noisy' task - the CPU frequency is higher > (and capacity). So if a sudden need for longer computation > is seen, then we have enough 'idle time' (~25% idle) to serve this > properly and not loose the frame. > > The margin helps in two ways for 'noisy' workloads: > 1. in fits_capacity() to avoid a CPU which couldn't handle it > and prefers CPUs with higher capacity > 2. it asks for longer 'idle time' e.g. 25-40% (depends on margin) to > serve sudden computation need > > IIUC, your proposal is to: > 1. extend the low-pass filter to some higher frequency, so we > could see those 'spikes' - that's the PELT HALF_LIFE boot > parameter for 8ms That's another way to look at it, yes. We can control how reactive we'd like the system to be for changes. > 1.1. You are likely to have a 'gift' from the Util_est > which picks the max util_avg values and maintains them > for a while. That's why the 8ms PELT information can last longer > and you can get higher frequency and longer idle time. 
This is probably controversial statement. But I am not in favour of util_est. I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as default instead. But I will need to do a separate investigation on that. > 2. Plumb in this new idea of dvfs_update_delay as the new > 'margin' - this I don't understand > > For the 2. I don't see that the dvfs HW characteristics are best > for this problem purpose. We can have a really fast DVFS HW, > but we need some decent spare idle time in some workloads, which > are two independent issues IMO. You might get the higher > idle time thanks to 1.1. but this is a 'side effect'. > > Could you explain a bit more why this dvfs_update_delay is > crucial here? I'm not sure why you relate this to idle time. And the word margin is a bit overloaded here. so I suppose you're referring to the one we have in map_util_perf() or apply_dvfs_headroom(). And I suppose you assume this extra headroom will result in idle time, but this is not necessarily true IMO. My rationale is simply that DVFS based on util should follow util_avg as-is. But as pointed out in different discussions happened elsewhere, we need to provide a headroom for this util to grow as if we were to be exact and the task continues to run, then likely the util will go above the current OPP before we get a chance to change it again. If we do have an ideal hardware that takes 0 time to change frequency, then this headroom IMO is not needed because frequency will follow us as util grows. Assuming here that util updates instantaneously as the task continues to run. So instead of a constant 25% headroom; I redefine this to be a function of the hardware delay. If we take a decision now to choose which OPP, then it should be based on util_avg value after taking into account how much it'll grow before we take the next decision (which the dvfs_update_delay). We don't need any more than that. Maybe we need to take into account how often we call update_load_avg(). I'm not sure about this yet. If the user wants to have faster response time, then the new knobs are the way to control that. But the headroom should be small enough to make sure we don't overrun until the next decision point. Not less, and not more. Thanks! -- Qais Yousef
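To put numbers on the 'headroom as a function of the hardware delay' argument above, a rough back-of-envelope using the standard PELT approximation (an illustration only, not code from the series): with the default 32ms half-life, a continuously running task's util_avg approaches 1024 as util(t+d) ~= 1024 - (1024 - util(t)) * 0.5^(d/32), so the most it can grow before a DVFS update d milliseconds away is:

    #include <math.h>
    #include <stdio.h>

    /* how much can util_avg grow while we wait for the next DVFS update?
     * assumes the default 32ms PELT half-life and a task running flat out */
    static double pelt_growth(double util, double delay_ms)
    {
            double y = pow(0.5, 1.0 / 32.0);        /* per-ms decay factor */

            return (1024.0 - util) * (1.0 - pow(y, delay_ms));
    }

    int main(void)
    {
            /* starting from util = 800; the fixed 25% headroom adds 200 */
            printf("2ms  DVFS delay: ~%.0f\n", pelt_growth(800, 2));   /* ~9  */
            printf("20ms DVFS delay: ~%.0f\n", pelt_growth(800, 20));  /* ~79 */
            return 0;
    }

Which is the core of the argument: on hardware with a short update delay the required headroom comes out far below the fixed 25%.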
On 9/6/23 22:18, Qais Yousef wrote: > Hi Lukasz > > On 09/06/23 10:18, Lukasz Luba wrote: >> Hi Qais, >> >> On 8/28/23 00:31, Qais Yousef wrote: >>> Since the introduction of EAS and schedutil, we had two magic 0.8 and 1.25 >>> margins applied in fits_capacity() and apply_dvfs_headroom(). >>> >>> As reported two years ago in >>> >>> https://lore.kernel.org/lkml/1623855954-6970-1-git-send-email-yt.chang@mediatek.com/ >>> >>> these values are not good fit for all systems and people do feel the need to >>> modify them regularly out of tree. >> >> That is true, in Android kernel those are known 'features'. Furthermore, >> in my game testing it looks like higher margins do help to shrink >> number of dropped frames, while on other types of workloads (e.g. >> those that you have in the link above) the 0% shows better energy. > > Do you keep margins high for all types of CPU? I think the littles are the > problematic ones which higher margins helps as this means you move away from > them quickly. That's true, for the Littles higher margins helps to evacuate tasks sooner. I have experiments showing good results with 60% margin on Littles, while on Big & Mid 20%, 30%. The Littles still have also tasks in cgroups cpumask which are quite random, so they cannot migrate, but have a bit higher 'idle time' headroom. > >> >> I remember also the results from MTK regarding the PELT HALF_LIFE >> >> https://lore.kernel.org/all/0f82011994be68502fd9833e499749866539c3df.camel@mediatek.com/ >> >> The numbers for 8ms half_life where showing really nice improvement >> for the 'min fps' metric. I got similar with higher margin. >> >> IMO we can derive quite important information from those different >> experiments: >> More sustainable workloads like "Yahoo browser" don't need margin. >> More unpredictable workloads like "Fortnite" (shooter game with 'open >> world') need some decent margin. > > Yeah. So the point is that while we should have a sensible default, but there > isn't a one size fits all. But the question is how the user/sysadmin should > control this? This series is what I propose of course :) > > I also think the current forced/fixed margin values enforce a policy that is > clearly not a good default on many systems. With no alternative in hand but to > hack their own solutions. I see. > >> >> The problem is that the periodic task can be 'noisy'. The low-pass > > Hehe. That's because they're not really periodic ;-) They are periodic in a sense, they wake up every 16ms, but sometimes they have more work. It depends what is currently going in the game and/or sometimes the data locality (might not be in cache). Although, that's for games, other workloads like youtube play or this one 'Yahoo browser' (from your example) are more 'predictable' (after the start up period). And I really like the potential energy saving there :) > > I think the model of a periodic task is not suitable for most workloads. All > of them are dynamic and how much they need to do at each wake up can very > significantly over 10s of ms. Might be true, the model was built a few years ago when there wasn't such dynamic game scenario with high FPS on mobiles. This could still be tuned with your new design IIUC (no need extra hooks in Android). > >> filter which is our exponentially weighted moving avg PELT will >> 'smooth' the measured values. It will block sudden 'spikes' since >> they are high-frequency changes. Those sudden 'spikes' are >> the task activations where we need to compute a bit longer, e.g. 
>> there was explosion in the game. The 25% margin helps us to >> be ready for this 'noisy' task - the CPU frequency is higher >> (and capacity). So if a sudden need for longer computation >> is seen, then we have enough 'idle time' (~25% idle) to serve this >> properly and not loose the frame. >> >> The margin helps in two ways for 'noisy' workloads: >> 1. in fits_capacity() to avoid a CPU which couldn't handle it >> and prefers CPUs with higher capacity >> 2. it asks for longer 'idle time' e.g. 25-40% (depends on margin) to >> serve sudden computation need >> >> IIUC, your proposal is to: >> 1. extend the low-pass filter to some higher frequency, so we >> could see those 'spikes' - that's the PELT HALF_LIFE boot >> parameter for 8ms > > That's another way to look at it, yes. We can control how reactive we'd like > the system to be for changes. Which make sense in context to what I said above (newer gaming). > >> 1.1. You are likely to have a 'gift' from the Util_est >> which picks the max util_avg values and maintains them >> for a while. That's why the 8ms PELT information can last longer >> and you can get higher frequency and longer idle time. > > This is probably controversial statement. But I am not in favour of util_est. > I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as > default instead. But I will need to do a separate investigation on that. I like util_est, sometimes it helps ;) > >> 2. Plumb in this new idea of dvfs_update_delay as the new >> 'margin' - this I don't understand >> >> For the 2. I don't see that the dvfs HW characteristics are best >> for this problem purpose. We can have a really fast DVFS HW, >> but we need some decent spare idle time in some workloads, which >> are two independent issues IMO. You might get the higher >> idle time thanks to 1.1. but this is a 'side effect'. >> >> Could you explain a bit more why this dvfs_update_delay is >> crucial here? > > I'm not sure why you relate this to idle time. And the word margin is a bit > overloaded here. so I suppose you're referring to the one we have in > map_util_perf() or apply_dvfs_headroom(). And I suppose you assume this extra > headroom will result in idle time, but this is not necessarily true IMO. > > My rationale is simply that DVFS based on util should follow util_avg as-is. > But as pointed out in different discussions happened elsewhere, we need to > provide a headroom for this util to grow as if we were to be exact and the task > continues to run, then likely the util will go above the current OPP before we > get a chance to change it again. If we do have an ideal hardware that takes Yes, this is another requirement to have +X% margin. When the tasks are growing, we don't know their final util_avg and we give them a bit more cycles. IMO we have to be ready always for such situation in the scheduler, haven't we? > 0 time to change frequency, then this headroom IMO is not needed because > frequency will follow us as util grows. Assuming here that util updates > instantaneously as the task continues to run. > > So instead of a constant 25% headroom; I redefine this to be a function of the > hardware delay. If we take a decision now to choose which OPP, then it should > be based on util_avg value after taking into account how much it'll grow before > we take the next decision (which the dvfs_update_delay). We don't need any more > than that. > > Maybe we need to take into account how often we call update_load_avg(). I'm not > sure about this yet. 
> > If the user wants to have faster response time, then the new knobs are the way > to control that. But the headroom should be small enough to make sure we don't > overrun until the next decision point. Not less, and not more. For ideal workloads (rt-app) or those 'calm' yes, we could save energy (as you pointed for this 0% margin energy values). I do like this 10% energy saving in some DoU scenarios. I couldn't catch the idea with feeding the dvfs response information into this equation. We might discuss this offline ;) Cheers, Lukasz
On Thu, Sep 07, 2023 at 08:48:08AM +0100, Lukasz Luba wrote: > > Hehe. That's because they're not really periodic ;-) > > They are periodic in a sense, they wake up every 16ms, but sometimes > they have more work. It depends what is currently going in the game > and/or sometimes the data locality (might not be in cache). > > Although, that's for games, other workloads like youtube play or this > one 'Yahoo browser' (from your example) are more 'predictable' (after > the start up period). And I really like the potential energy saving > there :) So everything media is fundamentally periodic, you're hard tied to the framerate / audio-buffer size etc.. Also note that the traditional periodic task model from the real-time community has the notion of WCET, which completely covers this fluctuation in frame-to-frame work, it only considers the absolute worst case. Now, practically, that stinks, esp. when you care about batteries, but it does not mean these tasks are not periodic. Many extensions to the periodic task model are possible, including things like average runtime with bursts etc.. all have their trade-offs.
On 9/7/23 12:53, Peter Zijlstra wrote: > On Thu, Sep 07, 2023 at 08:48:08AM +0100, Lukasz Luba wrote: > >>> Hehe. That's because they're not really periodic ;-) >> >> They are periodic in a sense, they wake up every 16ms, but sometimes >> they have more work. It depends what is currently going in the game >> and/or sometimes the data locality (might not be in cache). >> >> Although, that's for games, other workloads like youtube play or this >> one 'Yahoo browser' (from your example) are more 'predictable' (after >> the start up period). And I really like the potential energy saving >> there :) > > So everything media is fundamentally periodic, you're hard tied to the > framerate / audio-buffer size etc.. Agree > > Also note that the traditional periodic task model from the real-time > community has the notion of WCET, which completely covers this > fluctuation in frame-to-frame work, it only considers the absolute worst > case. That's a good point, the WCET here. IMO a shorter PELT e.g. 8ms allows us to 'see' a bit more of that information: the worst case in the fluctuation of a particular task. Then this 'seen' value is maintained in util_est for a while. That's why (probably) I see better 95th- and 99th-percentile numbers for frame rendering time. > > Now, practically, that stinks, esp. when you care about batteries, but > it does not mean these tasks are not periodic. Totally agree they are periodic. > > Many extensions to the periodic task model are possible, including > things like average runtime with bursts etc.. all have their trade-offs. Was that maybe proposed somewhere on LKML (the other models)? I can recall one idea - WALT. IIRC the WALT proposal was ~2016/2017 with some discussion/conferences; it didn't get positive feedback [1]. I don't know if you remember those numbers back then, e.g. video 1080p playback was using ~10% less energy... Those 10%-15% are still important for us ;) Regards, Lukasz [1] https://lore.kernel.org/all/1477638642-17428-1-git-send-email-markivx@codeaurora.org/
On Mon, Aug 28, 2023 at 12:31:56AM +0100, Qais Yousef wrote: > Equally recent discussion in PELT HALFLIFE thread highlighted the need for > a way to tune system response time to achieve better perf, power and thermal > characteristic for a given system > > https://lore.kernel.org/lkml/20220829055450.1703092-1-dietmar.eggemann@arm.com/ > > To further help tune the system, we introduce PELT HALFLIFE multiplier as > a boot time parameter. This parameter has an impact on how fast we migrate, so > should compensate for whoever needed to tune fits_capacity(); and it has great > impact on default response_time_ms. Particularly it gives a natural faster rise > time when the system gets busy, AND fall time when the system goes back to > idle. It is coarse grain response control that can be coupled with finer grain > control via schedutil's response_time_ms. You're misrepresenting things... The outcome of that thread above was that PELT halftime was not the primary problem. Specifically: https://lore.kernel.org/lkml/424e2c81-987d-f10e-106d-8b4c611768bc@arm.com/ mentions that the only thing that gaming nonsense cares about is DVFS ramp-up. None of the other PELT users mattered one bit. Also, ISTR a fair amount of this was workload dependent. So a solution that has per-task configurability -- like UTIL_EST_FASTER, seems more suitable. I'm *really* hesitant on adding all these mostly random knobs -- esp. without strong justification -- which you don't present. You mostly seem to justify things with: people do random hack, we should legitimize them hacks. Like the last time around, I want the actual problem explained. The problem is not that random people on the internet do random things to their kernel.
On Wed, Sep 06, 2023 at 10:18:50PM +0100, Qais Yousef wrote: > This is probably controversial statement. But I am not in favour of util_est. > I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as > default instead. But I will need to do a separate investigation on that. I think util_est makes perfect sense, where PELT has to fundamentally decay non-running / non-runnable tasks in order to provide a temporal average, DVFS might be best served with a temporal max filter.
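A condensed sketch of the max-filter behaviour util_est already provides (simplified and paraphrased from kernel/sched/fair.c around v6.5; field and helper names approximate):

    struct util_est {
            unsigned int enqueued;  /* util_avg snapshot latched at dequeue */
            unsigned int ewma;      /* slow EWMA of those snapshots */
    };

    /* consumers (schedutil, task placement) see the max of the decaying
     * average and the latched estimate, i.e. a temporal max filter */
    static unsigned long task_util_est(unsigned long util_avg, struct util_est ue)
    {
            unsigned long est = ue.enqueued > ue.ewma ? ue.enqueued : ue.ewma;

            return util_avg > est ? util_avg : est;
    }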
On Thu, Sep 07, 2023 at 02:06:15PM +0100, Lukasz Luba wrote: > > Many extensions to the periodic task model are possible, including > > things like average runtime with bursts etc.. all have their trade-offs. > > Was that maybe proposed somewhere on LKML (the other models)? RT literature mostly, methinks. Replacing WCET with a statistical model of sorts is not uncommon, the argument goes that not everybody will have their worst case at the same time and lows and highs can commonly cancel out and this way we can cram a little more on the system. Typically this is proposed in the context of soft-realtime systems.
On 9/7/23 14:29, Peter Zijlstra wrote: > On Thu, Sep 07, 2023 at 02:06:15PM +0100, Lukasz Luba wrote: > >>> Many extentions to the periodic task model are possible, including >>> things like average runtime with bursts etc.. all have their trade-offs. >> >> Was that maybe proposed somewhere on LKML (the other models)? > > RT literatur mostly methinks. Replacing WCET with a statistical model of > sorts is not uncommon, the argument goes that not everybody will have > their worst case at the same time and lows and highs can commonly cancel > out and this way we can cram a little more on the system. > > Typically this is proposed in the context of soft-realtime systems. Thanks Peter, I will dive into some books...
On Thu, Sep 07, 2023 at 02:33:49PM +0100, Lukasz Luba wrote: > > > On 9/7/23 14:29, Peter Zijlstra wrote: > > On Thu, Sep 07, 2023 at 02:06:15PM +0100, Lukasz Luba wrote: > > > > > > Many extentions to the periodic task model are possible, including > > > > things like average runtime with bursts etc.. all have their trade-offs. > > > > > > Was that maybe proposed somewhere on LKML (the other models)? > > > > RT literatur mostly methinks. Replacing WCET with a statistical model of > > sorts is not uncommon, the argument goes that not everybody will have > > their worst case at the same time and lows and highs can commonly cancel > > out and this way we can cram a little more on the system. > > > > Typically this is proposed in the context of soft-realtime systems. > > Thanks Peter, I will dive into some books... I would look at academic papers, not sure any of that ever made it to books, Daniel would know I suppose.
On 9/7/23 14:38, Peter Zijlstra wrote: > On Thu, Sep 07, 2023 at 02:33:49PM +0100, Lukasz Luba wrote: >> >> >> On 9/7/23 14:29, Peter Zijlstra wrote: >>> On Thu, Sep 07, 2023 at 02:06:15PM +0100, Lukasz Luba wrote: >>> >>>>> Many extentions to the periodic task model are possible, including >>>>> things like average runtime with bursts etc.. all have their trade-offs. >>>> >>>> Was that maybe proposed somewhere on LKML (the other models)? >>> >>> RT literatur mostly methinks. Replacing WCET with a statistical model of >>> sorts is not uncommon, the argument goes that not everybody will have >>> their worst case at the same time and lows and highs can commonly cancel >>> out and this way we can cram a little more on the system. >>> >>> Typically this is proposed in the context of soft-realtime systems. >> >> Thanks Peter, I will dive into some books... > > I would look at academic papers, not sure any of that ever made it to > books, Daniel would know I suppose. Good hint, thanks!
On 9/7/23 14:26, Peter Zijlstra wrote: > On Wed, Sep 06, 2023 at 10:18:50PM +0100, Qais Yousef wrote: > >> This is probably controversial statement. But I am not in favour of util_est. >> I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as >> default instead. But I will need to do a separate investigation on that. > > I think util_est makes perfect sense, where PELT has to fundamentally > decay non-running / non-runnable tasks in order to provide a temporal > average, DVFS might be best served with a termporal max filter. > > Since we are here... Would you allow to have a configuration for the util_est shifter: UTIL_EST_WEIGHT_SHIFT ? I've found other values than '2' better in some scenarios. That helps to prevent a big task to 'down' migrate from a Big CPU (1024) to some Mid CPU (~500-700 capacity) or even Little (~120-300).
On Thu, Sep 07, 2023 at 02:57:26PM +0100, Lukasz Luba wrote: > > > On 9/7/23 14:26, Peter Zijlstra wrote: > > On Wed, Sep 06, 2023 at 10:18:50PM +0100, Qais Yousef wrote: > > > > > This is probably controversial statement. But I am not in favour of util_est. > > > I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as > > > default instead. But I will need to do a separate investigation on that. > > > > I think util_est makes perfect sense, where PELT has to fundamentally > > decay non-running / non-runnable tasks in order to provide a temporal > > average, DVFS might be best served with a termporal max filter. > > > > > > Since we are here... > Would you allow to have a configuration for > the util_est shifter: UTIL_EST_WEIGHT_SHIFT ? > > I've found other values than '2' better in some scenarios. That helps > to prevent a big task to 'down' migrate from a Big CPU (1024) to some > Mid CPU (~500-700 capacity) or even Little (~120-300). Larger values, I'm thinking you're after? Those would cause the new contribution to weight less, making the function more smooth, right? What task characteristic is tied to this? That is, this seems trivial to modify per-task.
On 9/7/23 15:29, Peter Zijlstra wrote: > On Thu, Sep 07, 2023 at 02:57:26PM +0100, Lukasz Luba wrote: >> >> >> On 9/7/23 14:26, Peter Zijlstra wrote: >>> On Wed, Sep 06, 2023 at 10:18:50PM +0100, Qais Yousef wrote: >>> >>>> This is probably controversial statement. But I am not in favour of util_est. >>>> I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as >>>> default instead. But I will need to do a separate investigation on that. >>> >>> I think util_est makes perfect sense, where PELT has to fundamentally >>> decay non-running / non-runnable tasks in order to provide a temporal >>> average, DVFS might be best served with a termporal max filter. >>> >>> >> >> Since we are here... >> Would you allow to have a configuration for >> the util_est shifter: UTIL_EST_WEIGHT_SHIFT ? >> >> I've found other values than '2' better in some scenarios. That helps >> to prevent a big task to 'down' migrate from a Big CPU (1024) to some >> Mid CPU (~500-700 capacity) or even Little (~120-300). > > Larger values, I'm thinking you're after? Those would cause the new > contribution to weight less, making the function more smooth, right? Yes, more smooth, because we only use the 'ewma' goodness for decaying part (not the raising [1]). > > What task characteristic is tied to this? That is, this seems trivial to > modify per-task. In particular Speedometer test and the main browser task, which reaches ~900util, but sometimes vanish and waits for other background tasks to do something. In the meantime it can decay and wake-up on Mid/Little (which can cause a penalty to score up to 5-10% vs. if we pin the task to big CPUs). So, a longer util_est helps to avoid at least very bad down migration to Littles... [1] https://elixir.bootlin.com/linux/v6.5.1/source/kernel/sched/fair.c#L4442
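The shift being asked about controls how much weight each new sample gets in util_est's EWMA. Paraphrasing the decay-side update from util_est_update() in the fair.c code linked at [1] (simplified, signedness details dropped):

    #define UTIL_EST_WEIGHT_SHIFT   2       /* new sample weighs 1/4 by default */

    static int util_est_ewma(int ewma, int enqueued, int shift)
    {
            /* ewma += (enqueued - ewma) / 2^shift: with shift = 2 one short
             * activation pulls the estimate 25% of the way down; with
             * shift = 4 only ~6%, so a big task that briefly "vanishes"
             * keeps looking big for longer and is less likely to be
             * down-migrated to a Mid/Little. */
            return ewma + (enqueued - ewma) / (1 << shift);
    }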
On Thu, Sep 07, 2023 at 03:42:13PM +0100, Lukasz Luba wrote: > > What task characteristic is tied to this? That is, this seems trivial to > > modify per-task. > > In particular Speedometer test and the main browser task, which reaches > ~900util, but sometimes vanish and waits for other background tasks > to do something. In the meantime it can decay and wake-up on > Mid/Little (which can cause a penalty to score up to 5-10% vs. if > we pin the task to big CPUs). So, a longer util_est helps to avoid > at least very bad down migration to Littles... Do they do a few short activations (wakeup/sleeps) while waiting? That would indeed completely ruin things since the EWMA thing is activation based. I wonder if there's anything sane we can do here...
On 09/07/23 15:08, Peter Zijlstra wrote: > On Mon, Aug 28, 2023 at 12:31:56AM +0100, Qais Yousef wrote: > > > Equally recent discussion in PELT HALFLIFE thread highlighted the need for > > a way to tune system response time to achieve better perf, power and thermal > > characteristic for a given system > > > > https://lore.kernel.org/lkml/20220829055450.1703092-1-dietmar.eggemann@arm.com/ > > > > > To further help tune the system, we introduce PELT HALFLIFE multiplier as > > a boot time parameter. This parameter has an impact on how fast we migrate, so > > should compensate for whoever needed to tune fits_capacity(); and it has great > > impact on default response_time_ms. Particularly it gives a natural faster rise > > time when the system gets busy, AND fall time when the system goes back to > > idle. It is coarse grain response control that can be coupled with finer grain > > control via schedutil's response_time_ms. > > You're misrepresenting things... The outcome of that thread above was Sorry if I did. My PoV might have gotten skewed. I'm not intending to mislead for sure. I actually was hesitant about adding the PELT patch initially, but it did feel that the two topics are connected. Margins are causing problems because they end up wasting power. So there's a desire to slow the current response down. But this PELT story wanted to speed things up. And this polar opposite is, I think, the distilled problem. > that PELT halftime was not the primary problem. Specifically: > > https://lore.kernel.org/lkml/424e2c81-987d-f10e-106d-8b4c611768bc@arm.com/ > > mentions that the only thing that gaming nonsense cares about is DVFS > ramp-up. > > None of the other PELT users mattered one bit. I actually latched onto Vincent's response that a boot time parameter makes sense. Just to be clear, my main issue here is with the current hardcoded values of the 'margins'. And the fact they go too fast is my main problem. The way I saw PELT fitting in this story is to help lower end systems which don't have a lot of oomph. For a reasonably powerful system, I don't see a necessity to change this and DVFS is what matters, I agree. It was my attempt to draw a full picture and cover the full spectrum. I don't think PELT halflife plays a role in powerful systems. But for under-powered ones, I think it will help; and that's why I was depicting it as coarse grain control. I think I did try to present similar arguments on that thread. > > Also, ISTR a fair amount of this was workload dependent. So a solution > that has per-task configurability -- like UTIL_EST_FASTER, seems more > suitable. But for the 0.8 and 1.25 margin problems, actually the problem is that 25% is too aggressive/fast and wastes power. I'm actually slowing things down as a result of this series. And I'm expecting some not to be happy about it on their systems. The response_time_ms was my way to give back control. I didn't see how I can make things faster and slower at the same time without making decisions on behalf of the user/sysadmin. So the connection I see between PELT and the margins or headrooms in fits_capacity() and map_util_perf()/dvfs_headroom is that they expose the need to manage the perf/power trade-off of the system. Particularly, the default is not good for modern systems; Cortex-X is too powerful but we still operate within the same power and thermal budgets. And what was a high end A78 is a mid core today. So if you look at today's mobile world topology we really have a tiny+big+huge combination of cores.
The bigs are called mids, but they're very capable. Fits capacity forces migration to the 'huge' cores too soon with that 80% margin. While the 80% might be too small for the tiny ones as some workloads really struggle there if they hang on for too long. It doesn't help that these systems ship with 4ms tick. Something more to consider changing I guess. And the 25% headroom forces near max frequency to be used when the workload is happily hovering in the 750 region. I did force the frequency to be lower and the workload is happily running - we don't need the extra 25% headroom enforced unconditionally. UTIL_EST_FASTER moves in one direction. And it's a constant response too, no? I didn't get the per-task configurability part. AFAIU we can't turn off these sched-features if they end up causing power issues. That what makes me hesitant about them. There's a bias towards perf. But some systems prefer to save power at the expense of perf. There's a lot of grey areas in between to what perceived as a suitable trade-off for perf vs power. There are cases like above where actually you can lower freqs without hit on perf. But most of the time it's a trade-off; and some do decide to drop perf in favour of power. Keep in mind battery capacity differs between systems with the same SoC even. Some ship to enable more perf, others are more constrained and opt to be more efficient. Sorry I didn't explain things well in the cover letter. > I'm *really* hesitant on adding all these mostly random knobs -- esp. > without strong justification -- which you don't present. You mostly seem > to justify things with: people do random hack, we should legitimize them > hacks. I share your sentiment and I am trying to find out what's the right thing to do really. I am open to explore other territories. But from what I see there's a real need to give users the power to tune how responsive their system needs to be. I can't see how we can have one size that fits all here given the different capabilities of the systems and the desired outcome (I want more perf vs more efficiency). > Like the last time around, I want the actual problem explained. The > problem is not that random people on the internet do random things to > their kernel. The problem is that those 0.8 and 1.25 margins forces unsuitable default. The case I see the most is it is causing wasting power that tuning it down regains this power at no perf cost or small one. Others actually do tune it for faster response, but I can't cover this case in detail. All I know is lower end systems do struggle as they don't have enough oomph. I also saw comparison on phoronix where schedutil is not doing as good still - which tells me it seems server systems do prefer to ramp up faster too. I think that PELT thread is a variation of the same problem. So one of the things I saw is a workload where it spends majority of the time in 600-750 util_avg range. Rarely ramps up to max. But the workload under uses the medium cores and runs at a lot higher freqs than it really needs on bigs. We don't end up utilizing our resources properly. Happy to go and dig for more data/info if this is not good enough :) There's a question that I'm struggling with if I may ask. Why is it perceived our constant response time (practically ~200ms to go from 0 to max) as a good fit for all use cases? Capability of systems differs widely in terms of what performance you get at say a util of 512. 
Or in other words how much work is done in a unit of time differs between system, but we still represent that work in a constant way. A task ran for 10ms on powerful System A would have done a lot more work than running on poor System B for the same 10ms. But util will still rise the same for both cases. If someone wants to allow this task to be able to do more on those 10ms, it seems natural to be able to control this response time. It seems this thinking is flawed for some reason and I'd appreciate a help to understand why. I think a lot of us perceive this problem this way. Hopefully uclamp will help address these issues in a better way. ADPF gives apps a way to access it reasonably now. Unity announced support for ADPF, so hopefully games and other workloads can learn to be smarter overtime. But the spectrum of workloads to cover is still big, and adoption will take time. And there are still lessons to be learnt and improvements to make. I expect this effort to take time before it's the norm. And thinking of desktop systems; distros like Debian for example still don't enable uclamp by default on their kernels. I sent asking to enable it and it got added to wishlist.. Actually even schedutil is not enabled by default on my Pine64 running Armbian nor on my Mac Mini with M1 chip running Asahi Linux Ubuntu. You'd think big.LITTLE systems should have EAS written all over them, but not sure if this is accidental omission or ondemand is actually perceived as better. I think my intel systems also don't run schedutil by default still too. They're not waking up on lan now to double check though (yep, saving power :D). Happy to go and try to dig more info if this is still not clear enough. Thanks! -- Qais Yousef
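For the '~200ms from 0 to max' figure and the half-life multiplier discussed above, the PELT geometric series gives a quick feel for what the multiplier changes (same approximation as earlier; the ~200ms figure presumably also includes DVFS rate limiting on top of the pure PELT ramp):

    #include <math.h>
    #include <stdio.h>

    /* time for util_avg to ramp from 0 to a fraction f of max while running
     * continuously: t = halflife * log2(1 / (1 - f)) milliseconds */
    static double ramp_ms(double halflife_ms, double fraction)
    {
            return halflife_ms * log2(1.0 / (1.0 - fraction));
    }

    int main(void)
    {
            printf("HL 32ms to 95%% of max: ~%.0f ms\n", ramp_ms(32, 0.95)); /* ~138 */
            printf("HL 16ms to 95%% of max: ~%.0f ms\n", ramp_ms(16, 0.95)); /* ~69  */
            printf("HL  8ms to 95%% of max: ~%.0f ms\n", ramp_ms(8, 0.95));  /* ~35  */
            return 0;
    }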
On 08/09/2023 02:17, Qais Yousef wrote: > On 09/07/23 15:08, Peter Zijlstra wrote: >> On Mon, Aug 28, 2023 at 12:31:56AM +0100, Qais Yousef wrote: [...] > But for the 0.8 and 1.25 margin problems, actually the problem is that 25% is > too aggressive/fast and wastes power. I'm actually slowing things down as > a result of this series. And I'm expecting some not to be happy about it on > their systems. The response_time_ms was my way to give back control. I didn't > see how I can make things faster and slower at the same time without making > decisions on behalf of the user/sysadmin. > > So the connection I see between PELT and the margins or headrooms in > fits_capacity() and map_util_perf()/dvfs_headroom is that they expose the need > to manage the perf/power trade-off of the system. > > Particularly the default is not good for the modern systems, Cortex-X is too > powerful but we still operate within the same power and thermal budgets. > > And what was a high end A78 is a mid core today. So if you look at today's > mobile world topology we really have a tiy+big+huge combination of cores. The > bigs are called mids, but they're very capable. Fits capacity forces migration > to the 'huge' cores too soon with that 80% margin. While the 80% might be too > small for the tiny ones as some workloads really struggle there if they hang on > for too long. It doesn't help that these systems ship with 4ms tick. Something > more to consider changing I guess. If this is the problem then you could simply make the margin (headroom) a function of cpu_capacity_orig? [...] > There's a question that I'm struggling with if I may ask. Why is it perceived > our constant response time (practically ~200ms to go from 0 to max) as a good > fit for all use cases? Capability of systems differs widely in terms of what > performance you get at say a util of 512. Or in other words how much work is > done in a unit of time differs between system, but we still represent that work > in a constant way. A task ran for 10ms on powerful System A would have done PELT (util_avg) is uarch & frequency invariant. So e.g. a task with util_avg = 256 could have a runtime/period on big CPU (capacity = 1024) of 4ms/16ms on little CPU (capacity = 512) of 8ms/16ms The amount of work in invariant (so we can compare between asymmetric capacity CPUs) but the runtime obviously differs according to the capacity. [...]
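One possible reading of the 'margin as a function of cpu_capacity_orig' suggestion, as a sketch only (the scaling curve below is an arbitrary example, not something proposed in this thread):

    static inline bool fits_capacity_scaled(unsigned long util, int cpu)
    {
            unsigned long cap = arch_scale_cpu_capacity(cpu);  /* cpu_capacity_orig */
            /* bigger headroom on smaller CPUs: ~1.34x at cap=128, 1.25x at
             * cap=512, ~1.13x at cap=1024 (example curve, in 1/1024 units) */
            unsigned long margin = 1024 + 128 + ((1024 - cap) >> 2);

            return util * margin < cap * 1024;
    }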
On Fri, Sep 08, 2023 at 01:17:25AM +0100, Qais Yousef wrote: > Just to be clear, my main issue here with the current hardcoded values of the > 'margins'. And the fact they go too fast is my main problem. So I stripped the whole margin thing from my reply because I didn't want to comment on that yet, but yes, I can see how those might be a problem, and you're changing them into something dynamic, not just removing them. The tunables is what I worry most about. The moment we expose knobs it becomes really hard to change things later. > UTIL_EST_FASTER moves in one direction. And it's a constant response too, no? The idea of UTIL_EST_FASTER was that we run a PELT sum on the current activation runtime, all runtime since wakeup and take the max of this extra sum and the regular thing. On top of that this extra PELT sum can/has a time multiplier and thus ramps up faster (this multiplies could be per task). Nb.: util_est_fast = faster_est_approx(delta * 2); is a state-less expression -- by making util_est_fast = faster_est_approx(delta * curr->se.faster_mult); only the current task is affected. > I didn't get the per-task configurability part. AFAIU we can't turn off these > sched-features if they end up causing power issues. That what makes me hesitant > about them. See above, the extra sum is (fundamentally) per task, the multiplier could be per task, if you set the multiplier to <=1, you'll never gain on the existing sum and the max filter makes that the feature is effectively disabled for the one task. It of course gets us the problem of how to set the new multiplier... ;-) > There's a bias towards perf. But some systems prefer to save power > at the expense of perf. There's a lot of grey areas in between to what > perceived as a suitable trade-off for perf vs power. There are cases like above > where actually you can lower freqs without hit on perf. But most of the time > it's a trade-off; and some do decide to drop perf in favour of power. Keep in > mind battery capacity differs between systems with the same SoC even. Some ship > to enable more perf, others are more constrained and opt to be more efficient. It always depends on the workload too -- you want different trade-offs for different tasks. > > I'm *really* hesitant on adding all these mostly random knobs -- esp. > > without strong justification -- which you don't present. You mostly seem > > to justify things with: people do random hack, we should legitimize them > > hacks. > > I share your sentiment and I am trying to find out what's the right thing to do > really. I am open to explore other territories. But from what I see there's > a real need to give users the power to tune how responsive their system needs > to be. I can't see how we can have one size that fits all here given the > different capabilities of the systems and the desired outcome (I want more perf > vs more efficiency). This is true; but we also cannot keep adding random knobs. Knobs that are very specific are hard constraints we've got to live with. Take for instance uclamp, that's not something we can ever get rid of I think (randomly picking on uclamp, not saying I'm hating on it). From an actual interface POV, the unit-less generic energy-vs-perf knob is of course ideal, one global and one per task and then we can fill out the details as we see fit. System integrators (you say users, but really, not a single actual user will use any of this) can muck about and see what works for them. 
(even hardware has these things today, we get 0-255 values that do 'something' uarch specific) > The problem is that those 0.8 and 1.25 margins forces unsuitable default. The > case I see the most is it is causing wasting power that tuning it down regains > this power at no perf cost or small one. Others actually do tune it for faster > response, but I can't cover this case in detail. All I know is lower end > systems do struggle as they don't have enough oomph. I also saw comparison on > phoronix where schedutil is not doing as good still - which tells me it seems > server systems do prefer to ramp up faster too. I think that PELT thread is > a variation of the same problem. > > So one of the things I saw is a workload where it spends majority of the time > in 600-750 util_avg range. Rarely ramps up to max. But the workload under uses > the medium cores and runs at a lot higher freqs than it really needs on bigs. > We don't end up utilizing our resources properly. So that is actually a fairly solid argument for changing things up, if the margin causes us to neglect mid cores then that needs fixing. But I don't think that means we need a tunable. After all, the system knows it has 3 capacities, it just needs to be better at mapping workloads to them. It knows how much 'room' there is between a mid and a large. If 1.25*mid > large we in trouble etc.. > There's a question that I'm struggling with if I may ask. Why is it perceived > our constant response time (practically ~200ms to go from 0 to max) as a good > fit for all use cases? Capability of systems differs widely in terms of what > performance you get at say a util of 512. Or in other words how much work is > done in a unit of time differs between system, but we still represent that work > in a constant way. A task ran for 10ms on powerful System A would have done > a lot more work than running on poor System B for the same 10ms. But util will > still rise the same for both cases. If someone wants to allow this task to be > able to do more on those 10ms, it seems natural to be able to control this > response time. It seems this thinking is flawed for some reason and I'd > appreciate a help to understand why. I think a lot of us perceive this problem > this way. I think part of the problem is that todays servers are tomorrow's smartphones. Back when we started all this PELT nonsense computers in general were less powerful than they are now, yet todays servers are no less busy than they were back then. Give us compute, we'll fill it. Now, smartphones in particular are media devices, but a large part of the server farms are indirectly interactive too, you don't want your search query to take too long, or your bookface page stuck loading, or you twatter message about your latest poop not being insta read by your mates. That is, much of what we do with the computers, ever more powerful or not, is eventually measured in human time perception. So yeah, 200ms. Remember, all this PELT nonsense was created for cgroups, to distribute shared between CPUs as load demanded. I think for that purpose it still sorta makes sense. Ofc we've added a few more users over time, because if you have this data, might as well use it etc. I'm not sure we really sat down and analyzed if the timing all made sense. And as I argued elsewhere, PELT is a running average, but DVFS might be better suited with a max filter. > Happy to go and try to dig more info if this is still not clear enough. 
So I'm not generally opposed to changing things -- but I much prefer to have an actual problem driving that change :-)
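An illustration of the UTIL_EST_FASTER shape described above (built on the same PELT ramp approximation as the earlier snippets; faster_est_approx(), the runtime-since-wakeup input and the per-task multiplier follow the description in the message, everything else is assumed):

    #include <math.h>

    /* a PELT-shaped ramp fed with the runtime accumulated since last wakeup */
    static double faster_est_approx(double runtime_ms)
    {
            const double y = pow(0.5, 1.0 / 32.0);  /* default 32ms half-life */

            return 1024.0 * (1.0 - pow(y, runtime_ms));
    }

    static double util_est_faster(double runtime_since_wakeup_ms,
                                  double faster_mult, double regular_util)
    {
            /* a multiplier > 1 ramps faster; <= 1 can never win the max
             * below, which effectively disables the feature for that task */
            double fast = faster_est_approx(runtime_since_wakeup_ms * faster_mult);

            return fast > regular_util ? fast : regular_util;
    }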
On 9/7/23 15:45, Lukasz Luba wrote: >>>> RT literatur mostly methinks. Replacing WCET with a statistical model of >>>> sorts is not uncommon, the argument goes that not everybody will have >>>> their worst case at the same time and lows and highs can commonly cancel >>>> out and this way we can cram a little more on the system. >>>> >>>> Typically this is proposed in the context of soft-realtime systems. >>> >>> Thanks Peter, I will dive into some books... >> >> I would look at academic papers, not sure any of that ever made it to >> books, Daniel would know I suppose. > > Good hint, thanks! The key-words that came to my mind are: - mk-firm, where you accept m tasks will make their deadline every k execution - like, because you run too long. - mixed criticality with pWCET (probabilistic execution time) or average execution time + an sporadic tail execution time for the low criticality part. mk-firm smells like 2005's.. mixed criticality as 2015's..present. You will probably find more papers than books. Read the papers as a source for inspiration... not necessarily as a definitive solution. They generally proposed too restrictive task models. -- Daniel
On 09/08/23 12:25, Peter Zijlstra wrote: > On Fri, Sep 08, 2023 at 01:17:25AM +0100, Qais Yousef wrote: > > > Just to be clear, my main issue here with the current hardcoded values of the > > 'margins'. And the fact they go too fast is my main problem. > > So I stripped the whole margin thing from my reply because I didn't want > to comment on that yet, but yes, I can see how those might be a problem, > and you're changing them into something dynamic, not just removing them. The main difficulty is that if you try to apply those patches on their own, I'm sure you'll notice a difference. So if we were to take this alone and put them on linux-next; I expect a few regression reports for those who run with schedutil. Any ST oriented workload will not be happy. But if we compensate to reduce the regression, my problem will re-appear, just for a different reason. So whack-a-mole. I didn't know how to make both happy without being dynamic, hence the RFC to hopefully get some help and insights on how to resolve this. I think I'm hovering around the right solutions, but not sure if I'm there yet. Some implementation details certainly still need ironing out. I genuinely think that we should be more conservative in adding those hardcoded numbers without making them a function of real limitation. TEO util threshold for instance has a similar problem to these margins. I backported them to 5.10 and 5.15 and not long after I had to introduce knobs to allow tuning them as power regression reports surfaced. The good news it wasn't a full revert; the bad news those numbers seemed best for a class for workloads on a particular system, but on another system and different workloads, the reality will be different. And of course because Android has out of tree patches; I need to spend a good amount of time before I can report back properly to ensure the root cause is identified correctly. I will risk a potentially incorrect statement, but I do hold to question the validity of these hardcoded numbers on all systems and all workloads. I am not sure we can avoid being dynamic; and personally I prefer to delegate more to userspace and make it their problem to manage this dynamism. But by providing the right tools of course :) I think they need to earn their best perf/watt too; not let the kernel do all the dirty work, hehe. > The tunables is what I worry most about. The moment we expose knobs it > becomes really hard to change things later. I'm not particularly attached to them to be honest. But at the same time I am not sure if we can get away without giving the user some power to decide. What I propose what seemed the most sensible way to do it. But really open to explore alternatives and I do indeed need help to find this. Generally; I think userspace expects too much automagic and the programming model is ancient and not portable and we end up overcompensating for that in the kernel. So giving them some power is necessary the way I see it, but the shape and form it should take is debatable for sure. I don't claim to have the right answer but happy to explore and experiment to get the right ones identified and done :-) > > > UTIL_EST_FASTER moves in one direction. And it's a constant response too, no? > > The idea of UTIL_EST_FASTER was that we run a PELT sum on the current > activation runtime, all runtime since wakeup and take the max of this > extra sum and the regular thing. > > On top of that this extra PELT sum can/has a time multiplier and thus > ramps up faster (this multiplies could be per task). 
Nb.: > > util_est_fast = faster_est_approx(delta * 2); > > is a state-less expression -- by making > > util_est_fast = faster_est_approx(delta * curr->se.faster_mult); > > only the current task is affected. Okay; maybe I didn't understand this fully and will go back and study it more. Maybe the word faster is what makes me worried as I really see faster is not what people want on a class of systems; or at least CPUs if you think of HMP. Taming the beast is a more difficult problem in this class of systems. So if I get it correctly; we will slow things down by removing these margins, but people who suffer from this slow down will need to use util_est_faster to regain the difference, right? > > > I didn't get the per-task configurability part. AFAIU we can't turn off these > > sched-features if they end up causing power issues. That what makes me hesitant > > about them. > > See above, the extra sum is (fundamentally) per task, the multiplier > could be per task, if you set the multiplier to <=1, you'll never gain on > the existing sum and the max filter makes that the feature is > effectively disabled for the one task. Gotch ya. I think this could work, but it also seems to overlap with what we can get already with uclamp. If we can tell this task needs a faster multiplier, we can tell that it needs better uclamp_min and do that instead? When should we use one over the other if we add both? The challenging bit in practice is when we need to get some generic auto response for all these workloads that just expect the system to give them what they want without collaboration. I really do hope we can provide alternative to make these expectations obselete and just be able to tell userspace your app is not portable, go fix it; but we're not there yet. And another selfish reason; analysing workloads is harder with these. We have a lot of mechanisms on top of each others and reasoning about a cause of a power issue in particular becomes a lot harder when something goes wrong on one of these and gets bubbled up in subtle ways. Perf issues tend to be more obvious; but if something cause power or bad thermals, then finding out if there's sub optimality is hard. And if I find one, fixing it will be hard too. > It of course gets us the problem of how to set the new multiplier... ;-) I am actually trying to write a proposal for a generic QoS interface that we potentially can plumb these things into (main motivation is wake up latency control with eevdf - but seems you might be pushing something out soon). My perception of the reality is that userspace is stuck on old programming model and *a lot* of bad habits. But I think it is about time for it to get smarter and more collaborative. But this necessities we give some mechanisms to enable this smarter approach. > > > There's a bias towards perf. But some systems prefer to save power > > at the expense of perf. There's a lot of grey areas in between to what > > perceived as a suitable trade-off for perf vs power. There are cases like above > > where actually you can lower freqs without hit on perf. But most of the time > > it's a trade-off; and some do decide to drop perf in favour of power. Keep in > > mind battery capacity differs between systems with the same SoC even. Some ship > > to enable more perf, others are more constrained and opt to be more efficient. > > It always depends on the workload too -- you want different trade-offs > for different tasks. Indeed. 
We are trying to push for better classification of workloads so that we can tell with reasonable confidence a certain trade-off is better for them. Which what it really helps with is enable better use of resources with the pre-knowledge that the current user experience won't be impacted. Again, I really ultimately would love to see userspace becoming smarter and delegate the task of writing portable and scalable software that works across systems without the need for guess work and hand tuning. I think we're making good steps in that direction, but we still need a lot more effort. > > > > I'm *really* hesitant on adding all these mostly random knobs -- esp. > > > without strong justification -- which you don't present. You mostly seem > > > to justify things with: people do random hack, we should legitimize them > > > hacks. > > > > I share your sentiment and I am trying to find out what's the right thing to do > > really. I am open to explore other territories. But from what I see there's > > a real need to give users the power to tune how responsive their system needs > > to be. I can't see how we can have one size that fits all here given the > > different capabilities of the systems and the desired outcome (I want more perf > > vs more efficiency). > > This is true; but we also cannot keep adding random knobs. Knobs that > are very specific are hard constraints we've got to live with. Take for > instance uclamp, that's not something we can ever get rid of I think > (randomly picking on uclamp, not saying I'm hating on it). I'm really open to explore alterantives. But need help to find them. I'm also trying to simplify kernel responsibilities by delegating more to uerspace. It could be a personal mental hung up, but I can't see how can we have one size fits all. Almost all types of systems are expected to do a lot of varying workloads and both hardware and software are moving at faster pace, but programming model is pretty much the same. The response_time_ms in schedutil seemed a reasonable knob to me as it directly tells the user how fast they rampup for that policy. It can be done once at boot, or if someone has knowledge about workloads they can be smart and find the best ones for them on a particular system. The good news for us in the kernel is that we won't care. uclamp for really smart per task control, and this knob for some hand tuning for those who don't have alternatives is the way I see it. > > From an actual interface POV, the unit-less generic energy-vs-perf knob I can see this working for mobile as SoC vendors/OEM can get energy for their systems and define these curves properly. But average joe will lose out. For example M1 mac mini doesn't have energy model actually defined. I do have energy meter so I hope to be able to do some measurement, but not sure if I can get accurate numbers out. x86 and other archs don't tend to produce as good energy-vs-perf curves like we tend to see in mobile world (maybe they do and I'm just ignorant, apologies if this ends up being a bad blanket statement). Don't you think we could end up making the bar high to define this knob? It is less intuitive too, but this is less of a problem maybe. > is of course ideal, one global and one per task and then we can fill out > the details as we see fit. 
> System integrators (you say users, but

Can certainly look at that, and it sounds reasonable to me, bar the issues
above about it requiring more effort. A good class of Linux users might not see
these definitions on their systems, as there are no real system integrators for
a large class of desktop/laptop systems. It'd be nice to make the programming
experience coherent and readily available, if possible. I think these systems
are losing out.

> really, not a single actual user will use any of this) can muck about
> and see what works for them.

Yes, I mean system integrator. I say users maybe because I think of desktops
too, where the integrator is the end user. I do hope to see more vendors ship
tuned Linux desktops/laptops like we see in the Android world. Servers probably
have an army of people managing them anyway.

>
> (even hardware has these things today, we get 0-255 values that do
> 'something' uarch specific)

Ah, could I get some pointers please?

>
> > The problem is that those 0.8 and 1.25 margins forces unsuitable default. The
> > case I see the most is it is causing wasting power that tuning it down regains
> > this power at no perf cost or small one. Others actually do tune it for faster
> > response, but I can't cover this case in detail. All I know is lower end
> > systems do struggle as they don't have enough oomph. I also saw comparison on
> > phoronix where schedutil is not doing as good still - which tells me it seems
> > server systems do prefer to ramp up faster too. I think that PELT thread is
> > a variation of the same problem.
> >
> > So one of the things I saw is a workload where it spends majority of the time
> > in 600-750 util_avg range. Rarely ramps up to max. But the workload under uses
> > the medium cores and runs at a lot higher freqs than it really needs on bigs.
> > We don't end up utilizing our resources properly.
>
> So that is actually a fairly solid argument for changing things up, if
> the margin causes us to neglect mid cores then that needs fixing. But I
> don't think that means we need a tunable. After all, the system knows it
> has 3 capacities, it just needs to be better at mapping workloads to
> them. We can fix the misfit capacity without a tunable, I believe.

I just know from past discussions that those low end systems like these margins
to be large. And the PELT halflife boot parameter is there to help address this
potential issue. Happy to leave it out and leave it to someone who cares to
come and complain. But from a theoretical point of view I can see the problem
of slow response on those systems. And capacities don't tell us much about
whether this is a high end SoC or a lower end SoC. Nor does util or anything
else we have in the system today, to my knowledge at least.

>
> It knows how much 'room' there is between a mid and a large. If 1.25*mid

Ideally we should end up distributing on mids and bigs for the capacity region
that overlaps. I do see that the need to have the margin is related to misfit
migration, and we can fix it by improving the definition of this relationship.
I'm not sure if I implemented it in the best way, but I think the definition
I'm proposing makes sense and removes guesswork.

If the task is 600 and fits in both mids and bigs, why should we skip the mids
as a candidate if no misfit can happen very soon, i.e. by the next tick? If the
current implementation is expensive I think I can make it cheaper. But if no
misfit can happen within a tick, I think we need to consider those CPUs as
candidates.
On a slightly related problem that I avoided bringing up but maybe a good time now. I see the definition of overutilized is stale too. It is a wrapper around fits_capacity(), or misfit detection. It is very easy for a single busy task to trigger overutilized. And if this task is background and capped by cpuset to little cores, then we end up overutilized until it decides to go back to sleep. Not ideal. I think the definition needs revisiting too, but I have no idea how yet. It should be more of a function of the current system state rather than tightly coupled with misfit detection. EAS is disabled when we're overutilized and default spreading behavior can be expensive in terms of power. > > large we in trouble etc.. > > > There's a question that I'm struggling with if I may ask. Why is it perceived > > our constant response time (practically ~200ms to go from 0 to max) as a good > > fit for all use cases? Capability of systems differs widely in terms of what > > performance you get at say a util of 512. Or in other words how much work is > > done in a unit of time differs between system, but we still represent that work > > in a constant way. A task ran for 10ms on powerful System A would have done > > a lot more work than running on poor System B for the same 10ms. But util will > > still rise the same for both cases. If someone wants to allow this task to be > > able to do more on those 10ms, it seems natural to be able to control this > > response time. It seems this thinking is flawed for some reason and I'd > > appreciate a help to understand why. I think a lot of us perceive this problem > > this way. > > I think part of the problem is that todays servers are tomorrow's > smartphones. Back when we started all this PELT nonsense computers in > general were less powerful than they are now, yet todays servers are no > less busy than they were back then. > > Give us compute, we'll fill it. Hehe, yep! > > Now, smartphones in particular are media devices, but a large part of > the server farms are indirectly interactive too, you don't want your > search query to take too long, or your bookface page stuck loading, or > you twatter message about your latest poop not being insta read by your > mates. > > That is, much of what we do with the computers, ever more powerful or > not, is eventually measured in human time perception. Sadly I do think a lot of workloads make bad assumptions about hardware and kernel services and I think the past trend has been to compensate for this in the kernel but the true problem IMHO is that our current programming model is stale and programs are carrying old bad habits that are no longer valid. As a simple example a lot has struggled with HMP systems as they were assuming if I have X cores then I can spawn X threads and do my awesome parallel work. They of course got caught out badly. They used affinity later to be smart about which cores, but then as noted earlier the bigs are power hungry and now they can easily end up in power and thermal issues because the past assumptions are no longer true. By the way even the littles can be power hungry at top frequencies. So any form of pinning done is causing problems. They just can't make assumptions. But what to do instead then? ADPF and uclamp is the way to address this and make portable software and that's what being pushed for. But flushing these old habits out will take time. Beside I think we still have ironing out work to be done. 
Generally even in desktop/laptop/server, programmers seem to think they're the
only active app and get greedy when creating tasks, ending up tripping over
themselves. We need a smart middleware to manage these; or a new programming
model to abstract these details. I don't know how, but the status quo is that
the programming model is lagging behind. I think Windows and Mac OS/iOS do
provide some more tightly integrated interfaces for apps; see Grand Central
Dispatch for instance on Apple OSes.

>
> So yeah, 200ms.
>
> Remember, all this PELT nonsense was created for cgroups, to distribute
> shared between CPUs as load demanded. I think for that purpose it still
> sorta makes sense.
>
> Ofc we've added a few more users over time, because if you have this
> data, might as well use it etc. I'm not sure we really sat down and
> analyzed if the timing all made sense.

I think if we want to distill the problem to its basic form, it's a timing
issue. Too fast, we lose power. Too slow, we lose perf. And we don't have a way
to scale perf per system; i.e. we don't know what absolute perf we end up
getting, and I'm not sure we can provide that at all without hardware
extensions. So that's why we end up scaling time, and end up with those related
knobs.

>
> And as I argued elsewhere, PELT is a running average, but DVFS might be
> better suited with a max filter.

Sorry, I didn't catch up with all the other replies yet, but I will think about
how to incorporate all of that. I think the major issue is that we do need to
both speed up and slow down. And as long as we are able to achieve that, I'm
really fine to explore options. What I'm presenting here is what truly seemed
to me the best. But I need help and feedback to do better :-)

>
> > Happy to go and try to dig more info if this is still not clear enough.
>
> So I'm not generally opposed to changing things -- but I much prefer to
> have an actual problem driving that change :-)

Good to know. If you think the info I shared is still not good enough, I can
look for more examples. I think my main goal here is really to discuss the
problem, and my proposed solution is a way to demonstrate both the problem and
a possible way to fix it, so I'm not just complaining but actively looking for
fixes too :-) I don't claim to have all the right answers, but I'm certainly
happy to follow this through to make sure we fix the problem properly.
Hopefully not just for me, but for all those who've been struggling with
similar problems.


Thanks!

--
Qais Yousef
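For reference, the "200ms" above falls out of the PELT math rather than being
a number anyone chose directly. With the default 32ms halflife, a task that
runs continuously starting from util 0 grows (back of the envelope, ignoring
the 1024us segment granularity) roughly as:

	util(t) ~= 1024 * (1 - 2^(-t_ms/32))

so reaching a given level takes about 32 * log2(1024 / (1024 - util)) ms:
~74ms to hit 80% of max, ~138ms for 95%, and ~215ms to get within 1% of max --
hence "practically ~200ms" from 0 to max. Halving the halflife to 16ms halves
those times, and 8ms quarters them, which is what the PELT halflife
experiments mentioned above are effectively trading against power.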
On Fri, Sep 08, 2023 at 02:33:36PM +0100, Qais Yousef wrote: > > > UTIL_EST_FASTER moves in one direction. And it's a constant response too, no? > > > > The idea of UTIL_EST_FASTER was that we run a PELT sum on the current > > activation runtime, all runtime since wakeup and take the max of this > > extra sum and the regular thing. > > > > On top of that this extra PELT sum can/has a time multiplier and thus > > ramps up faster (this multiplies could be per task). Nb.: > > > > util_est_fast = faster_est_approx(delta * 2); > > > > is a state-less expression -- by making > > > > util_est_fast = faster_est_approx(delta * curr->se.faster_mult); > > > > only the current task is affected. > > Okay; maybe I didn't understand this fully and will go back and study it more. > > Maybe the word faster is what makes me worried as I really see faster is not > what people want on a class of systems; or at least CPUs if you think of HMP. > Taming the beast is a more difficult problem in this class of systems. The faster refers to the ramp-up. Which was the issue identified in that earlier thread. The game thing wanted to ramp up more agressive.
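To make that shape concrete, here is a rough and completely untested sketch of
the idea as described above. Only faster_est_approx() is taken from the snippet
quoted above; faster_mult and the last_wakeup timestamp are made-up
illustrative names, so don't read this as the actual patch:

	/*
	 * Extra estimate based purely on how long the task has been running
	 * since it last woke up, pushed through the PELT curve with a
	 * per-task multiplier. Per the description above, a multiplier <= 1
	 * never gains on the regular sum, so the max filter effectively
	 * disables the boost for that task.
	 */
	static unsigned long util_est_faster(struct task_struct *p, u64 now)
	{
		u64 delta = now - p->se.last_wakeup;	/* hypothetical field */
		unsigned long fast;

		fast = faster_est_approx(delta * p->se.faster_mult);

		return max(fast, _task_util_est(p));
	}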
On Fri, Sep 08, 2023 at 02:33:36PM +0100, Qais Yousef wrote: > > (even hardware has these things today, we get 0-255 values that do > > 'something' uarch specific) > > Ah, could I get some pointers please? Intel HWP.EPP and AMD CPPC EPP I think.. both intel-pstate and amd-pstate have EPP thingies.
On 09/08/23 09:40, Dietmar Eggemann wrote: > On 08/09/2023 02:17, Qais Yousef wrote: > > On 09/07/23 15:08, Peter Zijlstra wrote: > >> On Mon, Aug 28, 2023 at 12:31:56AM +0100, Qais Yousef wrote: > > [...] > > > But for the 0.8 and 1.25 margin problems, actually the problem is that 25% is > > too aggressive/fast and wastes power. I'm actually slowing things down as > > a result of this series. And I'm expecting some not to be happy about it on > > their systems. The response_time_ms was my way to give back control. I didn't > > see how I can make things faster and slower at the same time without making > > decisions on behalf of the user/sysadmin. > > > > So the connection I see between PELT and the margins or headrooms in > > fits_capacity() and map_util_perf()/dvfs_headroom is that they expose the need > > to manage the perf/power trade-off of the system. > > > > Particularly the default is not good for the modern systems, Cortex-X is too > > powerful but we still operate within the same power and thermal budgets. > > > > And what was a high end A78 is a mid core today. So if you look at today's > > mobile world topology we really have a tiy+big+huge combination of cores. The > > bigs are called mids, but they're very capable. Fits capacity forces migration > > to the 'huge' cores too soon with that 80% margin. While the 80% might be too > > small for the tiny ones as some workloads really struggle there if they hang on > > for too long. It doesn't help that these systems ship with 4ms tick. Something > > more to consider changing I guess. > > If this is the problem then you could simply make the margin (headroom) > a function of cpu_capacity_orig? I don't see what you mean. instead of capacity_of() but keep the 80%? Again, I could be delusional and misunderstanding everything, but what I really see fits_capacity() is about is misfit detection. But a task is not really misfit until it actually has a util above the capacity of the CPU. Now due to implementation details there can be a delay between the task crossing this capacity and being able to move it. Which what I believe this headroom is trying to achieve. I think we can better define this by tying this headroom to the worst case scenario it takes to actually move this misfit task to the right CPU. If it can continue to run without being impacted with this delay and crossing the capacity of the CPU it is on, then we should not trigger misfit IMO. > > [...] > > > There's a question that I'm struggling with if I may ask. Why is it perceived > > our constant response time (practically ~200ms to go from 0 to max) as a good > > fit for all use cases? Capability of systems differs widely in terms of what > > performance you get at say a util of 512. Or in other words how much work is > > done in a unit of time differs between system, but we still represent that work > > in a constant way. A task ran for 10ms on powerful System A would have done > > PELT (util_avg) is uarch & frequency invariant. > > So e.g. a task with util_avg = 256 could have a runtime/period > > on big CPU (capacity = 1024) of 4ms/16ms > > on little CPU (capacity = 512) of 8ms/16ms > > The amount of work in invariant (so we can compare between asymmetric > capacity CPUs) but the runtime obviously differs according to the capacity. Yep! Cheers -- Qais Yousef
On 09/08/23 15:59, Peter Zijlstra wrote: > On Fri, Sep 08, 2023 at 02:33:36PM +0100, Qais Yousef wrote: > > > > (even hardware has these things today, we get 0-255 values that do > > > 'something' uarch specific) > > > > Ah, could I get some pointers please? > > Intel HWP.EPP and AMD CPPC EPP I think.. both intel-pstate and > amd-pstate have EPP thingies. Okay, thanks! So do you see tying this to the presence of some hardware mechanisms and provide a fallback for the other systems to define it somehow would be the best way to explore this? Cheers -- Qais Yousef
On 09/07/23 08:48, Lukasz Luba wrote:
> They are periodic in a sense, they wake up every 16ms, but sometimes
> they have more work. It depends what is currently going in the game
> and/or sometimes the data locality (might not be in cache).
>
> Although, that's for games, other workloads like youtube play or this
> one 'Yahoo browser' (from your example) are more 'predictable' (after
> the start up period). And I really like the potential energy saving
> there :)

It is more complicated than that from what I've seen. Userspace is sadly
bloated and the relationships between the tasks are a lot more complex. They
talk to other framework elements, other hardware, have network elements coming
in, and, specifically for gaming, could be preparing multiple frames in
parallel. The task wake up and sleep times are not that periodic. A task can
busy loop for periods of time, or at other times wake up for short periods of
time (the pattern of which might not be on point, as it interacts with other
elements in a serial manner where one prepares something and can take
a variable time every wake up before handing it over to the next task).
Browsers can be tricky as well: it depends on when the user scrolls, what
elements appear, what JavaScript will execute and how heavy it is, how many
tabs have webpages open, and how the user is moving between them. It is
organized chaos :-)

> >
> > I think the model of a periodic task is not suitable for most workloads. All
> > of them are dynamic and how much they need to do at each wake up can very
> > significantly over 10s of ms.
>
> Might be true, the model was built a few years ago when there wasn't
> such dynamic game scenario with high FPS on mobiles. This could still
> be tuned with your new design IIUC (no need extra hooks in Android).

It is my perception of course. But I think generally, not just for gaming,
there are a lot of elements interacting with each other in a complex way. The
wake up time and length is determined by these complex elements; and it is
a very dynamic interaction where they could get into a steady state for a very
short period of time but could change quickly. As an extreme example, a player
could be standing in an empty room doing nothing, but another player in
another part of the world launches a rocket at this room, and we only get to
know when the network packet arrives that we have to draw a big explosion.
A lot of workloads are interactive, and these moments of interaction are the
challenging ones.

> >
> > > 2. Plumb in this new idea of dvfs_update_delay as the new
> > > 'margin' - this I don't understand
> > >
> > > For the 2. I don't see that the dvfs HW characteristics are best
> > > for this problem purpose. We can have a really fast DVFS HW,
> > > but we need some decent spare idle time in some workloads, which
> > > are two independent issues IMO. You might get the higher
> > > idle time thanks to 1.1. but this is a 'side effect'.
> > >
> > > Could you explain a bit more why this dvfs_update_delay is
> > > crucial here?
> >
> > I'm not sure why you relate this to idle time. And the word margin is a bit
> > overloaded here. so I suppose you're referring to the one we have in
> > map_util_perf() or apply_dvfs_headroom(). And I suppose you assume this extra
> > headroom will result in idle time, but this is not necessarily true IMO.
> >
> > My rationale is simply that DVFS based on util should follow util_avg as-is.
> > But as pointed out in different discussions happened elsewhere, we need to > > provide a headroom for this util to grow as if we were to be exact and the task > > continues to run, then likely the util will go above the current OPP before we > > get a chance to change it again. If we do have an ideal hardware that takes > > Yes, this is another requirement to have +X% margin. When the tasks are > growing, we don't know their final util_avg and we give them a bit more > cycles. > IMO we have to be ready always for such situation in the scheduler, > haven't we? Yes we should. I think I am not ignoring this part. Hope I clarified things offline. Cheers -- Qais Yousef
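For anyone trying to map the discussion to the code: the headroom in question
is currently just a fixed +25% applied right before util is translated into
a frequency. Roughly, schedutil's get_next_freq() does (the helper is called
map_util_perf() in mainline, apply_dvfs_headroom() in this thread):

	util = util + (util >> 2);			/* the magic 1.25 headroom */
	freq = map_util_freq(util, max_freq, cap);	/* i.e. max_freq * util / cap */

so the request is always padded by the same 25%, independent of how fast the
hardware can actually change frequency -- which is what the dvfs_update_delay
discussion above is about.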
On 09/07/23 13:53, Peter Zijlstra wrote: > On Thu, Sep 07, 2023 at 08:48:08AM +0100, Lukasz Luba wrote: > > > > Hehe. That's because they're not really periodic ;-) > > > > They are periodic in a sense, they wake up every 16ms, but sometimes > > they have more work. It depends what is currently going in the game > > and/or sometimes the data locality (might not be in cache). > > > > Although, that's for games, other workloads like youtube play or this > > one 'Yahoo browser' (from your example) are more 'predictable' (after > > the start up period). And I really like the potential energy saving > > there :) > > So everything media is fundamentally periodic, you're hard tied to the > framerate / audio-buffer size etc.. > > Also note that the traditional periodic task model from the real-time > community has the notion of WCET, which completely covers this > fluctuation in frame-to-frame work, it only considers the absolute worst > case. > > Now, practically, that stinks, esp. when you care about batteries, but > it does not mean these tasks are not periodic. piecewise periodic? > Many extentions to the periodic task model are possible, including > things like average runtime with bursts etc.. all have their trade-offs. The challenge we have is the endless number of workloads we need to cater for.. Or you think one of these models can actually scale to that? Thanks! -- Qais Yousef
On 09/07/23 15:26, Peter Zijlstra wrote:
> On Wed, Sep 06, 2023 at 10:18:50PM +0100, Qais Yousef wrote:
>
> > This is probably controversial statement. But I am not in favour of util_est.
> > I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as
> > default instead. But I will need to do a separate investigation on that.
>
> I think util_est makes perfect sense, where PELT has to fundamentally

My concern about it is that it has an inherent bias towards perf, and in the
soup of tasks running in the system, not all of them care about perf. The key
tasks tend to be the minority, I'd say. Again, I need to do more
investigations, but my worry mainly comes from that and what impact it could
have on power.

In an ideal world where userspace is fully uclamp aware, we shouldn't need it.
The task can set uclamp_min to make sure it sees the right performance at wake
up. And depending on the outcome of this discussion, we might need to introduce
something to help speed up/slow down migration more selectively. So it could
become a redundant control.

> decay non-running / non-runnable tasks in order to provide a temporal
> average, DVFS might be best served with a termporal max filter.

Ah, I certainly don't think DVFS needs the PELT HALFLIFE to be controlled.
I only see it being useful on HMP systems, under-powered ones specifically,
that really need faster *migration* times. Maybe we can find a better way to
control this. I'll think about it.

Not sure about the temporal max. Isn't this a bias towards perf first too?


Thanks!

--
Qais Yousef
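Purely to illustrate what a "temporal max filter" could mean in practice, an
untested sketch (the window length and all names here are invented):

	struct util_max_filter {
		unsigned long	util_max;
		u64		stamp;
	};

	#define UTIL_MAX_FILTER_NS	(100 * NSEC_PER_MSEC)

	/*
	 * Feed in the freshly updated util_avg, get back the max value seen
	 * over the last window. Driving DVFS from this instead of the
	 * decaying average means short dips (e.g. between activations of
	 * a periodic task) don't immediately drop the frequency, while
	 * a genuinely long idle spell lets it fall once the old max ages out.
	 */
	static unsigned long util_max_filter(struct util_max_filter *f,
					     unsigned long util, u64 now)
	{
		if (util >= f->util_max || now - f->stamp > UTIL_MAX_FILTER_NS) {
			f->util_max = util;
			f->stamp = now;
		}

		return f->util_max;
	}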
On 09/07/23 15:42, Lukasz Luba wrote: > > > On 9/7/23 15:29, Peter Zijlstra wrote: > > On Thu, Sep 07, 2023 at 02:57:26PM +0100, Lukasz Luba wrote: > > > > > > > > > On 9/7/23 14:26, Peter Zijlstra wrote: > > > > On Wed, Sep 06, 2023 at 10:18:50PM +0100, Qais Yousef wrote: > > > > > > > > > This is probably controversial statement. But I am not in favour of util_est. > > > > > I need to collect the data, but I think we're better with 16ms PELT HALFLIFE as > > > > > default instead. But I will need to do a separate investigation on that. > > > > > > > > I think util_est makes perfect sense, where PELT has to fundamentally > > > > decay non-running / non-runnable tasks in order to provide a temporal > > > > average, DVFS might be best served with a termporal max filter. > > > > > > > > > > > > > > Since we are here... > > > Would you allow to have a configuration for > > > the util_est shifter: UTIL_EST_WEIGHT_SHIFT ? > > > > > > I've found other values than '2' better in some scenarios. That helps > > > to prevent a big task to 'down' migrate from a Big CPU (1024) to some > > > Mid CPU (~500-700 capacity) or even Little (~120-300). > > > > Larger values, I'm thinking you're after? Those would cause the new > > contribution to weight less, making the function more smooth, right? > > Yes, more smooth, because we only use the 'ewma' goodness for decaying > part (not the raising [1]). > > > > > What task characteristic is tied to this? That is, this seems trivial to > > modify per-task. > > In particular Speedometer test and the main browser task, which reaches > ~900util, but sometimes vanish and waits for other background tasks > to do something. In the meantime it can decay and wake-up on > Mid/Little (which can cause a penalty to score up to 5-10% vs. if > we pin the task to big CPUs). So, a longer util_est helps to avoid > at least very bad down migration to Littles... Warning, this is not a global win! We do want tasks in general to downmigrate when they sleep. Would be great to avoid biasing towards perf first by default to fix these special cases. As I mentioned in other reply, there's a perf/power/thermal impact of these decisions and it's not a global win. Some might want this to improve their scores, others might not want that and rather get the worse score but keep their power budget in check. And it will highly depend on the workload and the system. Which we can't test everyone of them :( We did give the power to userspace via uclamp which should make this problem fixable. And this is readily available. We can't basically know in the kernel when one way is better than the other without being told explicitly IMHO. If you try to boot with faster PELT HALFLIFE, would this give you the same perf/power trade-off? Thanks -- Qais Yousef > > [1] https://elixir.bootlin.com/linux/v6.5.1/source/kernel/sched/fair.c#L4442
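For context on what that shift controls: the ewma update at task dequeue is
effectively (simplified from util_est_update() in fair.c; with UTIL_EST_FASTUP
increases are applied directly and only decreases go through the filter):

	ewma += (enqueued - ewma) >> UTIL_EST_WEIGHT_SHIFT;

With the default shift of 2, each short activation drags the estimate 25% of
the way down towards the small util it just saw; a shift of 4 would only move
it 6.25% of the way, so the ~900 estimate of the browser task described above
survives several such activations before it starts to look like it fits a mid
CPU.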
On 09/08/23 14:33, Qais Yousef wrote: > On 09/08/23 12:25, Peter Zijlstra wrote: > > On Fri, Sep 08, 2023 at 01:17:25AM +0100, Qais Yousef wrote: > > > > > Just to be clear, my main issue here with the current hardcoded values of the > > > 'margins'. And the fact they go too fast is my main problem. > > > > So I stripped the whole margin thing from my reply because I didn't want > > to comment on that yet, but yes, I can see how those might be a problem, > > and you're changing them into something dynamic, not just removing them. > > The main difficulty is that if you try to apply those patches on their own, I'm > sure you'll notice a difference. So if we were to take this alone and put them > on linux-next; I expect a few regression reports for those who run with > schedutil. Any ST oriented workload will not be happy. But if we compensate to > reduce the regression, my problem will re-appear, just for a different reason. > So whack-a-mole. Sorry I just realized that the dynamic thing was about the margin, not the new knob. My answer above still holds to some extent. But yes, I meant to write that I'm removing magic hardcoded numbers from the margins. Cheers -- Qais Yousef
Hi Peter,

On 9/7/23 21:16, Peter Zijlstra wrote:
> On Thu, Sep 07, 2023 at 03:42:13PM +0100, Lukasz Luba wrote:
>
>>> What task characteristic is tied to this? That is, this seems trivial to
>>> modify per-task.
>>
>> In particular Speedometer test and the main browser task, which reaches
>> ~900util, but sometimes vanish and waits for other background tasks
>> to do something. In the meantime it can decay and wake-up on
>> Mid/Little (which can cause a penalty to score up to 5-10% vs. if
>> we pin the task to big CPUs). So, a longer util_est helps to avoid
>> at least very bad down migration to Littles...
>
> Do they do a few short activations (wakeup/sleeps) while waiting? That
> would indeed completely ruin things since the EWMA thing is activation
> based.
>
> I wonder if there's anything sane we can do here...

My apologies for the delay, I have tried to push the graphs for you.

The experiment is on pixel7*. It's running the browser on the phone
with the test 'Speedometer 2.0'. It's a web test (you can also run on
your phone) available here, no need to install anything:
https://browserbench.org/Speedometer2.0/

Here is the Jupyter notebook [1], with plots of the signals:
- top 20 tasks' (based on runtime) utilization
- Util EST signals for the top 20 tasks, with the longer decaying ewma
  filter (which is the 'red' plot called 'ewma')
- the main task (comm=CrRendererMain) Util, Util EST and task residency
  (which tries to stick to CPUs 6,7*)
- the test score was 144.6 (while with the fast decay ewma it is ~134), so
  staying on big CPUs helps the score in this case

(the plots are interactive, you can zoom in with the icon 'Box Zoom')
(e.g. you can zoom in on the task activation plot, which is also linked
with the 'Util EST' on top, for this main task)

You can see the util signal of that 'CrRendererMain' task and those
utilization drops in time, which I was referring to. When the util
drops below some threshold, the task might 'fit' into a smaller CPU,
which could be prevented automatically by maintaining the util est
for longer (but not for all tasks).

I do like your idea that Util EST might be per-task. I'm going to
check this, because that might help to get rid of the overutilized state,
which is probably there because small tasks are also 'bigger' for longer.

If this util est change has a chance to fly upstream, I could send an RFC
if you don't mind.

Regards,
Lukasz

*CPUs 6,7 - big (1024 capacity), CPUs 4,5 Mid (768 capacity), CPUs 0-3
Littles (~150 capacity)

[1]
https://nbviewer.org/github/lukaszluba-arm/lisa/blob/public_tests/p7_wa_speedometer2_small_size.ipynb
Hi Daniel,

On 9/8/23 13:51, Daniel Bristot de Oliveira wrote:
> On 9/7/23 15:45, Lukasz Luba wrote:
>>>>> RT literatur mostly methinks. Replacing WCET with a statistical model of
>>>>> sorts is not uncommon, the argument goes that not everybody will have
>>>>> their worst case at the same time and lows and highs can commonly cancel
>>>>> out and this way we can cram a little more on the system.
>>>>>
>>>>> Typically this is proposed in the context of soft-realtime systems.
>>>>
>>>> Thanks Peter, I will dive into some books...
>>>
>>> I would look at academic papers, not sure any of that ever made it to
>>> books, Daniel would know I suppose.
>>
>> Good hint, thanks!
>
> The key-words that came to my mind are:
>
> - mk-firm, where you accept m tasks will make their deadline
>   every k execution - like, because you run too long.
> - mixed criticality with pWCET (probabilistic execution time) or
>   average execution time + an sporadic tail execution time for
>   the low criticality part.
>
> mk-firm smells like 2005's.. mixed criticality as 2015's..present.
>
> You will probably find more papers than books. Read the papers
> as a source for inspiration... not necessarily as a definitive
> solution. They generally proposed too restrictive task models.
>
> -- Daniel

Thanks for describing this context! That would save me time and avoid maybe
sinking in these unknown waters. As you said, I might treat that as
inspiration, since I'm not fighting with a life-critical system, but a phone
which needs a 'nice user experience' (hopefully there are no people who
disagree) ;)

Regards,
Lukasz
Hi Lukasz, On Tue, 12 Sept 2023 at 13:51, Lukasz Luba <lukasz.luba@arm.com> wrote: > > Hi Peter, > > On 9/7/23 21:16, Peter Zijlstra wrote: > > On Thu, Sep 07, 2023 at 03:42:13PM +0100, Lukasz Luba wrote: > > > >>> What task characteristic is tied to this? That is, this seems trivial to > >>> modify per-task. > >> > >> In particular Speedometer test and the main browser task, which reaches > >> ~900util, but sometimes vanish and waits for other background tasks > >> to do something. In the meantime it can decay and wake-up on > >> Mid/Little (which can cause a penalty to score up to 5-10% vs. if > >> we pin the task to big CPUs). So, a longer util_est helps to avoid > >> at least very bad down migration to Littles... > > > > Do they do a few short activations (wakeup/sleeps) while waiting? That > > would indeed completely ruin things since the EWMA thing is activation > > based. > > > > I wonder if there's anything sane we can do here... > > My apologies for a delay, I have tried to push the graphs for you. > > The experiment is on pixel7*. It's running the browser on the phone > with the test 'Speedometer 2.0'. It's a web test (you can also run on > your phone) available here, no need to install anything: > https://browserbench.org/Speedometer2.0/ > > Here is the Jupiter notebook [1], with plots of the signals: > - top 20 tasks' (based on runtime) utilization > - Util EST signals for the top 20 tasks, with the longer decaying ewma > filter (which is the 'red' plot called 'ewma') > - the main task (comm=CrRendererMain) Util, Util EST and task residency > (which tires to stick to CPUs 6,7* ) > - the test score was 144.6 (while with fast decay ewma is ~134), so > staying at big cpus (helps the score in this case) > > (the plots are interactive, you can zoom in with the icon 'Box Zoom') > (e.g. you can zoom in the task activation plot which is also linked > with the 'Util EST' on top, for this main task) > > You can see the util signal of that 'CrRendererMain' task and those > utilization drops in time, which I was referring to. When the util > drops below some threshold, the task might 'fit' into smaller CPU, > which could be prevented automatically byt maintaining the util est > for longer (but not for all). I was looking at your nice chart and I wonder if you could also add the runnable _avg of the tasks ? My 1st impression is that the decrease happens when your task starts to share the CPU with some other tasks and this ends up with a decrease of its utilization because util_avg doesn't take into account the waiting time so typically task with an utilization of 1024, will see its utilization decrease because of other tasks running on the same cpu. This would explain the drop that you can see. I wonder if we should not take into account the runnable_avg when applying the ewm on util_est ? so the util_est will not decrease because of time sharing with other > > I do like your idea that Util EST might be per-task. I'm going to > check this, because that might help to get rid of the overutilized state > which is probably because small tasks are also 'bigger' for longer. > > If this util est have chance to fly upstream, I could send an RFC if > you don't mind. > > Regards, > Lukasz > > *CPUs 6,7 - big (1024 capacity), CPUs 4,5 Mid (768 capacity), CPUs 0-3 > Littles (~150 capacity) > > [1] > https://nbviewer.org/github/lukaszluba-arm/lisa/blob/public_tests/p7_wa_speedometer2_small_size.ipynb
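In code terms, a minimal sketch of that suggestion (untested; UTIL_EST_MARGIN
is just an arbitrary slack value here), placed in util_est_update() before the
ewma is decayed:

	/*
	 * If the task spent a good part of the activation waiting for the
	 * CPU (runnable but not running), util_avg under-reports its demand,
	 * so skip the ewma decay for this activation.
	 */
	if (enqueued + UTIL_EST_MARGIN < READ_ONCE(p->se.avg.runnable_avg))
		goto done;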
On Tue, Sep 12, 2023 at 12:51:52PM +0100, Lukasz Luba wrote: > You can see the util signal of that 'CrRendererMain' task and those > utilization drops in time, which I was referring to. When the util > drops below some threshold, the task might 'fit' into smaller CPU, > which could be prevented automatically byt maintaining the util est > for longer (but not for all). Right, so right at those util_est dips it has some small activations. It's like a poll loop or something instead of a full block waiting for things to happen. And yeah, that'll destroy util_est in a hurry :/ > I do like your idea that Util EST might be per-task. I'm going to > check this, because that might help to get rid of the overutilized state > which is probably because small tasks are also 'bigger' for longer. > > If this util est have chance to fly upstream, I could send an RFC if > you don't mind. The biggest stumbling block I see is the user interface; some generic QoS hints based thing that allows us to do random things -- like tune the above might do, dunno.
On 08/09/2023 16:07, Qais Yousef wrote: > On 09/08/23 09:40, Dietmar Eggemann wrote: >> On 08/09/2023 02:17, Qais Yousef wrote: >>> On 09/07/23 15:08, Peter Zijlstra wrote: >>>> On Mon, Aug 28, 2023 at 12:31:56AM +0100, Qais Yousef wrote: [...] >>> And what was a high end A78 is a mid core today. So if you look at today's >>> mobile world topology we really have a tiy+big+huge combination of cores. The >>> bigs are called mids, but they're very capable. Fits capacity forces migration >>> to the 'huge' cores too soon with that 80% margin. While the 80% might be too >>> small for the tiny ones as some workloads really struggle there if they hang on >>> for too long. It doesn't help that these systems ship with 4ms tick. Something >>> more to consider changing I guess. >> >> If this is the problem then you could simply make the margin (headroom) >> a function of cpu_capacity_orig? > > I don't see what you mean. instead of capacity_of() but keep the 80%? > > Again, I could be delusional and misunderstanding everything, but what I really > see fits_capacity() is about is misfit detection. But a task is not really > misfit until it actually has a util above the capacity of the CPU. Now due to > implementation details there can be a delay between the task crossing this > capacity and being able to move it. Which what I believe this headroom is > trying to achieve. > > I think we can better define this by tying this headroom to the worst case > scenario it takes to actually move this misfit task to the right CPU. If it can > continue to run without being impacted with this delay and crossing the > capacity of the CPU it is on, then we should not trigger misfit IMO. Instead of: fits_capacity(unsigned long util, unsigned long capacity) return approximate_util_avg(util, TICK_USEC) < capacity; just make 1280 in: #define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024) dependent on cpu's capacity_orig or the capacity diff to the next higher capacity_orig. Typical example today: {little-medium-big capacity_orig} = {128, 896, 1024} 896÷128 = 7 1024/896 = 1.14 to achieve higher margin on little and lower margin on medium. [...]
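If I read that suggestion right, it boils down to something like the below
(untested; next_capacity_orig() is a made-up helper returning the next higher
capacity_orig in the system, or the CPU's own capacity for the biggest CPUs,
and the cap on the margin is arbitrary):

	/*
	 * Derive the headroom from the gap to the next capacity level rather
	 * than a fixed 1.25: with {128, 896, 1024} the littles get a large
	 * (capped) margin, the mediums ~1.14 and the bigs none.
	 */
	static inline bool fits_capacity(unsigned long util, int cpu)
	{
		unsigned long cap = arch_scale_cpu_capacity(cpu);
		unsigned long margin;

		margin = min_t(unsigned long, 1536,
			       (next_capacity_orig(cpu) << 10) / cap);

		return util * margin < cap * 1024;
	}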
Hi Vincent, On 9/12/23 15:01, Vincent Guittot wrote: > Hi Lukasz, > > On Tue, 12 Sept 2023 at 13:51, Lukasz Luba <lukasz.luba@arm.com> wrote: >> >> Hi Peter, >> >> On 9/7/23 21:16, Peter Zijlstra wrote: >>> On Thu, Sep 07, 2023 at 03:42:13PM +0100, Lukasz Luba wrote: >>> >>>>> What task characteristic is tied to this? That is, this seems trivial to >>>>> modify per-task. >>>> >>>> In particular Speedometer test and the main browser task, which reaches >>>> ~900util, but sometimes vanish and waits for other background tasks >>>> to do something. In the meantime it can decay and wake-up on >>>> Mid/Little (which can cause a penalty to score up to 5-10% vs. if >>>> we pin the task to big CPUs). So, a longer util_est helps to avoid >>>> at least very bad down migration to Littles... >>> >>> Do they do a few short activations (wakeup/sleeps) while waiting? That >>> would indeed completely ruin things since the EWMA thing is activation >>> based. >>> >>> I wonder if there's anything sane we can do here... >> >> My apologies for a delay, I have tried to push the graphs for you. >> >> The experiment is on pixel7*. It's running the browser on the phone >> with the test 'Speedometer 2.0'. It's a web test (you can also run on >> your phone) available here, no need to install anything: >> https://browserbench.org/Speedometer2.0/ >> >> Here is the Jupiter notebook [1], with plots of the signals: >> - top 20 tasks' (based on runtime) utilization >> - Util EST signals for the top 20 tasks, with the longer decaying ewma >> filter (which is the 'red' plot called 'ewma') >> - the main task (comm=CrRendererMain) Util, Util EST and task residency >> (which tires to stick to CPUs 6,7* ) >> - the test score was 144.6 (while with fast decay ewma is ~134), so >> staying at big cpus (helps the score in this case) >> >> (the plots are interactive, you can zoom in with the icon 'Box Zoom') >> (e.g. you can zoom in the task activation plot which is also linked >> with the 'Util EST' on top, for this main task) >> >> You can see the util signal of that 'CrRendererMain' task and those >> utilization drops in time, which I was referring to. When the util >> drops below some threshold, the task might 'fit' into smaller CPU, >> which could be prevented automatically byt maintaining the util est >> for longer (but not for all). > > I was looking at your nice chart and I wonder if you could also add > the runnable _avg of the tasks ? Yes, I will try today or tomorrow to add such plots as well. > > My 1st impression is that the decrease happens when your task starts > to share the CPU with some other tasks and this ends up with a > decrease of its utilization because util_avg doesn't take into account > the waiting time so typically task with an utilization of 1024, will > see its utilization decrease because of other tasks running on the > same cpu. This would explain the drop that you can see. > > I wonder if we should not take into account the runnable_avg when > applying the ewm on util_est ? so the util_est will not decrease > because of time sharing with other Yes, that sounds a good idea. Let me provide those plots so we could go further with the analysis. I will try to capture if that happens to that particular task on CPU (if there are some others as well). Thanks for jumping in to the discussion! Lukasz
On 09/12/23 19:18, Dietmar Eggemann wrote: > On 08/09/2023 16:07, Qais Yousef wrote: > > On 09/08/23 09:40, Dietmar Eggemann wrote: > >> On 08/09/2023 02:17, Qais Yousef wrote: > >>> On 09/07/23 15:08, Peter Zijlstra wrote: > >>>> On Mon, Aug 28, 2023 at 12:31:56AM +0100, Qais Yousef wrote: > > [...] > > >>> And what was a high end A78 is a mid core today. So if you look at today's > >>> mobile world topology we really have a tiy+big+huge combination of cores. The > >>> bigs are called mids, but they're very capable. Fits capacity forces migration > >>> to the 'huge' cores too soon with that 80% margin. While the 80% might be too > >>> small for the tiny ones as some workloads really struggle there if they hang on > >>> for too long. It doesn't help that these systems ship with 4ms tick. Something > >>> more to consider changing I guess. > >> > >> If this is the problem then you could simply make the margin (headroom) > >> a function of cpu_capacity_orig? > > > > I don't see what you mean. instead of capacity_of() but keep the 80%? > > > > Again, I could be delusional and misunderstanding everything, but what I really > > see fits_capacity() is about is misfit detection. But a task is not really > > misfit until it actually has a util above the capacity of the CPU. Now due to > > implementation details there can be a delay between the task crossing this > > capacity and being able to move it. Which what I believe this headroom is > > trying to achieve. > > > > I think we can better define this by tying this headroom to the worst case > > scenario it takes to actually move this misfit task to the right CPU. If it can > > continue to run without being impacted with this delay and crossing the > > capacity of the CPU it is on, then we should not trigger misfit IMO. > > > Instead of: > > fits_capacity(unsigned long util, unsigned long capacity) > > return approximate_util_avg(util, TICK_USEC) < capacity; > > just make 1280 in: > > #define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024) > > dependent on cpu's capacity_orig or the capacity diff to the next higher > capacity_orig. > > Typical example today: {little-medium-big capacity_orig} = {128, 896, 1024} > > 896÷128 = 7 > > 1024/896 = 1.14 > > to achieve higher margin on little and lower margin on medium. I am not keen on this personally. I think these numbers are random to me and why they help (or not help) is not clear to me at least. I do believe that the only reason why we want to move before a task util crosses the capacity of the CPU is tied down to the misfit load balance to be able to move the task. Because until the task crosses the capacity, it is getting its computational demand per our PELT representation. But since load balance is not an immediate action (especially on our platforms where it is 4ms, something I hope we can change); we need to preemptively exclude the CPU as a misfit when we know the task will get 'stuck' on this CPU and not get its computational demand (as per our representation of course). I think this removes all guess work and provides a very meaningful decision making process that I think will scale transparently so we utilize our resources the best we can. We can probably optimize the code to avoid the call to approximate_util_avg() if this is a problem. Why do you think the ratio of cpu capacities gives more meaningful method to judge misfit? Thanks! -- Qais Yousef
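For a feel of the numbers: whatever the exact implementation, a helper like
approximate_util_avg(util, TICK_USEC) is answering "where will util_avg be if
the task keeps running for one more tick", which with the standard 32ms
halflife is roughly (ignoring the 1024us segment granularity):

	util(now + delta) ~= 1024 - (1024 - util) * 2^(-delta_ms/32)

Over a 4ms tick that is about 8% of the remaining headroom, e.g. util 600
grows to ~635 and util 900 to ~910. That growth, rather than a flat 25% of the
current util, is what the proposed fits_capacity() leaves room for.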