Message ID | 20240728184551.42133-1-qyousef@layalina.io (mailing list archive) |
---|---|
State | Handled Elsewhere, archived |
Series | [v7] sched: Consolidate cpufreq updates |
On 28/07/2024 7:45 pm, Qais Yousef wrote: > Improve the interaction with cpufreq governors by making the > cpufreq_update_util() calls more intentional. [snip] > We also ensure to ignore cpufreq udpates for sugov workers at context Nit: s/udpates/updates/ > switch if it was prev task. [snip] > > +static __always_inline void > +__update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev) > +{ > +#ifdef CONFIG_CPU_FREQ > + if (prev && prev->dl.flags & SCHED_FLAG_SUGOV) { > + /* Sugov just did an update, don't be too aggressive */ > + return; > + } > + > + /* > + * RT and DL should always send a freq update. But we can do some > + * simple checks to avoid it when we know it's not necessary. > + * > + * iowait_boost will always trigger a freq update too. > + * > + * Fair tasks will only trigger an update if the root cfs_rq has > + * decayed. > + * > + * Everything else should do nothing. > + */ > + switch (current->policy) { > + case SCHED_NORMAL: > + case SCHED_BATCH: > + case SCHED_IDLE: > + if (unlikely(current->in_iowait)) { > + cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE); > + return; > + } > + > +#ifdef CONFIG_SMP > + /* > + * Send an update if we switched from RT or DL as they tend to > + * boost the CPU and we are likely able to reduce the freq now. > + */ > + rq->cfs.decayed |= prev && (rt_policy(prev->policy) || dl_policy(prev->policy)); > + > + if (unlikely(rq->cfs.decayed)) { > + rq->cfs.decayed = false; > + cpufreq_update_util(rq, 0); > + return; > + } > +#else > + cpufreq_update_util(rq, 0); > +#endif > + return; /* ! */ > + case SCHED_FIFO: > + case SCHED_RR: > + if (prev && rt_policy(prev->policy)) { > +#ifdef CONFIG_UCLAMP_TASK > + unsigned long curr_uclamp_min = uclamp_eff_value(current, UCLAMP_MIN); > + unsigned long prev_uclamp_min = uclamp_eff_value(prev, UCLAMP_MIN); > + > + if (curr_uclamp_min == prev_uclamp_min) > +#endif > + return; > + } > +#ifdef CONFIG_SMP > + /* Stopper task masquerades as RT */ > + if (unlikely(current->sched_class == &stop_sched_class)) > + return; > +#endif > + cpufreq_update_util(rq, SCHED_CPUFREQ_FORCE_UPDATE); > + return; /* ! */ > + case SCHED_DEADLINE: > + /* > + * This is handled at enqueue to avoid breaking DL bandwidth > + * rules when multiple DL tasks are running on the same CPU. > + * Deferring till context switch here could mean the bandwidth > + * calculations would be broken to ensure all the DL tasks meet > + * their deadlines. > + */ > + return; /* ! */ > + default: > + return; /* ! */ > + } Nit: would it be more conventional to replace marked `return` statements above with `break`s? > +#endif > +} > + > +/* > + * Call when currently running task had an attribute change that requires > + * an immediate cpufreq update. > + */ > +void update_cpufreq_current(struct rq *rq) > +{ > + __update_cpufreq_ctx_switch(rq, NULL); > +} > + [snip]
On 07/29/24 17:01, Metin Kaya wrote: > On 28/07/2024 7:45 pm, Qais Yousef wrote: > > Improve the interaction with cpufreq governors by making the > > cpufreq_update_util() calls more intentional. > > [snip] > > > We also ensure to ignore cpufreq udpates for sugov workers at context > > Nit: s/udpates/updates/ > > > switch if it was prev task. > > [snip] > > > +static __always_inline void > > +__update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev) > > +{ > > +#ifdef CONFIG_CPU_FREQ > > + if (prev && prev->dl.flags & SCHED_FLAG_SUGOV) { > > + /* Sugov just did an update, don't be too aggressive */ > > + return; > > + } > > + > > + /* > > + * RT and DL should always send a freq update. But we can do some > > + * simple checks to avoid it when we know it's not necessary. > > + * > > + * iowait_boost will always trigger a freq update too. > > + * > > + * Fair tasks will only trigger an update if the root cfs_rq has > > + * decayed. > > + * > > + * Everything else should do nothing. > > + */ > > + switch (current->policy) { > > + case SCHED_NORMAL: > > + case SCHED_BATCH: > > + case SCHED_IDLE: > > + if (unlikely(current->in_iowait)) { > > + cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE); > > + return; > > + } > > + > > +#ifdef CONFIG_SMP > > + /* > > + * Send an update if we switched from RT or DL as they tend to > > + * boost the CPU and we are likely able to reduce the freq now. > > + */ > > + rq->cfs.decayed |= prev && (rt_policy(prev->policy) || dl_policy(prev->policy)); > > + > > + if (unlikely(rq->cfs.decayed)) { > > + rq->cfs.decayed = false; > > + cpufreq_update_util(rq, 0); > > + return; > > + } > > +#else > > + cpufreq_update_util(rq, 0); > > +#endif > > + return; /* ! */ > > + case SCHED_FIFO: > > + case SCHED_RR: > > + if (prev && rt_policy(prev->policy)) { > > +#ifdef CONFIG_UCLAMP_TASK > > + unsigned long curr_uclamp_min = uclamp_eff_value(current, UCLAMP_MIN); > > + unsigned long prev_uclamp_min = uclamp_eff_value(prev, UCLAMP_MIN); > > + > > + if (curr_uclamp_min == prev_uclamp_min) > > +#endif > > + return; > > + } > > +#ifdef CONFIG_SMP > > + /* Stopper task masquerades as RT */ > > + if (unlikely(current->sched_class == &stop_sched_class)) > > + return; > > +#endif > > + cpufreq_update_util(rq, SCHED_CPUFREQ_FORCE_UPDATE); > > + return; /* ! */ > > + case SCHED_DEADLINE: > > + /* > > + * This is handled at enqueue to avoid breaking DL bandwidth > > + * rules when multiple DL tasks are running on the same CPU. > > + * Deferring till context switch here could mean the bandwidth > > + * calculations would be broken to ensure all the DL tasks meet > > + * their deadlines. > > + */ > > + return; /* ! */ > > + default: > > + return; /* ! */ > > + } > > Nit: would it be more conventional to replace marked `return` statements > above with `break`s? Thanks Metin. I think return and break are both fine here. > > > +#endif > > +} > > + > > +/* > > + * Call when currently running task had an attribute change that requires > > + * an immediate cpufreq update. > > + */ > > +void update_cpufreq_current(struct rq *rq) > > +{ > > + __update_cpufreq_ctx_switch(rq, NULL); > > +} > > + > > [snip] >
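[Editorial illustration] Aside from the return/break nit, the shape of the dispatch is easier to see stripped of the kernel plumbing. Below is a minimal userspace C model of the per-policy decision __update_cpufreq_ctx_switch() makes at context switch; the flag values mirror the quoted hunk, but the helper names and the simplified conditions (no stopper-class or SMP special cases) are illustrative only, not the kernel implementation.

#include <stdbool.h>
#include <stdio.h>

/* Flag bits mirroring the patch (values as in the quoted hunk). */
#define SCHED_CPUFREQ_IOWAIT       (1U << 0)
#define SCHED_CPUFREQ_FORCE_UPDATE (1U << 1)

enum policy { NORMAL, BATCH, IDLE, FIFO, RR, DEADLINE };

struct task {
	enum policy policy;
	bool in_iowait;
	unsigned long uclamp_min;
};

/*
 * Which cpufreq flags, if any, a switch to @curr should send, given the
 * previous task and whether the root cfs_rq has decayed.  Returns true
 * when an update is warranted and fills in *flags.
 */
static bool ctx_switch_update(const struct task *curr, const struct task *prev,
			      bool cfs_decayed, unsigned int *flags)
{
	switch (curr->policy) {
	case NORMAL:
	case BATCH:
	case IDLE:
		if (curr->in_iowait) {
			*flags = SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE;
			return true;
		}
		/*
		 * Fair tasks update when the root cfs_rq decayed, or when we
		 * just left a (typically boosted) RT/DL section.
		 */
		if (cfs_decayed || (prev && (prev->policy == FIFO ||
					     prev->policy == RR ||
					     prev->policy == DEADLINE))) {
			*flags = 0;
			return true;
		}
		return false;
	case FIFO:
	case RR:
		/* RT -> RT with the same effective uclamp_min: nothing new. */
		if (prev && (prev->policy == FIFO || prev->policy == RR) &&
		    prev->uclamp_min == curr->uclamp_min)
			return false;
		*flags = SCHED_CPUFREQ_FORCE_UPDATE;
		return true;
	case DEADLINE:
		/* Handled at enqueue to respect DL bandwidth, not here. */
		return false;
	}
	return false;
}

int main(void)
{
	struct task rt   = { .policy = FIFO, .uclamp_min = 1024 };
	struct task fair = { .policy = NORMAL };
	unsigned int flags = 0;

	printf("fair after rt: %d\n", ctx_switch_update(&fair, &rt, false, &flags));
	printf("rt after rt:   %d\n", ctx_switch_update(&rt, &rt, false, &flags));
	return 0;
}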
On 7/28/24 19:45, Qais Yousef wrote: > Improve the interaction with cpufreq governors by making the > cpufreq_update_util() calls more intentional. > > At the moment we send them when load is updated for CFS, bandwidth for > DL and at enqueue/dequeue for RT. But this can lead to too many updates > sent in a short period of time and potentially be ignored at a critical > moment due to the rate_limit_us in schedutil. > > For example, simultaneous task enqueue on the CPU where 2nd task is > bigger and requires higher freq. The trigger to cpufreq_update_util() by > the first task will lead to dropping the 2nd request until tick. Or > another CPU in the same policy triggers a freq update shortly after. > > Updates at enqueue for RT are not strictly required. Though they do help > to reduce the delay for switching the frequency and the potential > observation of lower frequency during this delay. But current logic > doesn't intentionally (at least to my understanding) try to speed up the > request. > > To help reduce the amount of cpufreq updates and make them more > purposeful, consolidate them into these locations: > > 1. context_switch() > 2. task_tick_fair() > 3. sched_balance_update_blocked_averages() > 4. on sched_setscheduler() syscall that changes policy or uclamp values > 5. on check_preempt_wakeup_fair() if wakeup preemption failed > 6. on __add_running_bw() to guarantee DL bandwidth requirements. > > The update at context switch should help guarantee that RT get the right > frequency straightaway when they're RUNNING. As mentioned though the > update will happen slightly after enqueue_task(); though in an ideal > world these tasks should be RUNNING ASAP and this additional delay > should be negligible. For fair tasks we need to make sure we send > a single update for every decay for the root cfs_rq. Any changes to the > rq will be deferred until the next task is ready to run, or we hit TICK. > But we are guaranteed the task is running at a level that meets its > requirements after enqueue. > > To guarantee RT and DL tasks updates are never missed, we add a new > SCHED_CPUFREQ_FORCE_UPDATE to ignore the rate_limit_us. If we are > already running at the right freq, the governor will end up doing > nothing, but we eliminate the risk of the task ending up accidentally > running at the wrong freq due to rate_limit_us. > > Similarly for iowait boost, we ignore rate limits. We also handle a case > of a boost reset prematurely by adding a guard in sugov_iowait_apply() > to reduce the boost after 1ms which seems iowait boost mechanism relied > on rate_limit_us and cfs_rq.decayed preventing any updates to happen > soon after iowait boost. > > The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit > time stamps otherwise we can end up delaying updates for normal > requests. Hi Qais, the idea of SCHED_CPUFREQ_FORCE_UPDATE and the possiblity of spamming freq updates still bothered me so let me share my thoughts even though it might be niche enough for us not to care. 1. On fast_switch systems, assuming they are fine with handling the actual updates, we have a bit more work on each context_switch() and some synchronisation, too. That should be fine, if anything there's some performance regression in a couple of niche cases. 2. On !fast_switch systems this gets more interesting IMO. So we have a sugov DEADLINE task wakeup for every (in a freq-diff resulting) update request. 
This task will preempt whatever and currently will pretty much always be running on the CPU it ran last on (so first CPU of the PD). The weirdest case I can think of right now is two FAIR iowait tasks on e.g. CPU1 keep waking up the DEADLINE task on CPU0 (same PD) regardless of what is running there. Potentially that means two fair tasks on one CPU CPU-starving an RT task on another CPU, because it keeps getting preempted by the DEADLINE sugov worker. For this to actually happen we need to ensure the tasks context-switching actually results in a different requested frequency every time, which is a bit unlikely without UCLAMP_MAX, let's say task A has 512, task B 1024, task C (RT on CPU1 should have uclamp_min<=512 then too otherwise frequency may be dictated by the RT task anyway.) (Note the entire thing also works with Tasks A & B being lower-prio RT too, instead of FAIR and iowait.) Note that due to the nature of SCHED_DEADLINE and the sugov task having 10s period and 1s runtime this behavior is limited to 1s every 10s. The remaining 9s (replenishment time) we won't see any cpufreq updates for that PD at all though. To reproduce, I have [4,5] being one PD: fio --minimal --time_based --name=cpu5_1024uclamp --filename=/dev/nullb0 --runtime=10 --rw=randread --bs=4k--direct=1 --cpus_allowed=5 uclampset -M 512 fio --minimal --time_based --name=cpu5_512uclamp --filename=/dev/nullb0 --runtime=10 --rw=randread --bs=4k--direct=1 --cpus_allowed=5 and then your RT task on CPU4. Something like this would mitigate that, I think it makes sense even without your patch to get a more predictable idle pattern, just maybe not exactly this patch of course. -->8-- diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 64f614b3db20..c186f8f999fe 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -567,6 +567,8 @@ static void sugov_irq_work(struct irq_work *irq_work) sg_policy = container_of(irq_work, struct sugov_policy, irq_work); + /* Try to wake the task here to not preempt or wake up another CPU. */ + sg_policy->worker.task->wake_cpu = smp_processor_id(); kthread_queue_work(&sg_policy->worker, &sg_policy->work); }
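[Editorial illustration] Christian's precondition, that each context switch has to resolve to a different requested frequency, is easy to see with the mapping schedutil roughly uses, next_freq = 1.25 * max_freq * util / max (the clamping to policy limits and table resolution the governor also does are skipped). A standalone sketch, with an assumed 2 GHz policy maximum:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024ULL

/*
 * Approximate schedutil mapping: next_freq = C * max_freq * util / max,
 * with C = 1.25 written as 5/4 to stay in integer arithmetic.
 */
static unsigned long long next_freq_khz(unsigned long long util,
					unsigned long long max_khz)
{
	return (5 * max_khz * util) / (4 * SCHED_CAPACITY_SCALE);
}

int main(void)
{
	unsigned long long max_khz = 2000000;	/* illustrative 2 GHz policy max */

	/*
	 * Task A clamped to 512 and task B at 1024: every A<->B context
	 * switch maps to a different target, so each one can raise a
	 * sugov worker wakeup on the !fast_switch path.
	 */
	printf("util  512 -> %llu kHz\n", next_freq_khz(512, max_khz));
	printf("util 1024 -> %llu kHz\n", next_freq_khz(1024, max_khz));
	return 0;
}

With identical clamps the two tasks would resolve to the same target and the governor would have nothing to do, which is why the scenario needs the UCLAMP_MAX difference in the first place.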
On 08/05/24 16:35, Christian Loehle wrote: > On 7/28/24 19:45, Qais Yousef wrote: > > Improve the interaction with cpufreq governors by making the > > cpufreq_update_util() calls more intentional. > > > > At the moment we send them when load is updated for CFS, bandwidth for > > DL and at enqueue/dequeue for RT. But this can lead to too many updates > > sent in a short period of time and potentially be ignored at a critical > > moment due to the rate_limit_us in schedutil. > > > > For example, simultaneous task enqueue on the CPU where 2nd task is > > bigger and requires higher freq. The trigger to cpufreq_update_util() by > > the first task will lead to dropping the 2nd request until tick. Or > > another CPU in the same policy triggers a freq update shortly after. > > > > Updates at enqueue for RT are not strictly required. Though they do help > > to reduce the delay for switching the frequency and the potential > > observation of lower frequency during this delay. But current logic > > doesn't intentionally (at least to my understanding) try to speed up the > > request. > > > > To help reduce the amount of cpufreq updates and make them more > > purposeful, consolidate them into these locations: > > > > 1. context_switch() > > 2. task_tick_fair() > > 3. sched_balance_update_blocked_averages() > > 4. on sched_setscheduler() syscall that changes policy or uclamp values > > 5. on check_preempt_wakeup_fair() if wakeup preemption failed > > 6. on __add_running_bw() to guarantee DL bandwidth requirements. > > > > The update at context switch should help guarantee that RT get the right > > frequency straightaway when they're RUNNING. As mentioned though the > > update will happen slightly after enqueue_task(); though in an ideal > > world these tasks should be RUNNING ASAP and this additional delay > > should be negligible. For fair tasks we need to make sure we send > > a single update for every decay for the root cfs_rq. Any changes to the > > rq will be deferred until the next task is ready to run, or we hit TICK. > > But we are guaranteed the task is running at a level that meets its > > requirements after enqueue. > > > > To guarantee RT and DL tasks updates are never missed, we add a new > > SCHED_CPUFREQ_FORCE_UPDATE to ignore the rate_limit_us. If we are > > already running at the right freq, the governor will end up doing > > nothing, but we eliminate the risk of the task ending up accidentally > > running at the wrong freq due to rate_limit_us. > > > > Similarly for iowait boost, we ignore rate limits. We also handle a case > > of a boost reset prematurely by adding a guard in sugov_iowait_apply() > > to reduce the boost after 1ms which seems iowait boost mechanism relied > > on rate_limit_us and cfs_rq.decayed preventing any updates to happen > > soon after iowait boost. > > > > The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit > > time stamps otherwise we can end up delaying updates for normal > > requests. > > Hi Qais, > the idea of SCHED_CPUFREQ_FORCE_UPDATE and the possiblity of spamming > freq updates still bothered me so let me share my thoughts even though > it might be niche enough for us not to care. It is a valid concern. I thought a lot about dropping it for iowait, but I really had consistency of behavior in mind and was rooting for this logic to be revamped soon to hopefully drop this altogether. I don't mind dropping it. 
But I am keen on pushing it forward and reconsider if someone reports it to be an actual problem rather than being held back by hypothesis that we don't think are likely to be true. For RT and DL though, I think being rate limited is really bad. It breaks a promise we give to those tasks. I expect usage of RT on systems that cares about power management to be minimal and niche. Serious realtime folks would most certainly disable a lot of power management features as it is a major source of latencies. > > 1. On fast_switch systems, assuming they are fine with handling the > actual updates, we have a bit more work on each context_switch() and > some synchronisation, too. That should be fine, if anything there's > some performance regression in a couple of niche cases. > > 2. On !fast_switch systems this gets more interesting IMO. So we have > a sugov DEADLINE task wakeup for every (in a freq-diff resulting) > update request. This task will preempt whatever and currently will > pretty much always be running on the CPU it ran last on (so first CPU > of the PD). > > The weirdest case I can think of right now is two FAIR iowait tasks on > e.g. CPU1 keep waking up the DEADLINE task on CPU0 (same PD) regardless > of what is running there. This setup is very elaborate. The CPU frequency would be completely saturated within few wake ups given how aggressive current iowait logic is. If sugov continues to be triggered after this, the bug is on why it triggers as there's no frequency to change. > Potentially that means two fair tasks on one CPU CPU-starving an RT > task on another CPU, because it keeps getting preempted by the DEADLINE > sugov worker. Deadline will look like a higher priority RT task. I'd expect the RT task to be pushed to where the fair tasks are running and the fair tasks to end up waiting or saved by load balance to move to another vacant CPU. > For this to actually happen we need to ensure the tasks > context-switching actually results in a different requested frequency > every time, which is a bit unlikely without UCLAMP_MAX, let's say task > A has 512, task B 1024, task C (RT on CPU1 should have uclamp_min<=512 > then too otherwise frequency may be dictated by the RT task anyway.) > (Note the entire thing also works with Tasks A & B being lower-prio RT > too, instead of FAIR and iowait.) > > Note that due to the nature of SCHED_DEADLINE and the sugov task having > 10s period and 1s runtime this behavior is limited to 1s every 10s. > The remaining 9s (replenishment time) we won't see any cpufreq updates > for that PD at all though. > > To reproduce, I have [4,5] being one PD: > > fio --minimal --time_based --name=cpu5_1024uclamp --filename=/dev/nullb0 --runtime=10 --rw=randread --bs=4k--direct=1 --cpus_allowed=5 > uclampset -M 512 fio --minimal --time_based --name=cpu5_512uclamp --filename=/dev/nullb0 --runtime=10 --rw=randread --bs=4k--direct=1 --cpus_allowed=5 > and then your RT task on CPU4. Is the RT task affined to CPU4? If yes, then this is the problem. I'd say if someone uses affinity because they think they know better but end up shooting themselves in the foot, then that's their problem :-) Affinities are evil. I think we should ban them (/me runs) If RT is not affined to CPU4 only, then I'm surprised it doesn't get pushed immediately if a deadline task woke up on the CPU it is running on. The only reason an RT task would wait behind a deadline task is because all other CPUs are occupied by a deadline task or another higher priority RT task. 
In your case if you're not using affinities (and assuming it's a 2 CPUs sytem), then I'd expect the fair tasks to wait behind the RT task and deadline. If it is a UP system, then when RT wakes up, sugov will preempt it to change freq, and then RT will go back to running until it gives up the CPU. And the fair tasks will wait behind when RT stopped running. > > Something like this would mitigate that, I think it makes sense even > without your patch to get a more predictable idle pattern, just maybe > not exactly this patch of course. > > -->8-- > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > index 64f614b3db20..c186f8f999fe 100644 > --- a/kernel/sched/cpufreq_schedutil.c > +++ b/kernel/sched/cpufreq_schedutil.c > @@ -567,6 +567,8 @@ static void sugov_irq_work(struct irq_work *irq_work) > > sg_policy = container_of(irq_work, struct sugov_policy, irq_work); > > + /* Try to wake the task here to not preempt or wake up another CPU. */ > + sg_policy->worker.task->wake_cpu = smp_processor_id(); This is good and bad. Good because this CPU was already interrupted, so let it finish the work. But bad, because something else important might be there (another RT/deadline task) and better let the proper wakeup logic pick a more suitable CPU. > kthread_queue_work(&sg_policy->worker, &sg_policy->work); > } > > > >
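[Editorial illustration] Qais's point that iowait boost saturates the frequency "within few wake ups" follows from the doubling rule in the sugov_iowait_boost() hunk quoted later in the thread: the boost starts at IOWAIT_BOOST_MIN (assumed to be SCHED_CAPACITY_SCALE/8, i.e. 128, as in mainline) and doubles on every eligible request until it hits SCHED_CAPACITY_SCALE. A standalone model of just that arithmetic (the pending/reset gating of the real code is left out):

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024U
#define IOWAIT_BOOST_MIN     (SCHED_CAPACITY_SCALE / 8)	/* assumed: 128 */

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

int main(void)
{
	unsigned int boost = 0;
	int wakeup;

	/* Each iowait wakeup doubles the boost until it hits full scale. */
	for (wakeup = 1; wakeup <= 6; wakeup++) {
		boost = boost ? min_u(boost << 1, SCHED_CAPACITY_SCALE)
			      : IOWAIT_BOOST_MIN;
		printf("wakeup %d: iowait_boost = %u\n", wakeup, boost);
	}
	return 0;
}

So full boost is reached by the fourth back-to-back iowait wakeup; past that point further requests cannot push the requested frequency any higher.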
On Mon, 5 Aug 2024 at 17:35, Christian Loehle <christian.loehle@arm.com> wrote: > > On 7/28/24 19:45, Qais Yousef wrote: > > Improve the interaction with cpufreq governors by making the > > cpufreq_update_util() calls more intentional. > > > > At the moment we send them when load is updated for CFS, bandwidth for > > DL and at enqueue/dequeue for RT. But this can lead to too many updates > > sent in a short period of time and potentially be ignored at a critical > > moment due to the rate_limit_us in schedutil. > > > > For example, simultaneous task enqueue on the CPU where 2nd task is > > bigger and requires higher freq. The trigger to cpufreq_update_util() by > > the first task will lead to dropping the 2nd request until tick. Or > > another CPU in the same policy triggers a freq update shortly after. > > > > Updates at enqueue for RT are not strictly required. Though they do help > > to reduce the delay for switching the frequency and the potential > > observation of lower frequency during this delay. But current logic > > doesn't intentionally (at least to my understanding) try to speed up the > > request. > > > > To help reduce the amount of cpufreq updates and make them more > > purposeful, consolidate them into these locations: > > > > 1. context_switch() > > 2. task_tick_fair() > > 3. sched_balance_update_blocked_averages() > > 4. on sched_setscheduler() syscall that changes policy or uclamp values > > 5. on check_preempt_wakeup_fair() if wakeup preemption failed > > 6. on __add_running_bw() to guarantee DL bandwidth requirements. > > > > The update at context switch should help guarantee that RT get the right > > frequency straightaway when they're RUNNING. As mentioned though the > > update will happen slightly after enqueue_task(); though in an ideal > > world these tasks should be RUNNING ASAP and this additional delay > > should be negligible. For fair tasks we need to make sure we send > > a single update for every decay for the root cfs_rq. Any changes to the > > rq will be deferred until the next task is ready to run, or we hit TICK. > > But we are guaranteed the task is running at a level that meets its > > requirements after enqueue. > > > > To guarantee RT and DL tasks updates are never missed, we add a new > > SCHED_CPUFREQ_FORCE_UPDATE to ignore the rate_limit_us. If we are > > already running at the right freq, the governor will end up doing > > nothing, but we eliminate the risk of the task ending up accidentally > > running at the wrong freq due to rate_limit_us. > > > > Similarly for iowait boost, we ignore rate limits. We also handle a case > > of a boost reset prematurely by adding a guard in sugov_iowait_apply() > > to reduce the boost after 1ms which seems iowait boost mechanism relied > > on rate_limit_us and cfs_rq.decayed preventing any updates to happen > > soon after iowait boost. > > > > The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit > > time stamps otherwise we can end up delaying updates for normal > > requests. > > Hi Qais, > the idea of SCHED_CPUFREQ_FORCE_UPDATE and the possiblity of spamming > freq updates still bothered me so let me share my thoughts even though > it might be niche enough for us not to care. > > 1. On fast_switch systems, assuming they are fine with handling the > actual updates, we have a bit more work on each context_switch() and > some synchronisation, too. That should be fine, if anything there's > some performance regression in a couple of niche cases. > > 2. 
On !fast_switch systems this gets more interesting IMO. So we have > a sugov DEADLINE task wakeup for every (in a freq-diff resulting) > update request. This task will preempt whatever and currently will > pretty much always be running on the CPU it ran last on (so first CPU > of the PD). The !fast_switch is a bit of concern for me too but not for the same reason and maybe the opposite of yours IIUC your proposal below: With fast_switch we have the following sequence: sched_switch() to task A cpufreq_driver_fast_switch -> write new freq target run task A This is pretty straight forward but we have the following sequence with !fast_switch sched_switch() to task A queue_irq_work -> raise an IPI on local CPU Handle IPI -> wakeup and queue sugov dl worker on local CPU (always with 1 CPU per PD) sched_switch() to sugov dl task __cpufreq_driver_target() which can possibly block on a lock sched_switch() to task A run task A We can possibly have 2 context switch and one IPi for each "normal" context switch which is not really optimal > > The weirdest case I can think of right now is two FAIR iowait tasks on > e.g. CPU1 keep waking up the DEADLINE task on CPU0 (same PD) regardless > of what is running there. > Potentially that means two fair tasks on one CPU CPU-starving an RT > task on another CPU, because it keeps getting preempted by the DEADLINE > sugov worker. > For this to actually happen we need to ensure the tasks > context-switching actually results in a different requested frequency > every time, which is a bit unlikely without UCLAMP_MAX, let's say task > A has 512, task B 1024, task C (RT on CPU1 should have uclamp_min<=512 > then too otherwise frequency may be dictated by the RT task anyway.) > (Note the entire thing also works with Tasks A & B being lower-prio RT > too, instead of FAIR and iowait.) > > Note that due to the nature of SCHED_DEADLINE and the sugov task having > 10s period and 1s runtime this behavior is limited to 1s every 10s. > The remaining 9s (replenishment time) we won't see any cpufreq updates > for that PD at all though. > > To reproduce, I have [4,5] being one PD: > > fio --minimal --time_based --name=cpu5_1024uclamp --filename=/dev/nullb0 --runtime=10 --rw=randread --bs=4k--direct=1 --cpus_allowed=5 > uclampset -M 512 fio --minimal --time_based --name=cpu5_512uclamp --filename=/dev/nullb0 --runtime=10 --rw=randread --bs=4k--direct=1 --cpus_allowed=5 > and then your RT task on CPU4. > > Something like this would mitigate that, I think it makes sense even > without your patch to get a more predictable idle pattern, just maybe > not exactly this patch of course. > > -->8-- > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > index 64f614b3db20..c186f8f999fe 100644 > --- a/kernel/sched/cpufreq_schedutil.c > +++ b/kernel/sched/cpufreq_schedutil.c > @@ -567,6 +567,8 @@ static void sugov_irq_work(struct irq_work *irq_work) > > sg_policy = container_of(irq_work, struct sugov_policy, irq_work); > > + /* Try to wake the task here to not preempt or wake up another CPU. */ > + sg_policy->worker.task->wake_cpu = smp_processor_id(); > kthread_queue_work(&sg_policy->worker, &sg_policy->work); > } > > > >
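[Editorial illustration] Vincent's two sequences can be mocked up in plain userspace to show where the extra work comes from; a pthread condition variable stands in for the IPI plus sugov kthread wakeup, and none of this is driver code:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static unsigned int requested_khz;	/* request handed to the worker */
static bool work_pending;

/* Stands in for cpufreq_driver_fast_switch() / __cpufreq_driver_target(). */
static void write_freq(unsigned int khz)
{
	printf("hardware now at %u kHz\n", khz);
}

/* fast_switch path: the new frequency is written from the caller's context. */
static void update_fast(unsigned int khz)
{
	write_freq(khz);
}

/*
 * !fast_switch path: queue the request and wake the worker, modelling the
 * IPI plus sugov kthread wakeup and the extra context switch it implies.
 */
static void update_slow(unsigned int khz)
{
	pthread_mutex_lock(&lock);
	requested_khz = khz;
	work_pending = true;
	pthread_cond_signal(&cond);
	pthread_mutex_unlock(&lock);
}

static void *sugov_worker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (!work_pending)
		pthread_cond_wait(&cond, &lock);
	work_pending = false;
	write_freq(requested_khz);
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t worker;

	update_fast(1800000);		/* done inline, no extra wakeup */

	pthread_create(&worker, NULL, sugov_worker, NULL);
	update_slow(2400000);		/* needs a wakeup and a switch */
	pthread_join(worker, NULL);
	return 0;
}

The point of the model is only that the slow path cannot finish in the caller's context: it needs a second runnable entity, which is exactly the extra context switch (and, in the kernel, the IPI) Vincent counts per "normal" context switch.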
On Tue, 13 Aug 2024 at 10:25, Vincent Guittot <vincent.guittot@linaro.org> wrote: > > On Mon, 5 Aug 2024 at 17:35, Christian Loehle <christian.loehle@arm.com> wrote: > > > > On 7/28/24 19:45, Qais Yousef wrote: > > > Improve the interaction with cpufreq governors by making the > > > cpufreq_update_util() calls more intentional. > > > > > > At the moment we send them when load is updated for CFS, bandwidth for > > > DL and at enqueue/dequeue for RT. But this can lead to too many updates > > > sent in a short period of time and potentially be ignored at a critical > > > moment due to the rate_limit_us in schedutil. > > > > > > For example, simultaneous task enqueue on the CPU where 2nd task is > > > bigger and requires higher freq. The trigger to cpufreq_update_util() by > > > the first task will lead to dropping the 2nd request until tick. Or > > > another CPU in the same policy triggers a freq update shortly after. > > > > > > Updates at enqueue for RT are not strictly required. Though they do help > > > to reduce the delay for switching the frequency and the potential > > > observation of lower frequency during this delay. But current logic > > > doesn't intentionally (at least to my understanding) try to speed up the > > > request. > > > > > > To help reduce the amount of cpufreq updates and make them more > > > purposeful, consolidate them into these locations: > > > > > > 1. context_switch() > > > 2. task_tick_fair() > > > 3. sched_balance_update_blocked_averages() > > > 4. on sched_setscheduler() syscall that changes policy or uclamp values > > > 5. on check_preempt_wakeup_fair() if wakeup preemption failed > > > 6. on __add_running_bw() to guarantee DL bandwidth requirements. > > > > > > The update at context switch should help guarantee that RT get the right > > > frequency straightaway when they're RUNNING. As mentioned though the > > > update will happen slightly after enqueue_task(); though in an ideal > > > world these tasks should be RUNNING ASAP and this additional delay > > > should be negligible. For fair tasks we need to make sure we send > > > a single update for every decay for the root cfs_rq. Any changes to the > > > rq will be deferred until the next task is ready to run, or we hit TICK. > > > But we are guaranteed the task is running at a level that meets its > > > requirements after enqueue. > > > > > > To guarantee RT and DL tasks updates are never missed, we add a new > > > SCHED_CPUFREQ_FORCE_UPDATE to ignore the rate_limit_us. If we are > > > already running at the right freq, the governor will end up doing > > > nothing, but we eliminate the risk of the task ending up accidentally > > > running at the wrong freq due to rate_limit_us. > > > > > > Similarly for iowait boost, we ignore rate limits. We also handle a case > > > of a boost reset prematurely by adding a guard in sugov_iowait_apply() > > > to reduce the boost after 1ms which seems iowait boost mechanism relied > > > on rate_limit_us and cfs_rq.decayed preventing any updates to happen > > > soon after iowait boost. > > > > > > The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit > > > time stamps otherwise we can end up delaying updates for normal > > > requests. > > > > Hi Qais, > > the idea of SCHED_CPUFREQ_FORCE_UPDATE and the possiblity of spamming > > freq updates still bothered me so let me share my thoughts even though > > it might be niche enough for us not to care. > > > > 1. 
On fast_switch systems, assuming they are fine with handling the > > actual updates, we have a bit more work on each context_switch() and > > some synchronisation, too. That should be fine, if anything there's > > some performance regression in a couple of niche cases. > > > > 2. On !fast_switch systems this gets more interesting IMO. So we have > > a sugov DEADLINE task wakeup for every (in a freq-diff resulting) > > update request. This task will preempt whatever and currently will > > pretty much always be running on the CPU it ran last on (so first CPU > > of the PD). > > The !fast_switch is a bit of concern for me too but not for the same > reason and maybe the opposite of yours IIUC your proposal below: > > With fast_switch we have the following sequence: > > sched_switch() to task A > cpufreq_driver_fast_switch -> write new freq target > run task A > > This is pretty straight forward but we have the following sequence > with !fast_switch > > sched_switch() to task A > queue_irq_work -> raise an IPI on local CPU > Handle IPI -> wakeup and queue sugov dl worker on local CPU (always > with 1 CPU per PD) > sched_switch() to sugov dl task > __cpufreq_driver_target() which can possibly block on a lock > sched_switch() to task A > run task A > sent a bit too early > We can possibly have 2 context switch and one IPi for each "normal" > context switch which is not really optimal It would be good to find a way to skip the spurious back and forth between the normal task and sugov > > > > > The weirdest case I can think of right now is two FAIR iowait tasks on > > e.g. CPU1 keep waking up the DEADLINE task on CPU0 (same PD) regardless > > of what is running there. > > Potentially that means two fair tasks on one CPU CPU-starving an RT > > task on another CPU, because it keeps getting preempted by the DEADLINE > > sugov worker. > > For this to actually happen we need to ensure the tasks > > context-switching actually results in a different requested frequency > > every time, which is a bit unlikely without UCLAMP_MAX, let's say task > > A has 512, task B 1024, task C (RT on CPU1 should have uclamp_min<=512 > > then too otherwise frequency may be dictated by the RT task anyway.) > > (Note the entire thing also works with Tasks A & B being lower-prio RT > > too, instead of FAIR and iowait.) > > > > Note that due to the nature of SCHED_DEADLINE and the sugov task having > > 10s period and 1s runtime this behavior is limited to 1s every 10s. > > The remaining 9s (replenishment time) we won't see any cpufreq updates > > for that PD at all though. > > > > To reproduce, I have [4,5] being one PD: > > > > fio --minimal --time_based --name=cpu5_1024uclamp --filename=/dev/nullb0 --runtime=10 --rw=randread --bs=4k--direct=1 --cpus_allowed=5 > > uclampset -M 512 fio --minimal --time_based --name=cpu5_512uclamp --filename=/dev/nullb0 --runtime=10 --rw=randread --bs=4k--direct=1 --cpus_allowed=5 > > and then your RT task on CPU4. > > > > Something like this would mitigate that, I think it makes sense even > > without your patch to get a more predictable idle pattern, just maybe > > not exactly this patch of course. 
> > > > -->8-- > > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > > index 64f614b3db20..c186f8f999fe 100644 > > --- a/kernel/sched/cpufreq_schedutil.c > > +++ b/kernel/sched/cpufreq_schedutil.c > > @@ -567,6 +567,8 @@ static void sugov_irq_work(struct irq_work *irq_work) > > > > sg_policy = container_of(irq_work, struct sugov_policy, irq_work); > > > > + /* Try to wake the task here to not preempt or wake up another CPU. */ > > + sg_policy->worker.task->wake_cpu = smp_processor_id(); > > kthread_queue_work(&sg_policy->worker, &sg_policy->work); > > } > > > > > > > >
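[Editorial illustration] The mitigation re-quoted above boils down to hinting that the sugov worker should wake on the CPU that raised the request. A rough userspace analogue, with thread affinity standing in for the scheduler's wake_cpu hint; note that affinity is a hard constraint while wake_cpu is only a hint the wakeup path may override, and everything below is illustrative rather than kernel code:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *freq_worker(void *arg)
{
	(void)arg;
	printf("worker running on CPU %d\n", sched_getcpu());
	return NULL;
}

int main(void)
{
	pthread_t worker;
	pthread_attr_t attr;
	cpu_set_t set;
	int this_cpu = sched_getcpu();

	/*
	 * Analogue of sugov_irq_work() setting worker.task->wake_cpu:
	 * steer the worker towards the CPU that raised the request so it
	 * does not preempt or wake up another CPU in the policy.
	 */
	CPU_ZERO(&set);
	CPU_SET(this_cpu, &set);

	pthread_attr_init(&attr);
	pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
	pthread_create(&worker, &attr, freq_worker, NULL);
	pthread_join(worker, NULL);

	printf("request was raised on CPU %d\n", this_cpu);
	return 0;
}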
On Sun, 28 Jul 2024 at 20:45, Qais Yousef <qyousef@layalina.io> wrote: > > Improve the interaction with cpufreq governors by making the > cpufreq_update_util() calls more intentional. > > At the moment we send them when load is updated for CFS, bandwidth for > DL and at enqueue/dequeue for RT. But this can lead to too many updates > sent in a short period of time and potentially be ignored at a critical > moment due to the rate_limit_us in schedutil. > > For example, simultaneous task enqueue on the CPU where 2nd task is > bigger and requires higher freq. The trigger to cpufreq_update_util() by > the first task will lead to dropping the 2nd request until tick. Or > another CPU in the same policy triggers a freq update shortly after. > > Updates at enqueue for RT are not strictly required. Though they do help > to reduce the delay for switching the frequency and the potential > observation of lower frequency during this delay. But current logic > doesn't intentionally (at least to my understanding) try to speed up the > request. > > To help reduce the amount of cpufreq updates and make them more > purposeful, consolidate them into these locations: > > 1. context_switch() > 2. task_tick_fair() > 3. sched_balance_update_blocked_averages() > 4. on sched_setscheduler() syscall that changes policy or uclamp values > 5. on check_preempt_wakeup_fair() if wakeup preemption failed > 6. on __add_running_bw() to guarantee DL bandwidth requirements. > > The update at context switch should help guarantee that RT get the right > frequency straightaway when they're RUNNING. As mentioned though the > update will happen slightly after enqueue_task(); though in an ideal > world these tasks should be RUNNING ASAP and this additional delay > should be negligible. For fair tasks we need to make sure we send > a single update for every decay for the root cfs_rq. Any changes to the > rq will be deferred until the next task is ready to run, or we hit TICK. > But we are guaranteed the task is running at a level that meets its > requirements after enqueue. > > To guarantee RT and DL tasks updates are never missed, we add a new > SCHED_CPUFREQ_FORCE_UPDATE to ignore the rate_limit_us. If we are > already running at the right freq, the governor will end up doing > nothing, but we eliminate the risk of the task ending up accidentally > running at the wrong freq due to rate_limit_us. > > Similarly for iowait boost, we ignore rate limits. We also handle a case > of a boost reset prematurely by adding a guard in sugov_iowait_apply() > to reduce the boost after 1ms which seems iowait boost mechanism relied > on rate_limit_us and cfs_rq.decayed preventing any updates to happen > soon after iowait boost. > > The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit > time stamps otherwise we can end up delaying updates for normal > requests. > > As a simple optimization, we avoid sending cpufreq updates when > switching from RT to another RT as RT tasks run at max freq by default. > If CONFIG_UCLAMP_TASK is enabled, we can do a simple check to see if > uclamp_min is different to avoid unnecessary cpufreq update as most RT > tasks are likely to be running at the same performance level, so we can > avoid unnecessary overhead of forced updates when there's nothing to do. > > We also ensure to ignore cpufreq udpates for sugov workers at context > switch if it was prev task. 
> > The update at task_tick_fair() will guarantee that the governor will > follow any updates to load for tasks/CPU or due to new enqueues/dequeues > to the rq. Since DL and RT always run at constant frequencies and have > no load tracking, this is only required for fair tasks. > > The update at update_blocked_averages() will ensure we decay frequency > as the CPU becomes idle for long enough. > > If the currently running task changes its policy or uclamp values, we > ensure we follow up with cpufreq update to ensure we follow up with any > potential new perf requirements based on the new change. > > To handle systems with long TICK where tasks could end up enqueued but > no preemption happens until TICK, we add an update in > check_preempt_wakeup_fair() if wake up preemption fails. This will send > special SCHED_CPUFREQ_TASK_ENQUEUED cpufreq update to tell the governor > that the state of the CPU has changed and it can consider an update if > it deems worthwhile. In schedutil this will do an update if no update > was done since 1ms which is how often util_avg changes roughly. > > To ensure DL tasks bandwidth are respected, we do the update on > __add_running_bw() instead of context switch as the delay could result > in missing a deadline when multiple DL tasks are RUNNING. > > Since now DL tasks always ignore rate limit, remove > ignore_dl_rate_limit() function as it's no longer necessary. > > Also move updating sg_cpu->last_update inside sugov_iowait_boost() where > this variable is associated and rename it to last_iowait_update to > better reflect it is iowait boost specific. > > Note worthy that we still have the following race condition on systems > that have shared policy: > > * CPUs with shared policy can end up sending simultaneous cpufreq > updates requests where the 2nd one will be unlucky and get blocked by > the rate_limit_us (schedutil). > > We can potentially address this limitation later, but it is out of the > scope of this patch. > > Signed-off-by: Qais Yousef <qyousef@layalina.io> > --- > > Changes since v6: > > * Fix typos in commit message > * Move DL to enqueue to address breaking bandwidth rules for DL > * Do freq updates for SCHED_IDLE too > * Ensure wakeup preemption will cause cpufreq updates even if > cfs_rq.decayed was false as util_est could be high and cfs_rq.decayed > wouldn't reflect that. > * Ensure we send an update if we switch to fair from RT or DL as this > is an opportunity to reduce freq even if cfs_rq.decayed is false. > * If sched_setsched() syscall for a queued task requires cpufreq > update, handle it like we do for wakeup_preemption_check() > * Use 1ms instead of base_slice to send an update if wakeup preemption > fails > * Fix a bug in setting sg_cpu->last_update being updated too early > causing some systems to always request 1024 io boost. > * Change delta_ns <= NSEC_PER_MSEC to be strictly less than > delta_ns < NSEC_PER_MSEC for iowait boost to match the condition for > when a task was enqueued. 
> * Moved the results of context switch test out of the commit messages > as I am seeing some variations that I am not sure are due to binary > differences causing weird caching effect or true overhead > > Results of > > taskset 1 perf record perf stat --repeat 10 -e cycles,instructions,task-clock perf bench sched pipe > > on AMD 3900X to verify any potential overhead because of the addition at > context switch against sched-core-2024-07-16 tip/sched/core > > tip sched-core-2024-07-16 schedutil: > ------------------------------------ > > Performance counter stats for '/home/qyousef/utils/perf bench sched pipe' (10 runs): > > 39,296,424,438 cycles # 3.208 GHz ( +- 0.05% ) > 20,350,055,343 instructions # 0.52 insn per cycle ( +- 0.03% ) > 12,274.17 msec task-clock # 1.002 CPUs utilized ( +- 0.06% ) > > 12.24917 +- 0.00783 seconds time elapsed ( +- 0.06% ) > > tip sched-core-2024-07-16 performance: > -------------------------------------- > > Performance counter stats for '/home/qyousef/utils/perf bench sched pipe' (10 runs): > > 40,610,243,585 cycles # 3.268 GHz ( +- 0.15% ) > 21,252,175,791 instructions # 0.53 insn per cycle ( +- 0.05% ) > 12,443.34 msec task-clock # 1.001 CPUs utilized ( +- 0.06% ) > > 12.42761 +- 0.00672 seconds time elapsed ( +- 0.05% ) > > patch: tip sched-core-2024-07-16 schedutil: > ------------------------------------------- > > Performance counter stats for '/home/qyousef/utils/perf bench sched pipe' (10 runs): > > 40,706,113,323 cycles # 3.253 GHz ( +- 0.07% ) > 21,163,304,319 instructions # 0.52 insn per cycle ( +- 0.04% ) > 12,494.93 msec task-clock # 0.998 CPUs utilized ( +- 0.04% ) > > 12.51557 +- 0.00486 seconds time elapsed ( +- 0.04% ) > > patch: tip sched-core-2024-07-16 performance: > --------------------------------------------- > > Performance counter stats for '/home/qyousef/utils/perf bench sched pipe' (10 runs): > > 39,654,998,545 cycles # 3.220 GHz ( +- 0.12% ) > 20,554,376,621 instructions # 0.52 insn per cycle ( +- 0.12% ) > 12,317.02 msec task-clock # 1.000 CPUs utilized ( +- 0.16% ) > > 12.3166 +- 0.0193 seconds time elapsed ( +- 0.16% ) > > We do better in performance governor than tip/sched/core. But schedutil looks > worse. Looking at perf diff I can see update_load_avg() and > sugov_update_single_freq() but not sure if this is due to this patch per se > rather than strange binary difference creating unexpected effect. The hot > instructions in update_load_avg() are not related to the new code added there. > Similarly for check_preempt_wakeup_fair(). > > For sugov_update_single_freq() this hasn't shown up in previous versions. > Removing the new cpufreq update in check_preempt_wakeup_fair() didn't help. > > Note that in v6 same test showed that schedutil was on par but performance was > slightly worse. Though the test was against 6.8.7 stable kernel then. 
> > perf diff schedutil: > -------------------- > > 10.56% -2.56% [kernel.kallsyms] [k] delay_halt_mwaitx > 14.56% -1.46% [kernel.kallsyms] [k] native_read_msr > 14.19% -1.40% [kernel.kallsyms] [k] native_write_msr > 0.63% +0.54% [kernel.kallsyms] [k] restore_fpregs_from_fpstate > 1.52% +0.52% [kernel.kallsyms] [k] update_load_avg > 0.01% +0.47% [kernel.kallsyms] [k] sugov_update_single_freq > 3.44% -0.35% [kernel.kallsyms] [k] amd_pmu_addr_offset > 4.67% -0.31% [kernel.kallsyms] [k] x86_pmu_disable_all > 0.35% +0.29% [kernel.kallsyms] [k] check_preempt_wakeup_fair > 1.81% -0.28% [kernel.kallsyms] [k] amd_pmu_check_overflow > 1.81% -0.27% [kernel.kallsyms] [k] amd_pmu_test_overflow_topbit > 1.20% +0.26% [kernel.kallsyms] [k] pick_next_task_fair > 0.01% +0.22% [kernel.kallsyms] [k] __get_user_8 > 1.41% +0.21% [kernel.kallsyms] [k] update_curr > 1.18% -0.21% [kernel.kallsyms] [k] delay_halt > 0.50% +0.21% [kernel.kallsyms] [k] pick_eevdf > 3.13% +0.20% [kernel.kallsyms] [k] srso_safe_ret > 0.00% +0.18% [kernel.kallsyms] [k] sugov_get_util > 1.23% +0.17% [kernel.kallsyms] [k] __schedule > 0.50% +0.16% [kernel.kallsyms] [k] enqueue_entity > 0.57% +0.16% [kernel.kallsyms] [k] psi_task_change > 0.57% +0.15% [kernel.kallsyms] [k] enqueue_task_fair > 1.06% -0.15% [kernel.kallsyms] [k] apparmor_file_permission > 0.80% +0.15% [kernel.kallsyms] [k] try_to_wake_up > 1.07% +0.14% [kernel.kallsyms] [k] psi_task_switch > 1.58% +0.14% [kernel.kallsyms] [k] pipe_write > 0.86% +0.14% [kernel.kallsyms] [k] syscall_exit_to_user_mode > 1.02% +0.13% [kernel.kallsyms] [k] native_sched_clock > 0.46% +0.11% [kernel.kallsyms] [k] __update_load_avg_se > > perf diff performance: > ---------------------- > > 13.09% +3.06% [kernel.kallsyms] [k] native_read_msr > 13.12% +2.84% [kernel.kallsyms] [k] native_write_msr > 7.94% +2.34% [kernel.kallsyms] [k] delay_halt_mwaitx > 2.15% -0.93% [kernel.kallsyms] [k] update_curr > 4.42% +0.87% [kernel.kallsyms] [k] x86_pmu_disable_all > 3.12% +0.74% [kernel.kallsyms] [k] amd_pmu_addr_offset > 2.84% -0.59% [kernel.kallsyms] [k] psi_group_change > 1.44% +0.53% [kernel.kallsyms] [k] amd_pmu_check_overflow > 1.45% +0.50% [kernel.kallsyms] [k] amd_pmu_test_overflow_topbit > 0.47% -0.47% [kernel.kallsyms] [k] __calc_delta.constprop.0 > 1.60% -0.40% [kernel.kallsyms] [k] pick_next_task_fair > 1.97% -0.37% [kernel.kallsyms] [k] update_load_avg > 0.57% -0.37% [kernel.kallsyms] [k] avg_vruntime > 0.82% -0.37% [kernel.kallsyms] [k] enqueue_task_fair > 1.54% -0.34% [kernel.kallsyms] [k] __schedule > 0.79% -0.32% [kernel.kallsyms] [k] pick_eevdf > 0.88% +0.32% [kernel.kallsyms] [k] delay_halt > 0.59% -0.28% [kernel.kallsyms] [k] update_cfs_group > 0.86% -0.25% [kernel.kallsyms] [k] try_to_wake_up > 1.18% -0.25% [kernel.kallsyms] [k] native_sched_clock > 0.45% -0.24% [kernel.kallsyms] [k] put_prev_entity > 0.49% -0.24% [kernel.kallsyms] [k] ttwu_do_activate > 0.64% -0.23% [kernel.kallsyms] [k] enqueue_entity > 0.72% -0.22% [kernel.kallsyms] [k] __update_load_avg_cfs_rq > 1.57% -0.22% [kernel.kallsyms] [k] pipe_write > 0.50% -0.20% [kernel.kallsyms] [k] update_min_vruntime > 3.31% -0.19% [kernel.kallsyms] [k] srso_safe_ret > 1.31% -0.18% [kernel.kallsyms] [k] psi_task_switch > 0.52% -0.18% [kernel.kallsyms] [k] check_preempt_wakeup_fair > 0.32% -0.16% [kernel.kallsyms] [k] __enqueue_entity > 0.87% -0.16% [kernel.kallsyms] [k] dequeue_task_fair > 0.44% -0.14% [kernel.kallsyms] [k] pick_next_entity > 0.63% -0.13% [kernel.kallsyms] [k] psi_task_change > 0.62% -0.13% [kernel.kallsyms] [k] 
sched_clock_cpu > > Changes since v5: > > * Fix a bug where switching between RT and sugov tasks triggered an > endless cycle of cpufreq updates. > * Only do cpufreq updates at tick for fair after verifying > rq->cfs.decayed > * Remove optimization in update_load_avg() to avoid sending an update > if util hasn't changed that caused a bug when switching from Idle > * Handle systems with long ticks by adding extra update on > check_preempt_wakeup_fair(). The idea is to rely on context switch > but still consider an update if wakeup preemption failed and no > update was sent since sysctl_sched_base_slice > * Remove ignore_dl_rate_limit() as this function is now redundant > * move sg_cpu->last_update = time inside sugov_iowait_boost() > * Update commit message with new details and with perf diff output > > Changes since v4: > > * Fix updating freq when uclamp changes before the dequeue/enqueue > dance. (Hongyan) > * Rebased on top of tip/sched/core 6.10-rc1 and resolve some conflicts > due to code shuffling to syscalls.c. Added new function > update_cpufreq_current() to be used outside core.c when > task_current() requires cpufreq update. > > Changes since v3: > > * Omit cpufreq updates at attach/detach_entity_load_avg(). They share > the update path from enqueue/dequeue which is not intended to trigger > an update. And task_change_group_fair() is not expected to cause the > root cfs_rq util to change significantly to warrant an immediate > update for enqueued tasks. Better defer for next context switch to > sample the state of the cpu taking all changes into account before > the next task is due to run. > Dietmar also pointed out a bug where we could send more updates vs > without the patch in this path as I wasn't sending the update for > cfs_rq == &rq->cfs. > > Changes since v2: > > * Clean up update_cpufreq_ctx_switch() to reduce branches (Peter) > * Fix issue with cpufreq updates missed on switching from idle (Vincent) > * perf bench sched pipe regressed after fixing the switch from idle, > detect when util_avg has changed when cfs_rq->decayed to fix it > * Ensure to issue cpufreq updates when task_current() switches > policy/uclamp values > > Changes since v1: > > * Use taskset and measure with performance governor as Ingo suggested > * Remove the static key as I found out we always register a function > for cpu_dbs in cpufreq_governor.c; and as Christian pointed out it > trigger a lock debug warning. > * Improve detection of sugov workers by using SCHED_FLAG_SUGOV > * Guard against NSEC_PER_MSEC instead of TICK_USEC to avoid prematurely > reducing iowait boost as the latter was a NOP and like > sugov_iowait_reset() like Christian pointed out. 
> > v1 discussion: https://lore.kernel.org/all/20240324020139.1032473-1-qyousef@layalina.io/ > v2 discussion: https://lore.kernel.org/lkml/20240505233103.168766-1-qyousef@layalina.io/ > v3 discussion: https://lore.kernel.org/lkml/20240512190018.531820-1-qyousef@layalina.io/ > v4 discussion: https://lore.kernel.org/lkml/20240516204802.846520-1-qyousef@layalina.io/ > v5 discussion: https://lore.kernel.org/lkml/20240530104653.1234004-1-qyousef@layalina.io/ > v6 discussion: https://lore.kernel.org/lkml/20240619201409.2071728-1-qyousef@layalina.io/ > > include/linux/sched/cpufreq.h | 4 +- > kernel/sched/core.c | 116 +++++++++++++++++++++++++++-- > kernel/sched/cpufreq_schedutil.c | 122 +++++++++++++++++++------------ > kernel/sched/deadline.c | 10 ++- > kernel/sched/fair.c | 91 +++++++++++------------ > kernel/sched/rt.c | 8 +- > kernel/sched/sched.h | 9 ++- > kernel/sched/syscalls.c | 30 ++++++-- > 8 files changed, 271 insertions(+), 119 deletions(-) > > diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h > index bdd31ab93bc5..5409a9f79cc0 100644 > --- a/include/linux/sched/cpufreq.h > +++ b/include/linux/sched/cpufreq.h > @@ -8,7 +8,9 @@ > * Interface between cpufreq drivers and the scheduler: > */ > > -#define SCHED_CPUFREQ_IOWAIT (1U << 0) > +#define SCHED_CPUFREQ_IOWAIT (1U << 0) > +#define SCHED_CPUFREQ_FORCE_UPDATE (1U << 1) /* ignore transition_delay_us */ > +#define SCHED_CPUFREQ_TASK_ENQUEUED (1U << 2) /* new fair task was enqueued */ > > #ifdef CONFIG_CPU_FREQ > struct cpufreq_policy; > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 6d35c48239be..a31d91a224d0 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -153,6 +153,9 @@ const_debug unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK; > > __read_mostly int scheduler_running; > > +static __always_inline void > +__update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev); > + > #ifdef CONFIG_SCHED_CORE > > DEFINE_STATIC_KEY_FALSE(__sched_core_enabled); > @@ -2038,17 +2041,24 @@ inline int task_curr(const struct task_struct *p) > * this means any call to check_class_changed() must be followed by a call to > * balance_callback(). > */ > -void check_class_changed(struct rq *rq, struct task_struct *p, > +bool check_class_changed(struct rq *rq, struct task_struct *p, > const struct sched_class *prev_class, > int oldprio) > { > + bool class_changed = false; > + > if (prev_class != p->sched_class) { > if (prev_class->switched_from) > prev_class->switched_from(rq, p); > > p->sched_class->switched_to(rq, p); > - } else if (oldprio != p->prio || dl_task(p)) > + > + class_changed = true; > + } else if (oldprio != p->prio || dl_task(p)) { > p->sched_class->prio_changed(rq, p, oldprio); > + } > + > + return class_changed; > } > > void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags) > @@ -4913,6 +4923,93 @@ static inline void __balance_callbacks(struct rq *rq) > > #endif > > +static __always_inline void > +__update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev) > +{ > +#ifdef CONFIG_CPU_FREQ > + if (prev && prev->dl.flags & SCHED_FLAG_SUGOV) { > + /* Sugov just did an update, don't be too aggressive */ > + return; > + } > + > + /* > + * RT and DL should always send a freq update. But we can do some > + * simple checks to avoid it when we know it's not necessary. > + * > + * iowait_boost will always trigger a freq update too. > + * > + * Fair tasks will only trigger an update if the root cfs_rq has > + * decayed. 
> + * > + * Everything else should do nothing. > + */ > + switch (current->policy) { > + case SCHED_NORMAL: > + case SCHED_BATCH: > + case SCHED_IDLE: > + if (unlikely(current->in_iowait)) { > + cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE); > + return; > + } > + > +#ifdef CONFIG_SMP > + /* > + * Send an update if we switched from RT or DL as they tend to > + * boost the CPU and we are likely able to reduce the freq now. > + */ > + rq->cfs.decayed |= prev && (rt_policy(prev->policy) || dl_policy(prev->policy)); > + > + if (unlikely(rq->cfs.decayed)) { My previous use case of a task non preempting current with large util_est is fixed with this version but I'm facing a new one a bit similar because of waiting for the context switch and the decay to try to update the frequency. When the task wakes up on an idle cpu, you wait for the decay to update the freq but if the freq is low and the pelt has been updated recently (less than 1024us) you can wait a long time before the next decay and the freq update. This is a problem if the task's util_est is large because you can stay several ms at low frequency before taking into account task's util_est > + rq->cfs.decayed = false; > + cpufreq_update_util(rq, 0); > + return; > + } > +#else > + cpufreq_update_util(rq, 0); > +#endif > + return; > + case SCHED_FIFO: > + case SCHED_RR: > + if (prev && rt_policy(prev->policy)) { > +#ifdef CONFIG_UCLAMP_TASK > + unsigned long curr_uclamp_min = uclamp_eff_value(current, UCLAMP_MIN); > + unsigned long prev_uclamp_min = uclamp_eff_value(prev, UCLAMP_MIN); > + > + if (curr_uclamp_min == prev_uclamp_min) > +#endif > + return; > + } > +#ifdef CONFIG_SMP > + /* Stopper task masquerades as RT */ > + if (unlikely(current->sched_class == &stop_sched_class)) > + return; > +#endif > + cpufreq_update_util(rq, SCHED_CPUFREQ_FORCE_UPDATE); > + return; > + case SCHED_DEADLINE: > + /* > + * This is handled at enqueue to avoid breaking DL bandwidth > + * rules when multiple DL tasks are running on the same CPU. > + * Deferring till context switch here could mean the bandwidth > + * calculations would be broken to ensure all the DL tasks meet > + * their deadlines. > + */ > + return; > + default: > + return; > + } > +#endif > +} > + > +/* > + * Call when currently running task had an attribute change that requires > + * an immediate cpufreq update. > + */ > +void update_cpufreq_current(struct rq *rq) > +{ > + __update_cpufreq_ctx_switch(rq, NULL); > +} > + > static inline void > prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf) > { > @@ -4930,7 +5027,7 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf > #endif > } > > -static inline void finish_lock_switch(struct rq *rq) > +static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev) > { > /* > * If we are tracking spinlock dependencies then we have to > @@ -4939,6 +5036,11 @@ static inline void finish_lock_switch(struct rq *rq) > */ > spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_); > __balance_callbacks(rq); > + /* > + * Request freq update after __balance_callbacks to take into account > + * any changes to rq. 
> + */ > + __update_cpufreq_ctx_switch(rq, prev); > raw_spin_rq_unlock_irq(rq); > } > > @@ -5057,7 +5159,7 @@ static struct rq *finish_task_switch(struct task_struct *prev) > perf_event_task_sched_in(prev, current); > finish_task(prev); > tick_nohz_task_switch(); > - finish_lock_switch(rq); > + finish_lock_switch(rq, prev); > finish_arch_post_lock_switch(); > kcov_finish_switch(current); > /* > @@ -6920,6 +7022,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task) > int prio, oldprio, queued, running, queue_flag = > DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; > const struct sched_class *prev_class; > + bool class_changed; > struct rq_flags rf; > struct rq *rq; > > @@ -7021,7 +7124,10 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task) > if (running) > set_next_task(rq, p); > > - check_class_changed(rq, p, prev_class, oldprio); > + class_changed = check_class_changed(rq, p, prev_class, oldprio); > + if (class_changed && running) > + update_cpufreq_current(rq); > + > out_unlock: > /* Avoid rq from going away on us: */ > preempt_disable(); > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > index eece6244f9d2..64f614b3db20 100644 > --- a/kernel/sched/cpufreq_schedutil.c > +++ b/kernel/sched/cpufreq_schedutil.c > @@ -44,7 +44,7 @@ struct sugov_cpu { > > bool iowait_boost_pending; > unsigned int iowait_boost; > - u64 last_update; > + u64 last_iowait_update; > > unsigned long util; > unsigned long bw_min; > @@ -59,10 +59,31 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); > > /************************ Governor internals ***********************/ > > -static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time) > +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time, > + unsigned int flags) > { > s64 delta_ns; > > + delta_ns = time - sg_policy->last_freq_update_time; > + > + /* > + * We want to update cpufreq at context switch, but on systems with > + * long TICK values, this can happen after a long time while more tasks > + * would have been added meanwhile leaving us potentially running at > + * inadequate frequency for extended period of time. > + * > + * This logic should only apply when new fair task was added to the > + * CPU, we'd want to defer to context switch as much as possible, but > + * to avoid the potential delays mentioned above, let's check if this > + * additional tasks warrants sending an update sooner. > + * > + * We want to ensure there's at least an update every 1ms. > + */ > + if (likely(flags & SCHED_CPUFREQ_TASK_ENQUEUED)) { > + if (delta_ns < NSEC_PER_MSEC) > + return false; > + } > + > /* > * Since cpufreq_update_util() is called with rq->lock held for > * the @target_cpu, our per-CPU data is fully serialized. 
> @@ -87,13 +108,14 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time) > return true; > } > > - delta_ns = time - sg_policy->last_freq_update_time; > + if (unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE)) > + return true; > > return delta_ns >= sg_policy->freq_update_delay_ns; > } > > static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time, > - unsigned int next_freq) > + unsigned int next_freq, unsigned int flags) > { > if (sg_policy->need_freq_update) > sg_policy->need_freq_update = cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS); > @@ -101,7 +123,9 @@ static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time, > return false; > > sg_policy->next_freq = next_freq; > - sg_policy->last_freq_update_time = time; > + > + if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE)) > + sg_policy->last_freq_update_time = time; > > return true; > } > @@ -219,7 +243,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost) > static bool sugov_iowait_reset(struct sugov_cpu *sg_cpu, u64 time, > bool set_iowait_boost) > { > - s64 delta_ns = time - sg_cpu->last_update; > + s64 delta_ns = time - sg_cpu->last_iowait_update; > > /* Reset boost only if a tick has elapsed since last request */ > if (delta_ns <= TICK_NSEC) > @@ -249,30 +273,33 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, > unsigned int flags) > { > bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT; > + bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE; > > /* Reset boost if the CPU appears to have been idle enough */ > - if (sg_cpu->iowait_boost && > + if (sg_cpu->iowait_boost && !forced_update && > sugov_iowait_reset(sg_cpu, time, set_iowait_boost)) > - return; > + goto done; > > /* Boost only tasks waking up after IO */ > if (!set_iowait_boost) > - return; > + goto done; > > /* Ensure boost doubles only one time at each request */ > if (sg_cpu->iowait_boost_pending) > - return; > + goto done; > sg_cpu->iowait_boost_pending = true; > > /* Double the boost at each request */ > if (sg_cpu->iowait_boost) { > sg_cpu->iowait_boost = > min_t(unsigned int, sg_cpu->iowait_boost << 1, SCHED_CAPACITY_SCALE); > - return; > + goto done; > } > > /* First wakeup after IO: start with minimum boost */ > sg_cpu->iowait_boost = IOWAIT_BOOST_MIN; > +done: > + sg_cpu->last_iowait_update = time; > } > > /** > @@ -294,17 +321,34 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, > * being more conservative on tasks which does sporadic IO operations. > */ > static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time, > - unsigned long max_cap) > + unsigned long max_cap, unsigned int flags) > { > + bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE; > + s64 delta_ns = time - sg_cpu->last_iowait_update; > + > /* No boost currently required */ > if (!sg_cpu->iowait_boost) > return 0; > > + if (forced_update) > + goto apply_boost; > + > /* Reset boost if the CPU appears to have been idle enough */ > if (sugov_iowait_reset(sg_cpu, time, false)) > return 0; > > if (!sg_cpu->iowait_boost_pending) { > + /* > + * This logic relied on PELT signal decays happening once every > + * 1ms. But due to changes to how updates are done now, we can > + * end up with more request coming up leading to iowait boost > + * to be prematurely reduced. Make the assumption explicit > + * until we improve the iowait boost logic to be better in > + * general as it is due for an overhaul. 
> + */ > + if (delta_ns < NSEC_PER_MSEC) > + goto apply_boost; > + > /* > * No boost pending; reduce the boost value. > */ > @@ -315,6 +359,7 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time, > } > } > > +apply_boost: > sg_cpu->iowait_boost_pending = false; > > /* > @@ -337,31 +382,18 @@ static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) > static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; } > #endif /* CONFIG_NO_HZ_COMMON */ > > -/* > - * Make sugov_should_update_freq() ignore the rate limit when DL > - * has increased the utilization. > - */ > -static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu) > -{ > - if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_min) > - sg_cpu->sg_policy->limits_changed = true; > -} > - > static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu, > u64 time, unsigned long max_cap, > unsigned int flags) > { > unsigned long boost; > > - sugov_iowait_boost(sg_cpu, time, flags); > - sg_cpu->last_update = time; > - > - ignore_dl_rate_limit(sg_cpu); > - > - if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) > + if (!sugov_should_update_freq(sg_cpu->sg_policy, time, flags)) > return false; > > - boost = sugov_iowait_apply(sg_cpu, time, max_cap); > + sugov_iowait_boost(sg_cpu, time, flags); > + > + boost = sugov_iowait_apply(sg_cpu, time, max_cap, flags); > sugov_get_util(sg_cpu, boost); > > return true; > @@ -397,7 +429,7 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time, > sg_policy->cached_raw_freq = cached_freq; > } > > - if (!sugov_update_next_freq(sg_policy, time, next_f)) > + if (!sugov_update_next_freq(sg_policy, time, next_f, flags)) > return; > > /* > @@ -449,10 +481,12 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time, > cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min, > sg_cpu->util, max_cap); > > - sg_cpu->sg_policy->last_freq_update_time = time; > + if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE)) > + sg_cpu->sg_policy->last_freq_update_time = time; > } > > -static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time) > +static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time, > + unsigned int flags) > { > struct sugov_policy *sg_policy = sg_cpu->sg_policy; > struct cpufreq_policy *policy = sg_policy->policy; > @@ -465,7 +499,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time) > struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j); > unsigned long boost; > > - boost = sugov_iowait_apply(j_sg_cpu, time, max_cap); > + boost = sugov_iowait_apply(j_sg_cpu, time, max_cap, flags); > sugov_get_util(j_sg_cpu, boost); > > util = max(j_sg_cpu->util, util); > @@ -483,22 +517,20 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags) > > raw_spin_lock(&sg_policy->update_lock); > > - sugov_iowait_boost(sg_cpu, time, flags); > - sg_cpu->last_update = time; > + if (!sugov_should_update_freq(sg_policy, time, flags)) > + goto unlock; > > - ignore_dl_rate_limit(sg_cpu); > + sugov_iowait_boost(sg_cpu, time, flags); > > - if (sugov_should_update_freq(sg_policy, time)) { > - next_f = sugov_next_freq_shared(sg_cpu, time); > + next_f = sugov_next_freq_shared(sg_cpu, time, flags); > > - if (!sugov_update_next_freq(sg_policy, time, next_f)) > - goto unlock; > + if (!sugov_update_next_freq(sg_policy, time, next_f, flags)) > + goto unlock; > > - if (sg_policy->policy->fast_switch_enabled) > - 
cpufreq_driver_fast_switch(sg_policy->policy, next_f); > - else > - sugov_deferred_update(sg_policy); > - } > + if (sg_policy->policy->fast_switch_enabled) > + cpufreq_driver_fast_switch(sg_policy->policy, next_f); > + else > + sugov_deferred_update(sg_policy); > unlock: > raw_spin_unlock(&sg_policy->update_lock); > } > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c > index f59e5c19d944..8a4ccf532a7b 100644 > --- a/kernel/sched/deadline.c > +++ b/kernel/sched/deadline.c > @@ -251,8 +251,12 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq) > dl_rq->running_bw += dl_bw; > SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */ > SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw); > - /* kick cpufreq (see the comment in kernel/sched/sched.h). */ > - cpufreq_update_util(rq_of_dl_rq(dl_rq), 0); > + /* > + * Context switch handles updates, but this is an exception to ensure > + * multiple DL tasks run at the correct frequencies. We don't need > + * a cpufreq update on dequeue, context switch will handle that. > + */ > + cpufreq_update_util(rq_of_dl_rq(dl_rq), SCHED_CPUFREQ_FORCE_UPDATE); > } > > static inline > @@ -265,8 +269,6 @@ void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq) > SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */ > if (dl_rq->running_bw > old) > dl_rq->running_bw = 0; > - /* kick cpufreq (see the comment in kernel/sched/sched.h). */ > - cpufreq_update_util(rq_of_dl_rq(dl_rq), 0); > } > > static inline > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 9057584ec06d..8fe7a7124c70 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -3987,29 +3987,6 @@ static inline void update_cfs_group(struct sched_entity *se) > } > #endif /* CONFIG_FAIR_GROUP_SCHED */ > > -static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags) > -{ > - struct rq *rq = rq_of(cfs_rq); > - > - if (&rq->cfs == cfs_rq) { > - /* > - * There are a few boundary cases this might miss but it should > - * get called often enough that that should (hopefully) not be > - * a real problem. > - * > - * It will not get called when we go idle, because the idle > - * thread is a different class (!fair), nor will the utilization > - * number include things like RT tasks. > - * > - * As is, the util number is not freq-invariant (we'd have to > - * implement arch_scale_freq_capacity() for that). > - * > - * See cpu_util_cfs(). > - */ > - cpufreq_update_util(rq, flags); > - } > -} > - > #ifdef CONFIG_SMP > static inline bool load_avg_is_decayed(struct sched_avg *sa) > { > @@ -4687,8 +4664,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > add_tg_cfs_propagate(cfs_rq, se->avg.load_sum); > > - cfs_rq_util_change(cfs_rq, 0); > - > trace_pelt_cfs_tp(cfs_rq); > } > > @@ -4717,8 +4692,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum); > > - cfs_rq_util_change(cfs_rq, 0); > - > trace_pelt_cfs_tp(cfs_rq); > } > > @@ -4765,12 +4738,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > */ > detach_entity_load_avg(cfs_rq, se); > update_tg_load_avg(cfs_rq); > - } else if (decayed) { > - cfs_rq_util_change(cfs_rq, 0); > - > - if (flags & UPDATE_TG) > - update_tg_load_avg(cfs_rq); > + } else if (decayed && (flags & UPDATE_TG)) { > + update_tg_load_avg(cfs_rq); > } > + > + /* > + * If this is the root cfs_rq, set the decayed flag to let the world > + * know a cpufreq update is required. 
> + */ > + if (cfs_rq == &rq_of(cfs_rq)->cfs) > + cfs_rq->decayed |= decayed; > } > > /* > @@ -5144,7 +5121,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) > > static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1) > { > - cfs_rq_util_change(cfs_rq, 0); > } > > static inline void remove_entity_load_avg(struct sched_entity *se) {} > @@ -6759,14 +6735,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) > */ > util_est_enqueue(&rq->cfs, p); > > - /* > - * If in_iowait is set, the code below may not trigger any cpufreq > - * utilization updates, so do it here explicitly with the IOWAIT flag > - * passed. > - */ > - if (p->in_iowait) > - cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT); > - > for_each_sched_entity(se) { > if (se->on_rq) > break; > @@ -8353,7 +8321,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > int cse_is_idle, pse_is_idle; > > if (unlikely(se == pse)) > - return; > + goto nopreempt; > > /* > * This is possible from callers such as attach_tasks(), in which we > @@ -8362,7 +8330,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * next-buddy nomination below. > */ > if (unlikely(throttled_hierarchy(cfs_rq_of(pse)))) > - return; > + goto nopreempt; > > if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK)) { > set_next_buddy(pse); > @@ -8379,7 +8347,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * below. > */ > if (test_tsk_need_resched(curr)) > - return; > + goto nopreempt; > > /* Idle tasks are by definition preempted by non-idle tasks. */ > if (unlikely(task_has_idle_policy(curr)) && > @@ -8391,7 +8359,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * is driven by the tick): > */ > if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION)) > - return; > + goto nopreempt; > > find_matching_se(&se, &pse); > WARN_ON_ONCE(!pse); > @@ -8406,7 +8374,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > if (cse_is_idle && !pse_is_idle) > goto preempt; > if (cse_is_idle != pse_is_idle) > - return; > + goto nopreempt; > > cfs_rq = cfs_rq_of(se); > update_curr(cfs_rq); > @@ -8417,6 +8385,24 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > if (pick_eevdf(cfs_rq) == pse) > goto preempt; > > +nopreempt: > + if (rq->cfs.h_nr_running > 1) { > +#ifdef CONFIG_SMP > + /* > + * When a task is added, its util_est could be high but the > + * enqueue might not have caused rq->cfs.decayed to be updated > + * as it is small after a long sleep. So set it to ensure next > + * context switch will definitely trigger an update after the > + * new enqueue. > + * > + * TODO: we need to make cpufreq_update_util() return true if > + * the operation was successful or false if it failed and use > + * that to reset rq->cfs.decayed. > + */ > + rq->cfs.decayed = true; > +#endif > + cpufreq_update_util(rq, SCHED_CPUFREQ_TASK_ENQUEUED); > + } > return; > > preempt: > @@ -9352,10 +9338,6 @@ static bool __update_blocked_others(struct rq *rq, bool *done) > unsigned long hw_pressure; > bool decayed; > > - /* > - * update_load_avg() can call cpufreq_update_util(). Make sure that RT, > - * DL and IRQ signals have been updated before updating CFS. 
> - */ > curr_class = rq->curr->sched_class; > > hw_pressure = arch_scale_hw_pressure(cpu_of(rq)); > @@ -12692,6 +12674,15 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued) > update_misfit_status(curr, rq); > check_update_overutilized_status(task_rq(curr)); > > +#ifdef CONFIG_SMP > + if (rq->cfs.decayed) { > + rq->cfs.decayed = false; > + cpufreq_update_util(rq, 0); > + } > +#else > + cpufreq_update_util(rq, 0); > +#endif > + > task_tick_core(rq, curr); > } > > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c > index 63e49c8ffc4d..92ed373e5b90 100644 > --- a/kernel/sched/rt.c > +++ b/kernel/sched/rt.c > @@ -555,11 +555,8 @@ static void sched_rt_rq_dequeue(struct rt_rq *rt_rq) > > rt_se = rt_rq->tg->rt_se[cpu]; > > - if (!rt_se) { > + if (!rt_se) > dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running); > - /* Kick cpufreq (see the comment in kernel/sched/sched.h). */ > - cpufreq_update_util(rq_of_rt_rq(rt_rq), 0); > - } > else if (on_rt_rq(rt_se)) > dequeue_rt_entity(rt_se, 0); > } > @@ -1064,9 +1061,6 @@ enqueue_top_rt_rq(struct rt_rq *rt_rq) > add_nr_running(rq, rt_rq->rt_nr_running); > rt_rq->rt_queued = 1; > } > - > - /* Kick cpufreq (see the comment in kernel/sched/sched.h). */ > - cpufreq_update_util(rq, 0); > } > > #if defined CONFIG_SMP > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 4c36cc680361..1fc9339dd5c7 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -639,6 +639,11 @@ struct cfs_rq { > unsigned long runnable_avg; > } removed; > > + /* > + * Store whether last update_load_avg() has decayed > + */ > + bool decayed; > + > #ifdef CONFIG_FAIR_GROUP_SCHED > u64 last_update_tg_load_avg; > unsigned long tg_load_avg_contrib; > @@ -3609,10 +3614,12 @@ extern void set_load_weight(struct task_struct *p, bool update_load); > extern void enqueue_task(struct rq *rq, struct task_struct *p, int flags); > extern void dequeue_task(struct rq *rq, struct task_struct *p, int flags); > > -extern void check_class_changed(struct rq *rq, struct task_struct *p, > +extern bool check_class_changed(struct rq *rq, struct task_struct *p, > const struct sched_class *prev_class, > int oldprio); > > +extern void update_cpufreq_current(struct rq *rq); > + > #ifdef CONFIG_SMP > extern struct balance_callback *splice_balance_callbacks(struct rq *rq); > extern void balance_callbacks(struct rq *rq, struct balance_callback *head); > diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c > index ae1b42775ef9..c24769cf1a4f 100644 > --- a/kernel/sched/syscalls.c > +++ b/kernel/sched/syscalls.c > @@ -491,7 +491,7 @@ static bool uclamp_reset(const struct sched_attr *attr, > return false; > } > > -static void __setscheduler_uclamp(struct task_struct *p, > +static bool __setscheduler_uclamp(struct task_struct *p, > const struct sched_attr *attr) > { > enum uclamp_id clamp_id; > @@ -517,7 +517,7 @@ static void __setscheduler_uclamp(struct task_struct *p, > } > > if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP))) > - return; > + return false; > > if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN && > attr->sched_util_min != -1) { > @@ -530,6 +530,8 @@ static void __setscheduler_uclamp(struct task_struct *p, > uclamp_se_set(&p->uclamp_req[UCLAMP_MAX], > attr->sched_util_max, true); > } > + > + return true; > } > > #else /* !CONFIG_UCLAMP_TASK: */ > @@ -539,8 +541,11 @@ static inline int uclamp_validate(struct task_struct *p, > { > return -EOPNOTSUPP; > } > -static void __setscheduler_uclamp(struct task_struct *p, > - const struct sched_attr 
*attr) { } > +static bool __setscheduler_uclamp(struct task_struct *p, > + const struct sched_attr *attr) > +{ > + return false; > +} > #endif > > /* > @@ -614,6 +619,7 @@ int __sched_setscheduler(struct task_struct *p, > int retval, oldprio, newprio, queued, running; > const struct sched_class *prev_class; > struct balance_callback *head; > + bool update_cpufreq; > struct rq_flags rf; > int reset_on_fork; > int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; > @@ -796,7 +802,8 @@ int __sched_setscheduler(struct task_struct *p, > __setscheduler_params(p, attr); > __setscheduler_prio(p, newprio); > } > - __setscheduler_uclamp(p, attr); > + > + update_cpufreq = __setscheduler_uclamp(p, attr); > > if (queued) { > /* > @@ -811,7 +818,18 @@ int __sched_setscheduler(struct task_struct *p, > if (running) > set_next_task(rq, p); > > - check_class_changed(rq, p, prev_class, oldprio); > + update_cpufreq |= check_class_changed(rq, p, prev_class, oldprio); > + > + /* > + * Changing class or uclamp value implies requiring to send cpufreq > + * update. > + */ > + if (update_cpufreq) { > + if (running) > + update_cpufreq_current(rq); > + else if (queued) > + cpufreq_update_util(rq, SCHED_CPUFREQ_TASK_ENQUEUED); > + } > > /* Avoid rq from going away on us: */ > preempt_disable(); > -- > 2.34.1 >
On 8/13/24 09:27, Vincent Guittot wrote: > On Tue, 13 Aug 2024 at 10:25, Vincent Guittot > <vincent.guittot@linaro.org> wrote: >> >> On Mon, 5 Aug 2024 at 17:35, Christian Loehle <christian.loehle@arm.com> wrote: >>> >>> On 7/28/24 19:45, Qais Yousef wrote: >>>> Improve the interaction with cpufreq governors by making the >>>> cpufreq_update_util() calls more intentional. >>>> >>>> At the moment we send them when load is updated for CFS, bandwidth for >>>> DL and at enqueue/dequeue for RT. But this can lead to too many updates >>>> sent in a short period of time and potentially be ignored at a critical >>>> moment due to the rate_limit_us in schedutil. >>>> >>>> For example, simultaneous task enqueue on the CPU where 2nd task is >>>> bigger and requires higher freq. The trigger to cpufreq_update_util() by >>>> the first task will lead to dropping the 2nd request until tick. Or >>>> another CPU in the same policy triggers a freq update shortly after. >>>> >>>> Updates at enqueue for RT are not strictly required. Though they do help >>>> to reduce the delay for switching the frequency and the potential >>>> observation of lower frequency during this delay. But current logic >>>> doesn't intentionally (at least to my understanding) try to speed up the >>>> request. >>>> >>>> To help reduce the amount of cpufreq updates and make them more >>>> purposeful, consolidate them into these locations: >>>> >>>> 1. context_switch() >>>> 2. task_tick_fair() >>>> 3. sched_balance_update_blocked_averages() >>>> 4. on sched_setscheduler() syscall that changes policy or uclamp values >>>> 5. on check_preempt_wakeup_fair() if wakeup preemption failed >>>> 6. on __add_running_bw() to guarantee DL bandwidth requirements. >>>> >>>> The update at context switch should help guarantee that RT get the right >>>> frequency straightaway when they're RUNNING. As mentioned though the >>>> update will happen slightly after enqueue_task(); though in an ideal >>>> world these tasks should be RUNNING ASAP and this additional delay >>>> should be negligible. For fair tasks we need to make sure we send >>>> a single update for every decay for the root cfs_rq. Any changes to the >>>> rq will be deferred until the next task is ready to run, or we hit TICK. >>>> But we are guaranteed the task is running at a level that meets its >>>> requirements after enqueue. >>>> >>>> To guarantee RT and DL tasks updates are never missed, we add a new >>>> SCHED_CPUFREQ_FORCE_UPDATE to ignore the rate_limit_us. If we are >>>> already running at the right freq, the governor will end up doing >>>> nothing, but we eliminate the risk of the task ending up accidentally >>>> running at the wrong freq due to rate_limit_us. >>>> >>>> Similarly for iowait boost, we ignore rate limits. We also handle a case >>>> of a boost reset prematurely by adding a guard in sugov_iowait_apply() >>>> to reduce the boost after 1ms which seems iowait boost mechanism relied >>>> on rate_limit_us and cfs_rq.decayed preventing any updates to happen >>>> soon after iowait boost. >>>> >>>> The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit >>>> time stamps otherwise we can end up delaying updates for normal >>>> requests. >>> >>> Hi Qais, >>> the idea of SCHED_CPUFREQ_FORCE_UPDATE and the possiblity of spamming >>> freq updates still bothered me so let me share my thoughts even though >>> it might be niche enough for us not to care. >>> >>> 1. 
On fast_switch systems, assuming they are fine with handling the >>> actual updates, we have a bit more work on each context_switch() and >>> some synchronisation, too. That should be fine, if anything there's >>> some performance regression in a couple of niche cases. >>> >>> 2. On !fast_switch systems this gets more interesting IMO. So we have >>> a sugov DEADLINE task wakeup for every (in a freq-diff resulting) >>> update request. This task will preempt whatever and currently will >>> pretty much always be running on the CPU it ran last on (so first CPU >>> of the PD). >> >> The !fast_switch is a bit of concern for me too but not for the same >> reason and maybe the opposite of yours IIUC your proposal below: >> >> With fast_switch we have the following sequence: >> >> sched_switch() to task A >> cpufreq_driver_fast_switch -> write new freq target >> run task A >> >> This is pretty straight forward but we have the following sequence >> with !fast_switch >> >> sched_switch() to task A >> queue_irq_work -> raise an IPI on local CPU >> Handle IPI -> wakeup and queue sugov dl worker on local CPU (always >> with 1 CPU per PD) >> sched_switch() to sugov dl task >> __cpufreq_driver_target() which can possibly block on a lock >> sched_switch() to task A >> run task A >> > > sent a bit too early > >> We can possibly have 2 context switch and one IPi for each "normal" >> context switch which is not really optimal > > It would be good to find a way to skip the spurious back and forth > between the normal task and sugov Just to confirm I understand your concern correctly, that's more or less the behavior without Qais' patch as well though, isn't it? Ignoring the move from "This happens at enqueue" vs. "this happens at context switch". Since sugov doesn't queue any work if the desired frequency doesn't change I imagine it's not too bad? Or are you more concerned that the work is queued multiple times simultaneously for multiple CPUs of the same PD? If so we can work around that with the work_in_progress state to at least limit that window by a lot. [snip]
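For reference, the !fast_switch path and the work_in_progress coalescing referred to above look roughly like the following in mainline kernel/sched/cpufreq_schedutil.c (abridged sketch for orientation only, not part of this patch; details can differ between kernel versions). The irq_work handler queues sugov_work() on the per-policy SCHED_DEADLINE kthread, and __cpufreq_driver_target() is the call that may block on a lock:

static void sugov_deferred_update(struct sugov_policy *sg_policy)
{
	/* Coalesce requests: at most one irq_work/kthread kick per policy in flight */
	if (!sg_policy->work_in_progress) {
		sg_policy->work_in_progress = true;
		irq_work_queue(&sg_policy->irq_work);	/* IPI on the local CPU */
	}
}

static void sugov_work(struct kthread_work *work)
{
	struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
	unsigned int freq;
	unsigned long flags;

	/* Snapshot the request and clear work_in_progress under update_lock */
	raw_spin_lock_irqsave(&sg_policy->update_lock, flags);
	freq = sg_policy->next_freq;
	sg_policy->work_in_progress = false;
	raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);

	/* This is the step that can block, as noted in the sequences above */
	mutex_lock(&sg_policy->work_lock);
	__cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
	mutex_unlock(&sg_policy->work_lock);
}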
On Tue, 13 Aug 2024 at 18:26, Christian Loehle <christian.loehle@arm.com> wrote: > > On 8/13/24 09:27, Vincent Guittot wrote: > > On Tue, 13 Aug 2024 at 10:25, Vincent Guittot > > <vincent.guittot@linaro.org> wrote: > >> > >> On Mon, 5 Aug 2024 at 17:35, Christian Loehle <christian.loehle@arm.com> wrote: > >>> > >>> On 7/28/24 19:45, Qais Yousef wrote: > >>>> Improve the interaction with cpufreq governors by making the > >>>> cpufreq_update_util() calls more intentional. > >>>> > >>>> At the moment we send them when load is updated for CFS, bandwidth for > >>>> DL and at enqueue/dequeue for RT. But this can lead to too many updates > >>>> sent in a short period of time and potentially be ignored at a critical > >>>> moment due to the rate_limit_us in schedutil. > >>>> > >>>> For example, simultaneous task enqueue on the CPU where 2nd task is > >>>> bigger and requires higher freq. The trigger to cpufreq_update_util() by > >>>> the first task will lead to dropping the 2nd request until tick. Or > >>>> another CPU in the same policy triggers a freq update shortly after. > >>>> > >>>> Updates at enqueue for RT are not strictly required. Though they do help > >>>> to reduce the delay for switching the frequency and the potential > >>>> observation of lower frequency during this delay. But current logic > >>>> doesn't intentionally (at least to my understanding) try to speed up the > >>>> request. > >>>> > >>>> To help reduce the amount of cpufreq updates and make them more > >>>> purposeful, consolidate them into these locations: > >>>> > >>>> 1. context_switch() > >>>> 2. task_tick_fair() > >>>> 3. sched_balance_update_blocked_averages() > >>>> 4. on sched_setscheduler() syscall that changes policy or uclamp values > >>>> 5. on check_preempt_wakeup_fair() if wakeup preemption failed > >>>> 6. on __add_running_bw() to guarantee DL bandwidth requirements. > >>>> > >>>> The update at context switch should help guarantee that RT get the right > >>>> frequency straightaway when they're RUNNING. As mentioned though the > >>>> update will happen slightly after enqueue_task(); though in an ideal > >>>> world these tasks should be RUNNING ASAP and this additional delay > >>>> should be negligible. For fair tasks we need to make sure we send > >>>> a single update for every decay for the root cfs_rq. Any changes to the > >>>> rq will be deferred until the next task is ready to run, or we hit TICK. > >>>> But we are guaranteed the task is running at a level that meets its > >>>> requirements after enqueue. > >>>> > >>>> To guarantee RT and DL tasks updates are never missed, we add a new > >>>> SCHED_CPUFREQ_FORCE_UPDATE to ignore the rate_limit_us. If we are > >>>> already running at the right freq, the governor will end up doing > >>>> nothing, but we eliminate the risk of the task ending up accidentally > >>>> running at the wrong freq due to rate_limit_us. > >>>> > >>>> Similarly for iowait boost, we ignore rate limits. We also handle a case > >>>> of a boost reset prematurely by adding a guard in sugov_iowait_apply() > >>>> to reduce the boost after 1ms which seems iowait boost mechanism relied > >>>> on rate_limit_us and cfs_rq.decayed preventing any updates to happen > >>>> soon after iowait boost. > >>>> > >>>> The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit > >>>> time stamps otherwise we can end up delaying updates for normal > >>>> requests. 
> >>> > >>> Hi Qais, > >>> the idea of SCHED_CPUFREQ_FORCE_UPDATE and the possiblity of spamming > >>> freq updates still bothered me so let me share my thoughts even though > >>> it might be niche enough for us not to care. > >>> > >>> 1. On fast_switch systems, assuming they are fine with handling the > >>> actual updates, we have a bit more work on each context_switch() and > >>> some synchronisation, too. That should be fine, if anything there's > >>> some performance regression in a couple of niche cases. > >>> > >>> 2. On !fast_switch systems this gets more interesting IMO. So we have > >>> a sugov DEADLINE task wakeup for every (in a freq-diff resulting) > >>> update request. This task will preempt whatever and currently will > >>> pretty much always be running on the CPU it ran last on (so first CPU > >>> of the PD). > >> > >> The !fast_switch is a bit of concern for me too but not for the same > >> reason and maybe the opposite of yours IIUC your proposal below: > >> > >> With fast_switch we have the following sequence: > >> > >> sched_switch() to task A > >> cpufreq_driver_fast_switch -> write new freq target > >> run task A > >> > >> This is pretty straight forward but we have the following sequence > >> with !fast_switch > >> > >> sched_switch() to task A > >> queue_irq_work -> raise an IPI on local CPU > >> Handle IPI -> wakeup and queue sugov dl worker on local CPU (always > >> with 1 CPU per PD) > >> sched_switch() to sugov dl task > >> __cpufreq_driver_target() which can possibly block on a lock > >> sched_switch() to task A > >> run task A > >> > > > > sent a bit too early > > > >> We can possibly have 2 context switch and one IPi for each "normal" > >> context switch which is not really optimal > > > > It would be good to find a way to skip the spurious back and forth > > between the normal task and sugov > > Just to confirm I understand your concern correctly, that's more or > less the behavior without Qais' patch as well though, isn't it? > Ignoring the move from "This happens at enqueue" vs. "this > happens at context switch". without Qais patch, we save a useless context switch to task A enqueue task A queue_irq_work -> raise an IPI on local CPU Handle IPI -> wakeup and queue sugov dl worker on local CPU (always with 1 CPU per PD) sched_switch() to sugov dl task __cpufreq_driver_target() which can possibly block on a lock sched_switch() to task A > Since sugov doesn't queue any work if the desired frequency doesn't > change I imagine it's not too bad? > Or are you more concerned that the work is queued multiple times > simultaneously for multiple CPUs of the same PD? If so we can > work around that with the work_in_progress state to at least limit > that window by a lot. > > [snip]
On 8/13/24 17:43, Vincent Guittot wrote: > On Tue, 13 Aug 2024 at 18:26, Christian Loehle <christian.loehle@arm.com> wrote: >> >> On 8/13/24 09:27, Vincent Guittot wrote: >>> On Tue, 13 Aug 2024 at 10:25, Vincent Guittot >>> <vincent.guittot@linaro.org> wrote: >>>> >>>> On Mon, 5 Aug 2024 at 17:35, Christian Loehle <christian.loehle@arm.com> wrote: >>>>> >>>>> On 7/28/24 19:45, Qais Yousef wrote: >>>>>> Improve the interaction with cpufreq governors by making the >>>>>> cpufreq_update_util() calls more intentional. >>>>>> >>>>>> At the moment we send them when load is updated for CFS, bandwidth for >>>>>> DL and at enqueue/dequeue for RT. But this can lead to too many updates >>>>>> sent in a short period of time and potentially be ignored at a critical >>>>>> moment due to the rate_limit_us in schedutil. >>>>>> >>>>>> For example, simultaneous task enqueue on the CPU where 2nd task is >>>>>> bigger and requires higher freq. The trigger to cpufreq_update_util() by >>>>>> the first task will lead to dropping the 2nd request until tick. Or >>>>>> another CPU in the same policy triggers a freq update shortly after. >>>>>> >>>>>> Updates at enqueue for RT are not strictly required. Though they do help >>>>>> to reduce the delay for switching the frequency and the potential >>>>>> observation of lower frequency during this delay. But current logic >>>>>> doesn't intentionally (at least to my understanding) try to speed up the >>>>>> request. >>>>>> >>>>>> To help reduce the amount of cpufreq updates and make them more >>>>>> purposeful, consolidate them into these locations: >>>>>> >>>>>> 1. context_switch() >>>>>> 2. task_tick_fair() >>>>>> 3. sched_balance_update_blocked_averages() >>>>>> 4. on sched_setscheduler() syscall that changes policy or uclamp values >>>>>> 5. on check_preempt_wakeup_fair() if wakeup preemption failed >>>>>> 6. on __add_running_bw() to guarantee DL bandwidth requirements. >>>>>> >>>>>> The update at context switch should help guarantee that RT get the right >>>>>> frequency straightaway when they're RUNNING. As mentioned though the >>>>>> update will happen slightly after enqueue_task(); though in an ideal >>>>>> world these tasks should be RUNNING ASAP and this additional delay >>>>>> should be negligible. For fair tasks we need to make sure we send >>>>>> a single update for every decay for the root cfs_rq. Any changes to the >>>>>> rq will be deferred until the next task is ready to run, or we hit TICK. >>>>>> But we are guaranteed the task is running at a level that meets its >>>>>> requirements after enqueue. >>>>>> >>>>>> To guarantee RT and DL tasks updates are never missed, we add a new >>>>>> SCHED_CPUFREQ_FORCE_UPDATE to ignore the rate_limit_us. If we are >>>>>> already running at the right freq, the governor will end up doing >>>>>> nothing, but we eliminate the risk of the task ending up accidentally >>>>>> running at the wrong freq due to rate_limit_us. >>>>>> >>>>>> Similarly for iowait boost, we ignore rate limits. We also handle a case >>>>>> of a boost reset prematurely by adding a guard in sugov_iowait_apply() >>>>>> to reduce the boost after 1ms which seems iowait boost mechanism relied >>>>>> on rate_limit_us and cfs_rq.decayed preventing any updates to happen >>>>>> soon after iowait boost. >>>>>> >>>>>> The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit >>>>>> time stamps otherwise we can end up delaying updates for normal >>>>>> requests. 
>>>>> >>>>> Hi Qais, >>>>> the idea of SCHED_CPUFREQ_FORCE_UPDATE and the possiblity of spamming >>>>> freq updates still bothered me so let me share my thoughts even though >>>>> it might be niche enough for us not to care. >>>>> >>>>> 1. On fast_switch systems, assuming they are fine with handling the >>>>> actual updates, we have a bit more work on each context_switch() and >>>>> some synchronisation, too. That should be fine, if anything there's >>>>> some performance regression in a couple of niche cases. >>>>> >>>>> 2. On !fast_switch systems this gets more interesting IMO. So we have >>>>> a sugov DEADLINE task wakeup for every (in a freq-diff resulting) >>>>> update request. This task will preempt whatever and currently will >>>>> pretty much always be running on the CPU it ran last on (so first CPU >>>>> of the PD). >>>> >>>> The !fast_switch is a bit of concern for me too but not for the same >>>> reason and maybe the opposite of yours IIUC your proposal below: >>>> >>>> With fast_switch we have the following sequence: >>>> >>>> sched_switch() to task A >>>> cpufreq_driver_fast_switch -> write new freq target >>>> run task A >>>> >>>> This is pretty straight forward but we have the following sequence >>>> with !fast_switch >>>> >>>> sched_switch() to task A >>>> queue_irq_work -> raise an IPI on local CPU >>>> Handle IPI -> wakeup and queue sugov dl worker on local CPU (always >>>> with 1 CPU per PD) >>>> sched_switch() to sugov dl task >>>> __cpufreq_driver_target() which can possibly block on a lock >>>> sched_switch() to task A >>>> run task A >>>> >>> >>> sent a bit too early >>> >>>> We can possibly have 2 context switch and one IPi for each "normal" >>>> context switch which is not really optimal >>> >>> It would be good to find a way to skip the spurious back and forth >>> between the normal task and sugov >> >> Just to confirm I understand your concern correctly, that's more or >> less the behavior without Qais' patch as well though, isn't it? >> Ignoring the move from "This happens at enqueue" vs. "this >> happens at context switch". > > without Qais patch, we save a useless context switch to task A Ah right gotcha, the 'mostly' complete switch to task A that is annulled by the switch to sugov dl task is indeed wasteful. > > enqueue task A > queue_irq_work -> raise an IPI on local CPU > Handle IPI -> wakeup and queue sugov dl worker on local CPU (always > with 1 CPU per PD) > sched_switch() to sugov dl task > __cpufreq_driver_target() which can possibly block on a lock > sched_switch() to task A > > >> Since sugov doesn't queue any work if the desired frequency doesn't >> change I imagine it's not too bad? >> Or are you more concerned that the work is queued multiple times >> simultaneously for multiple CPUs of the same PD? If so we can >> work around that with the work_in_progress state to at least limit >> that window by a lot. >> >> [snip]
On 08/13/24 10:27, Vincent Guittot wrote: > On Tue, 13 Aug 2024 at 10:25, Vincent Guittot > <vincent.guittot@linaro.org> wrote: > > > > On Mon, 5 Aug 2024 at 17:35, Christian Loehle <christian.loehle@arm.com> wrote: > > > Hi Qais, > > > the idea of SCHED_CPUFREQ_FORCE_UPDATE and the possiblity of spamming > > > freq updates still bothered me so let me share my thoughts even though > > > it might be niche enough for us not to care. > > > > > > 1. On fast_switch systems, assuming they are fine with handling the > > > actual updates, we have a bit more work on each context_switch() and > > > some synchronisation, too. That should be fine, if anything there's > > > some performance regression in a couple of niche cases. > > > > > > 2. On !fast_switch systems this gets more interesting IMO. So we have > > > a sugov DEADLINE task wakeup for every (in a freq-diff resulting) > > > update request. This task will preempt whatever and currently will > > > pretty much always be running on the CPU it ran last on (so first CPU > > > of the PD). > > > > The !fast_switch is a bit of concern for me too but not for the same > > reason and maybe the opposite of yours IIUC your proposal below: > > > > With fast_switch we have the following sequence: > > > > sched_switch() to task A > > cpufreq_driver_fast_switch -> write new freq target > > run task A > > > > This is pretty straight forward but we have the following sequence > > with !fast_switch > > > > sched_switch() to task A > > queue_irq_work -> raise an IPI on local CPU > > Handle IPI -> wakeup and queue sugov dl worker on local CPU (always > > with 1 CPU per PD) > > sched_switch() to sugov dl task > > __cpufreq_driver_target() which can possibly block on a lock > > sched_switch() to task A > > run task A > > > > sent a bit too early > > > We can possibly have 2 context switch and one IPi for each "normal" > > context switch which is not really optimal > > It would be good to find a way to skip the spurious back and forth > between the normal task and sugov Hmm I think we use affinity to keep the sugov running on policy->related_cpus. Relaxing this will make it less of a problem, but won't eliminate it. I'll have a think about it, is this a blocker for now?
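For context on the affinity point above: the per-policy sugov worker is only created for !fast_switch policies and is bound to the policy's CPUs. Roughly, abridged from mainline sugov_kthread_create() with error handling omitted (shown only to locate where the binding happens, not part of this patch):

static int sugov_kthread_create(struct sugov_policy *sg_policy)
{
	struct cpufreq_policy *policy = sg_policy->policy;
	struct task_struct *thread;

	if (policy->fast_switch_enabled)
		return 0;	/* no worker needed on fast_switch systems */

	kthread_init_work(&sg_policy->work, sugov_work);
	kthread_init_worker(&sg_policy->worker);
	thread = kthread_create(kthread_worker_fn, &sg_policy->worker,
				"sugov:%d", cpumask_first(policy->related_cpus));

	/* ... sched_setattr_nocheck() makes it SCHED_DEADLINE with SCHED_FLAG_SUGOV ... */

	sg_policy->thread = thread;
	/* The affinity being discussed: the worker stays on its own PD */
	kthread_bind_mask(thread, policy->related_cpus);

	wake_up_process(thread);
	return 0;
}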
On 08/13/24 12:02, Vincent Guittot wrote: > > void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags) > > @@ -4913,6 +4923,93 @@ static inline void __balance_callbacks(struct rq *rq) > > > > #endif > > > > +static __always_inline void > > +__update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev) > > +{ > > +#ifdef CONFIG_CPU_FREQ > > + if (prev && prev->dl.flags & SCHED_FLAG_SUGOV) { > > + /* Sugov just did an update, don't be too aggressive */ > > + return; > > + } > > + > > + /* > > + * RT and DL should always send a freq update. But we can do some > > + * simple checks to avoid it when we know it's not necessary. > > + * > > + * iowait_boost will always trigger a freq update too. > > + * > > + * Fair tasks will only trigger an update if the root cfs_rq has > > + * decayed. > > + * > > + * Everything else should do nothing. > > + */ > > + switch (current->policy) { > > + case SCHED_NORMAL: > > + case SCHED_BATCH: > > + case SCHED_IDLE: > > + if (unlikely(current->in_iowait)) { > > + cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE); > > + return; > > + } > > + > > +#ifdef CONFIG_SMP > > + /* > > + * Send an update if we switched from RT or DL as they tend to > > + * boost the CPU and we are likely able to reduce the freq now. > > + */ > > + rq->cfs.decayed |= prev && (rt_policy(prev->policy) || dl_policy(prev->policy)); > > + > > + if (unlikely(rq->cfs.decayed)) { > > My previous use case of a task non preempting current with large > util_est is fixed with this version but I'm facing a new one a bit > similar because of waiting for the context switch and the decay to try > to update the frequency. > > When the task wakes up on an idle cpu, you wait for the decay to > update the freq but if the freq is low and the pelt has been updated > recently (less than 1024us) you can wait a long time before the next > decay and the freq update. This is a problem if the task's util_est is > large because you can stay several ms at low frequency before taking > into account task's util_est It is a symptom of the same problem. It seems we don't decay and we omit the cpufreq update. Why this was not a problem before? AFAICT we only send an update before my patch if we had a decay and I didn't change this condition. Were we just getting more lucky or did I change some behavior unwittingly? The problem with my patch is that I do this unconditional only if we failed preemption check. But looks like I must enforce a cpufreq update after every enqueue. I think the overhead of not checking rq->cfs.decayed would be high if we always call a cpufreq update. I'll just set rq->cfs.decayaed in util_est_enqueue() which should address both use cases. Thanks!
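A minimal sketch of that follow-up, assuming the current shape of util_est_enqueue() and the cfs_rq->decayed flag introduced by this series (illustrative only, not the actual respin). Since enqueue_task_fair() calls util_est_enqueue() on the root cfs_rq, setting the flag there makes the next context switch send an update even when no PELT period boundary has been crossed yet:

static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
				    struct task_struct *p)
{
	unsigned int enqueued;

	if (!sched_feat(UTIL_EST))
		return;

	/* Update root cfs_rq's estimated utilization */
	enqueued  = cfs_rq->avg.util_est;
	enqueued += _task_util_est(p);
	WRITE_ONCE(cfs_rq->avg.util_est, enqueued);

	/*
	 * The enqueued util_est can raise the required frequency even if the
	 * PELT signals haven't decayed, so flag the root cfs_rq to ensure the
	 * next context switch issues a cpufreq update.
	 */
	cfs_rq->decayed = true;

	trace_sched_util_est_cfs_tp(cfs_rq);
}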
On Sun, 1 Sept 2024 at 19:51, Qais Yousef <qyousef@layalina.io> wrote: > > On 08/13/24 10:27, Vincent Guittot wrote: > > On Tue, 13 Aug 2024 at 10:25, Vincent Guittot > > <vincent.guittot@linaro.org> wrote: > > > > > > On Mon, 5 Aug 2024 at 17:35, Christian Loehle <christian.loehle@arm.com> wrote: > > > > Hi Qais, > > > > the idea of SCHED_CPUFREQ_FORCE_UPDATE and the possiblity of spamming > > > > freq updates still bothered me so let me share my thoughts even though > > > > it might be niche enough for us not to care. > > > > > > > > 1. On fast_switch systems, assuming they are fine with handling the > > > > actual updates, we have a bit more work on each context_switch() and > > > > some synchronisation, too. That should be fine, if anything there's > > > > some performance regression in a couple of niche cases. > > > > > > > > 2. On !fast_switch systems this gets more interesting IMO. So we have > > > > a sugov DEADLINE task wakeup for every (in a freq-diff resulting) > > > > update request. This task will preempt whatever and currently will > > > > pretty much always be running on the CPU it ran last on (so first CPU > > > > of the PD). > > > > > > The !fast_switch is a bit of concern for me too but not for the same > > > reason and maybe the opposite of yours IIUC your proposal below: > > > > > > With fast_switch we have the following sequence: > > > > > > sched_switch() to task A > > > cpufreq_driver_fast_switch -> write new freq target > > > run task A > > > > > > This is pretty straight forward but we have the following sequence > > > with !fast_switch > > > > > > sched_switch() to task A > > > queue_irq_work -> raise an IPI on local CPU > > > Handle IPI -> wakeup and queue sugov dl worker on local CPU (always > > > with 1 CPU per PD) > > > sched_switch() to sugov dl task > > > __cpufreq_driver_target() which can possibly block on a lock > > > sched_switch() to task A > > > run task A > > > > > > > sent a bit too early > > > > > We can possibly have 2 context switch and one IPi for each "normal" > > > context switch which is not really optimal > > > > It would be good to find a way to skip the spurious back and forth > > between the normal task and sugov > > Hmm I think we use affinity to keep the sugov running on policy->related_cpus. > Relaxing this will make it less of a problem, but won't eliminate it. yes, but it's not a problem of relaxing affinity here The problem is that the 1st switch to task A will be preempted by sugov so the 1st switch is useless. You should call cpufreq_update before switching to A so that we skip the useless switch to task A and directly switch to sugov 1st then task A > > I'll have a think about it, is this a blocker for now? > >
On 9/2/24 13:30, Vincent Guittot wrote: > On Sun, 1 Sept 2024 at 19:51, Qais Yousef <qyousef@layalina.io> wrote: >> >> On 08/13/24 10:27, Vincent Guittot wrote: >>> On Tue, 13 Aug 2024 at 10:25, Vincent Guittot >>> <vincent.guittot@linaro.org> wrote: >>>> >>>> On Mon, 5 Aug 2024 at 17:35, Christian Loehle <christian.loehle@arm.com> wrote: >>>>> Hi Qais, >>>>> the idea of SCHED_CPUFREQ_FORCE_UPDATE and the possiblity of spamming >>>>> freq updates still bothered me so let me share my thoughts even though >>>>> it might be niche enough for us not to care. >>>>> >>>>> 1. On fast_switch systems, assuming they are fine with handling the >>>>> actual updates, we have a bit more work on each context_switch() and >>>>> some synchronisation, too. That should be fine, if anything there's >>>>> some performance regression in a couple of niche cases. >>>>> >>>>> 2. On !fast_switch systems this gets more interesting IMO. So we have >>>>> a sugov DEADLINE task wakeup for every (in a freq-diff resulting) >>>>> update request. This task will preempt whatever and currently will >>>>> pretty much always be running on the CPU it ran last on (so first CPU >>>>> of the PD). >>>> >>>> The !fast_switch is a bit of concern for me too but not for the same >>>> reason and maybe the opposite of yours IIUC your proposal below: >>>> >>>> With fast_switch we have the following sequence: >>>> >>>> sched_switch() to task A >>>> cpufreq_driver_fast_switch -> write new freq target >>>> run task A >>>> >>>> This is pretty straight forward but we have the following sequence >>>> with !fast_switch >>>> >>>> sched_switch() to task A >>>> queue_irq_work -> raise an IPI on local CPU >>>> Handle IPI -> wakeup and queue sugov dl worker on local CPU (always >>>> with 1 CPU per PD) >>>> sched_switch() to sugov dl task >>>> __cpufreq_driver_target() which can possibly block on a lock >>>> sched_switch() to task A >>>> run task A >>>> >>> >>> sent a bit too early >>> >>>> We can possibly have 2 context switch and one IPi for each "normal" >>>> context switch which is not really optimal >>> >>> It would be good to find a way to skip the spurious back and forth >>> between the normal task and sugov >> >> Hmm I think we use affinity to keep the sugov running on policy->related_cpus. >> Relaxing this will make it less of a problem, but won't eliminate it. > > yes, but it's not a problem of relaxing affinity here > > The problem is that the 1st switch to task A will be preempted by > sugov so the 1st switch is useless. You should call cpufreq_update > before switching to A so that we skip the useless switch to task A and > directly switch to sugov 1st then task A Not necessarily, if you relax them to all CPUs the sugov tasks for all PDs can run on CPU0, no matter which CPU task A is on. There is some benefit on having all sugov threads on the littles for many platforms (i.e. restricting them), making a good mainline policy out of it isn't quite so obvious to me.
On Mon, 2 Sept 2024 at 14:35, Christian Loehle <christian.loehle@arm.com> wrote: > > On 9/2/24 13:30, Vincent Guittot wrote: > > On Sun, 1 Sept 2024 at 19:51, Qais Yousef <qyousef@layalina.io> wrote: > >> > >> On 08/13/24 10:27, Vincent Guittot wrote: > >>> On Tue, 13 Aug 2024 at 10:25, Vincent Guittot > >>> <vincent.guittot@linaro.org> wrote: > >>>> > >>>> On Mon, 5 Aug 2024 at 17:35, Christian Loehle <christian.loehle@arm.com> wrote: > >>>>> Hi Qais, > >>>>> the idea of SCHED_CPUFREQ_FORCE_UPDATE and the possiblity of spamming > >>>>> freq updates still bothered me so let me share my thoughts even though > >>>>> it might be niche enough for us not to care. > >>>>> > >>>>> 1. On fast_switch systems, assuming they are fine with handling the > >>>>> actual updates, we have a bit more work on each context_switch() and > >>>>> some synchronisation, too. That should be fine, if anything there's > >>>>> some performance regression in a couple of niche cases. > >>>>> > >>>>> 2. On !fast_switch systems this gets more interesting IMO. So we have > >>>>> a sugov DEADLINE task wakeup for every (in a freq-diff resulting) > >>>>> update request. This task will preempt whatever and currently will > >>>>> pretty much always be running on the CPU it ran last on (so first CPU > >>>>> of the PD). > >>>> > >>>> The !fast_switch is a bit of concern for me too but not for the same > >>>> reason and maybe the opposite of yours IIUC your proposal below: > >>>> > >>>> With fast_switch we have the following sequence: > >>>> > >>>> sched_switch() to task A > >>>> cpufreq_driver_fast_switch -> write new freq target > >>>> run task A > >>>> > >>>> This is pretty straight forward but we have the following sequence > >>>> with !fast_switch > >>>> > >>>> sched_switch() to task A > >>>> queue_irq_work -> raise an IPI on local CPU > >>>> Handle IPI -> wakeup and queue sugov dl worker on local CPU (always > >>>> with 1 CPU per PD) > >>>> sched_switch() to sugov dl task > >>>> __cpufreq_driver_target() which can possibly block on a lock > >>>> sched_switch() to task A > >>>> run task A > >>>> > >>> > >>> sent a bit too early > >>> > >>>> We can possibly have 2 context switch and one IPi for each "normal" > >>>> context switch which is not really optimal > >>> > >>> It would be good to find a way to skip the spurious back and forth > >>> between the normal task and sugov > >> > >> Hmm I think we use affinity to keep the sugov running on policy->related_cpus. > >> Relaxing this will make it less of a problem, but won't eliminate it. > > > > yes, but it's not a problem of relaxing affinity here > > > > The problem is that the 1st switch to task A will be preempted by > > sugov so the 1st switch is useless. You should call cpufreq_update > > before switching to A so that we skip the useless switch to task A and > > directly switch to sugov 1st then task A > > Not necessarily, if you relax them to all CPUs the sugov tasks for all PDs > can run on CPU0, no matter which CPU task A is on. You can't always run sugov on other CPUs than those impacted by the freq change > > There is some benefit on having all sugov threads on the littles for many > platforms (i.e. restricting them), making a good mainline policy out of > it isn't quite so obvious to me. This is a particular case whereas we want this to work in all cases >
On 09/02/24 14:30, Vincent Guittot wrote: > On Sun, 1 Sept 2024 at 19:51, Qais Yousef <qyousef@layalina.io> wrote: > > > > On 08/13/24 10:27, Vincent Guittot wrote: > > > On Tue, 13 Aug 2024 at 10:25, Vincent Guittot > > > <vincent.guittot@linaro.org> wrote: > > > > > > > > On Mon, 5 Aug 2024 at 17:35, Christian Loehle <christian.loehle@arm.com> wrote: > > > > > Hi Qais, > > > > > the idea of SCHED_CPUFREQ_FORCE_UPDATE and the possiblity of spamming > > > > > freq updates still bothered me so let me share my thoughts even though > > > > > it might be niche enough for us not to care. > > > > > > > > > > 1. On fast_switch systems, assuming they are fine with handling the > > > > > actual updates, we have a bit more work on each context_switch() and > > > > > some synchronisation, too. That should be fine, if anything there's > > > > > some performance regression in a couple of niche cases. > > > > > > > > > > 2. On !fast_switch systems this gets more interesting IMO. So we have > > > > > a sugov DEADLINE task wakeup for every (in a freq-diff resulting) > > > > > update request. This task will preempt whatever and currently will > > > > > pretty much always be running on the CPU it ran last on (so first CPU > > > > > of the PD). > > > > > > > > The !fast_switch is a bit of concern for me too but not for the same > > > > reason and maybe the opposite of yours IIUC your proposal below: > > > > > > > > With fast_switch we have the following sequence: > > > > > > > > sched_switch() to task A > > > > cpufreq_driver_fast_switch -> write new freq target > > > > run task A > > > > > > > > This is pretty straight forward but we have the following sequence > > > > with !fast_switch > > > > > > > > sched_switch() to task A > > > > queue_irq_work -> raise an IPI on local CPU > > > > Handle IPI -> wakeup and queue sugov dl worker on local CPU (always > > > > with 1 CPU per PD) > > > > sched_switch() to sugov dl task > > > > __cpufreq_driver_target() which can possibly block on a lock > > > > sched_switch() to task A > > > > run task A > > > > > > > > > > sent a bit too early > > > > > > > We can possibly have 2 context switch and one IPi for each "normal" > > > > context switch which is not really optimal > > > > > > It would be good to find a way to skip the spurious back and forth > > > between the normal task and sugov > > > > Hmm I think we use affinity to keep the sugov running on policy->related_cpus. > > Relaxing this will make it less of a problem, but won't eliminate it. > > yes, but it's not a problem of relaxing affinity here If we have 1 CPU per PD, then relaxing affinity will allow it to run anywhere. I am just not sure this will be safe on all platforms, of course. But yeah, I don't think this is a solution anyway, just the simplest thing to make it harder to hit. > The problem is that the 1st switch to task A will be preempted by > sugov so the 1st switch is useless. You should call cpufreq_update > before switching to A so that we skip the useless switch to task A and > directly switch to sugov 1st then task A Can we do this safely after we pick task A, but before we do the actual context switch? One of the reasons I put this so late is that there's a late call to balance_callbacks() that can impact the state of the rq and is important to take into account based on my previous testing and analysis. Any reason we need to run the sugov worker as DL instead for example being a softirq?
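To make the placement question concrete, the idea under discussion would sit roughly as in the sketch below (pseudo-code against __schedule(), purely illustrative and not a tested change; whether it is safe given the rq lock and balance-callback ordering, and whether it is early enough for pick_next_task() to select the sugov worker instead of task A, is exactly the open question here):

static void __sched notrace __schedule(unsigned int sched_mode)
{
	/* ... */
	next = pick_next_task(rq, prev, &rf);
	/* ... */
	if (likely(prev != next)) {
		/*
		 * Hypothetical placement under discussion: request the cpufreq
		 * update after the pick but before the actual switch, instead
		 * of in finish_lock_switch() as the patch currently does.
		 */
		__update_cpufreq_ctx_switch(rq, prev);

		/* Also unlocks the rq: */
		rq = context_switch(rq, prev, next, &rf);
	}
	/* ... */
}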
On 09/02/24 13:58, Qais Yousef wrote: > On 09/02/24 14:30, Vincent Guittot wrote: > > On Sun, 1 Sept 2024 at 19:51, Qais Yousef <qyousef@layalina.io> wrote: > > > > > > On 08/13/24 10:27, Vincent Guittot wrote: > > > > On Tue, 13 Aug 2024 at 10:25, Vincent Guittot > > > > <vincent.guittot@linaro.org> wrote: > > > > > > > > > > On Mon, 5 Aug 2024 at 17:35, Christian Loehle <christian.loehle@arm.com> wrote: > > > > > > Hi Qais, > > > > > > the idea of SCHED_CPUFREQ_FORCE_UPDATE and the possiblity of spamming > > > > > > freq updates still bothered me so let me share my thoughts even though > > > > > > it might be niche enough for us not to care. > > > > > > > > > > > > 1. On fast_switch systems, assuming they are fine with handling the > > > > > > actual updates, we have a bit more work on each context_switch() and > > > > > > some synchronisation, too. That should be fine, if anything there's > > > > > > some performance regression in a couple of niche cases. > > > > > > > > > > > > 2. On !fast_switch systems this gets more interesting IMO. So we have > > > > > > a sugov DEADLINE task wakeup for every (in a freq-diff resulting) > > > > > > update request. This task will preempt whatever and currently will > > > > > > pretty much always be running on the CPU it ran last on (so first CPU > > > > > > of the PD). > > > > > > > > > > The !fast_switch is a bit of concern for me too but not for the same > > > > > reason and maybe the opposite of yours IIUC your proposal below: > > > > > > > > > > With fast_switch we have the following sequence: > > > > > > > > > > sched_switch() to task A > > > > > cpufreq_driver_fast_switch -> write new freq target > > > > > run task A > > > > > > > > > > This is pretty straight forward but we have the following sequence > > > > > with !fast_switch > > > > > > > > > > sched_switch() to task A > > > > > queue_irq_work -> raise an IPI on local CPU > > > > > Handle IPI -> wakeup and queue sugov dl worker on local CPU (always > > > > > with 1 CPU per PD) > > > > > sched_switch() to sugov dl task > > > > > __cpufreq_driver_target() which can possibly block on a lock > > > > > sched_switch() to task A > > > > > run task A > > > > > > > > > > > > > sent a bit too early > > > > > > > > > We can possibly have 2 context switch and one IPi for each "normal" > > > > > context switch which is not really optimal > > > > > > > > It would be good to find a way to skip the spurious back and forth > > > > between the normal task and sugov > > > > > > Hmm I think we use affinity to keep the sugov running on policy->related_cpus. > > > Relaxing this will make it less of a problem, but won't eliminate it. > > > > yes, but it's not a problem of relaxing affinity here > > If we have 1 CPU per PD, then relaxing affinity will allow it to run anywhere. > I am just this will be safe on all platforms of course. > > But yeah, I don't think this is a solution anyway but the simplest thing to > make it harder to hit. > > > The problem is that the 1st switch to task A will be preempted by > > sugov so the 1st switch is useless. You should call cpufreq_update > > before switching to A so that we skip the useless switch to task A and > > directly switch to sugov 1st then task A > > Can we do this safely after we pick task A, but before we do the actual context > switch? One of the reasons I put this too late is because there's a late call > to balance_calbacks() that can impact the state of the rq and important to take > into account based on my previous testing and analysis. 
>
> Any reason we need to run the sugov worker as DL instead for example being
> a softirq?

I assume it performs non interrupt context safe operations. But I don't think
I've ever seen it sleep during an activation.
On Mon, 2 Sept 2024 at 14:58, Qais Yousef <qyousef@layalina.io> wrote: > > On 09/02/24 14:30, Vincent Guittot wrote: > > On Sun, 1 Sept 2024 at 19:51, Qais Yousef <qyousef@layalina.io> wrote: > > > > > > On 08/13/24 10:27, Vincent Guittot wrote: > > > > On Tue, 13 Aug 2024 at 10:25, Vincent Guittot > > > > <vincent.guittot@linaro.org> wrote: > > > > > > > > > > On Mon, 5 Aug 2024 at 17:35, Christian Loehle <christian.loehle@arm.com> wrote: > > > > > > Hi Qais, > > > > > > the idea of SCHED_CPUFREQ_FORCE_UPDATE and the possiblity of spamming > > > > > > freq updates still bothered me so let me share my thoughts even though > > > > > > it might be niche enough for us not to care. > > > > > > > > > > > > 1. On fast_switch systems, assuming they are fine with handling the > > > > > > actual updates, we have a bit more work on each context_switch() and > > > > > > some synchronisation, too. That should be fine, if anything there's > > > > > > some performance regression in a couple of niche cases. > > > > > > > > > > > > 2. On !fast_switch systems this gets more interesting IMO. So we have > > > > > > a sugov DEADLINE task wakeup for every (in a freq-diff resulting) > > > > > > update request. This task will preempt whatever and currently will > > > > > > pretty much always be running on the CPU it ran last on (so first CPU > > > > > > of the PD). > > > > > > > > > > The !fast_switch is a bit of concern for me too but not for the same > > > > > reason and maybe the opposite of yours IIUC your proposal below: > > > > > > > > > > With fast_switch we have the following sequence: > > > > > > > > > > sched_switch() to task A > > > > > cpufreq_driver_fast_switch -> write new freq target > > > > > run task A > > > > > > > > > > This is pretty straight forward but we have the following sequence > > > > > with !fast_switch > > > > > > > > > > sched_switch() to task A > > > > > queue_irq_work -> raise an IPI on local CPU > > > > > Handle IPI -> wakeup and queue sugov dl worker on local CPU (always > > > > > with 1 CPU per PD) > > > > > sched_switch() to sugov dl task > > > > > __cpufreq_driver_target() which can possibly block on a lock > > > > > sched_switch() to task A > > > > > run task A > > > > > > > > > > > > > sent a bit too early > > > > > > > > > We can possibly have 2 context switch and one IPi for each "normal" > > > > > context switch which is not really optimal > > > > > > > > It would be good to find a way to skip the spurious back and forth > > > > between the normal task and sugov > > > > > > Hmm I think we use affinity to keep the sugov running on policy->related_cpus. > > > Relaxing this will make it less of a problem, but won't eliminate it. > > > > yes, but it's not a problem of relaxing affinity here > > If we have 1 CPU per PD, then relaxing affinity will allow it to run anywhere. > I am just this will be safe on all platforms of course. > > But yeah, I don't think this is a solution anyway but the simplest thing to > make it harder to hit. > > > The problem is that the 1st switch to task A will be preempted by > > sugov so the 1st switch is useless. You should call cpufreq_update > > before switching to A so that we skip the useless switch to task A and > > directly switch to sugov 1st then task A > > Can we do this safely after we pick task A, but before we do the actual context > switch? 
> One of the reasons I put this too late is because there's a late call
> to balance_calbacks() that can impact the state of the rq and important to take
> into account based on my previous testing and analysis.

I don't have all cases in mind and it would need more thinking, but
this should be doable.

>
> Any reason we need to run the sugov worker as DL instead for example being
> a softirq?

sugov can sleep.
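To make the cost argument concrete, here is a toy userspace model of the two sequences laid out above (illustrative only, not kernel code; all names are made up): the fast_switch path writes the frequency target inline during the switch, while the !fast_switch path adds an IPI plus two extra context switches for every update that reaches the hardware.

#include <stdio.h>

/* Counters standing in for the events in the two sequences described above. */
static int context_switches, ipis, freq_writes;

static void write_freq_target(void) { freq_writes++; }

/* fast_switch: the frequency write happens inline while switching to task A. */
static void switch_to_task_fast(void)
{
	context_switches++;     /* sched_switch() to task A */
	write_freq_target();    /* cpufreq_driver_fast_switch() */
}

/* !fast_switch: IPI, wake the sugov worker, switch to it, then back to A. */
static void switch_to_task_slow(void)
{
	context_switches++;     /* sched_switch() to task A */
	ipis++;                 /* queue_irq_work() on the local CPU */
	context_switches++;     /* sched_switch() to the sugov DL task */
	write_freq_target();    /* __cpufreq_driver_target(), may block */
	context_switches++;     /* sched_switch() back to task A */
}

int main(void)
{
	switch_to_task_fast();
	printf("fast_switch : %d switches, %d IPIs, %d freq writes\n",
	       context_switches, ipis, freq_writes);

	context_switches = ipis = freq_writes = 0;
	switch_to_task_slow();
	printf("!fast_switch: %d switches, %d IPIs, %d freq writes\n",
	       context_switches, ipis, freq_writes);
	return 0;
}

Vincent's suggestion amounts to reordering the slow path so the CPU switches to the sugov worker first and to task A only once, saving the wasted first switch.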
On 9/2/24 14:34, Qais Yousef wrote: > On 09/02/24 13:58, Qais Yousef wrote: >> On 09/02/24 14:30, Vincent Guittot wrote: >>> On Sun, 1 Sept 2024 at 19:51, Qais Yousef <qyousef@layalina.io> wrote: >>>> >>>> On 08/13/24 10:27, Vincent Guittot wrote: >>>>> On Tue, 13 Aug 2024 at 10:25, Vincent Guittot >>>>> <vincent.guittot@linaro.org> wrote: >>>>>> >>>>>> On Mon, 5 Aug 2024 at 17:35, Christian Loehle <christian.loehle@arm.com> wrote: >>>>>>> Hi Qais, >>>>>>> the idea of SCHED_CPUFREQ_FORCE_UPDATE and the possiblity of spamming >>>>>>> freq updates still bothered me so let me share my thoughts even though >>>>>>> it might be niche enough for us not to care. >>>>>>> >>>>>>> 1. On fast_switch systems, assuming they are fine with handling the >>>>>>> actual updates, we have a bit more work on each context_switch() and >>>>>>> some synchronisation, too. That should be fine, if anything there's >>>>>>> some performance regression in a couple of niche cases. >>>>>>> >>>>>>> 2. On !fast_switch systems this gets more interesting IMO. So we have >>>>>>> a sugov DEADLINE task wakeup for every (in a freq-diff resulting) >>>>>>> update request. This task will preempt whatever and currently will >>>>>>> pretty much always be running on the CPU it ran last on (so first CPU >>>>>>> of the PD). >>>>>> >>>>>> The !fast_switch is a bit of concern for me too but not for the same >>>>>> reason and maybe the opposite of yours IIUC your proposal below: >>>>>> >>>>>> With fast_switch we have the following sequence: >>>>>> >>>>>> sched_switch() to task A >>>>>> cpufreq_driver_fast_switch -> write new freq target >>>>>> run task A >>>>>> >>>>>> This is pretty straight forward but we have the following sequence >>>>>> with !fast_switch >>>>>> >>>>>> sched_switch() to task A >>>>>> queue_irq_work -> raise an IPI on local CPU >>>>>> Handle IPI -> wakeup and queue sugov dl worker on local CPU (always >>>>>> with 1 CPU per PD) >>>>>> sched_switch() to sugov dl task >>>>>> __cpufreq_driver_target() which can possibly block on a lock >>>>>> sched_switch() to task A >>>>>> run task A >>>>>> >>>>> >>>>> sent a bit too early >>>>> >>>>>> We can possibly have 2 context switch and one IPi for each "normal" >>>>>> context switch which is not really optimal >>>>> >>>>> It would be good to find a way to skip the spurious back and forth >>>>> between the normal task and sugov >>>> >>>> Hmm I think we use affinity to keep the sugov running on policy->related_cpus. >>>> Relaxing this will make it less of a problem, but won't eliminate it. >>> >>> yes, but it's not a problem of relaxing affinity here >> >> If we have 1 CPU per PD, then relaxing affinity will allow it to run anywhere. >> I am just this will be safe on all platforms of course. >> >> But yeah, I don't think this is a solution anyway but the simplest thing to >> make it harder to hit. >> >>> The problem is that the 1st switch to task A will be preempted by >>> sugov so the 1st switch is useless. You should call cpufreq_update >>> before switching to A so that we skip the useless switch to task A and >>> directly switch to sugov 1st then task A >> >> Can we do this safely after we pick task A, but before we do the actual context >> switch? One of the reasons I put this too late is because there's a late call >> to balance_calbacks() that can impact the state of the rq and important to take >> into account based on my previous testing and analysis. >> >> Any reason we need to run the sugov worker as DL instead for example being >> a softirq? 
>
> I assume it performs non interrupt context safe operations. But I don't think
> I've ever seen it sleep during an activation.

That is the distinction between fast_switch and slow_switch though, isn't it?
See the documentation of cpufreq_driver_fast_switch():

 * Carry out a fast frequency switch without sleeping.
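For reference, that distinction is the branch schedutil already takes when committing a frequency (the same if/else is visible in the hunk quoted further down in this thread): with fast_switch enabled the driver call must not sleep, otherwise the request is deferred to the sugov worker, which may block. A stripped-down standalone sketch with stubbed stand-ins for the driver calls:

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the real driver entry points (illustrative only). */
static void driver_fast_switch(unsigned int khz)
{
	/* Must complete without sleeping, e.g. a register or MSR write. */
	printf("fast switch to %u kHz (atomic context)\n", khz);
}

static void deferred_update(unsigned int khz)
{
	/* Hands the request to the per-policy sugov worker, which may sleep. */
	printf("defer %u kHz to the sugov worker (may block)\n", khz);
}

/* Mirrors the fast_switch_enabled branch in the schedutil update path. */
static void commit_freq(bool fast_switch_enabled, unsigned int next_khz)
{
	if (fast_switch_enabled)
		driver_fast_switch(next_khz);
	else
		deferred_update(next_khz);
}

int main(void)
{
	commit_freq(true, 1800000);
	commit_freq(false, 1800000);
	return 0;
}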
On 09/02/24 15:36, Vincent Guittot wrote:
> > If we have 1 CPU per PD, then relaxing affinity will allow it to run anywhere.
> > I am just this will be safe on all platforms of course.
> >
> > But yeah, I don't think this is a solution anyway but the simplest thing to
> > make it harder to hit.
> >
> > > The problem is that the 1st switch to task A will be preempted by
> > > sugov so the 1st switch is useless. You should call cpufreq_update
> > > before switching to A so that we skip the useless switch to task A and
> > > directly switch to sugov 1st then task A
> >
> > Can we do this safely after we pick task A, but before we do the actual context
> > switch? One of the reasons I put this too late is because there's a late call
> > to balance_calbacks() that can impact the state of the rq and important to take
> > into account based on my previous testing and analysis.
>
> I don't have all cases in mind and it would need more thinking, but
> this should be doable.
>
> >
> > Any reason we need to run the sugov worker as DL instead for example being
> > a softirq?
>
> sugov can sleep.

Hmm. I thought the biggest worry was that this operation requires a synchronous
call to talk to hw/fw to change the frequency, which can be slow, and we don't
want that to happen from the scheduler fast path with all the critical locks
held.

If we sleep, then the sugov DL task will experience multiple context switches
to perform its work. So it is very slow anyway.

IMO refactoring the code so we allow drivers that don't sleep to run from
softirq context to speed things up and avoid any context switches is the
sensible optimization to do.

Drivers that sleep will experience significant delays and I'm not seeing the
point of optimizing away an additional context switch. I don't see that we need
to go out of our way to accommodate these slow platforms. But give them the
option to move to something better if they manage to write their code to avoid
sleeps.

Would this be a suitable option to move forward?

FWIW I did test this on pixel 6 which is !fast_switch and benchmark scores are
either the same or better (less frame drops in UI particularly).
On Mon, 2 Sept 2024 at 22:43, Qais Yousef <qyousef@layalina.io> wrote: > > On 09/02/24 15:36, Vincent Guittot wrote: > > > > If we have 1 CPU per PD, then relaxing affinity will allow it to run anywhere. > > > I am just this will be safe on all platforms of course. > > > > > > But yeah, I don't think this is a solution anyway but the simplest thing to > > > make it harder to hit. > > > > > > > The problem is that the 1st switch to task A will be preempted by > > > > sugov so the 1st switch is useless. You should call cpufreq_update > > > > before switching to A so that we skip the useless switch to task A and > > > > directly switch to sugov 1st then task A > > > > > > Can we do this safely after we pick task A, but before we do the actual context > > > switch? One of the reasons I put this too late is because there's a late call > > > to balance_calbacks() that can impact the state of the rq and important to take > > > into account based on my previous testing and analysis. > > > > I don't have all cases in mind and it would need more thinking but > > this should be doable > > > > > > > > Any reason we need to run the sugov worker as DL instead for example being > > > a softirq? > > > > sugov can sleep > > Hmm. I thought the biggest worry is about this operation requires synchronous > operation to talk to hw/fw to change the frequency which can be slow and we > don't want this to happen from the scheduler fast path with all the critical > locks held. > > If we sleep, then the sugov DL task will experience multiple context switches > to perform its work. So it is very slow anyway. A good reason to not add more > > IMO refactoring the code so we allow drivers that don't sleep to run from > softirq context to speed things up and avoid any context switches is the > sensible optimization to do. AFAICT, only cpufreq fast_switch is known to be atomic, others can take a lock and as a result sleep so it's not possible. Please use fast_switch in this case but not softirq which can end up in a daemon anyway. > > Drivers that sleep will experience significant delays and I'm not seeing the > point of optimizing an additional context switch. I don't see we need to get No, They don't have spurious wakeups now, your patch is adding one more. I don't see why they should accept spurious context switch > out of our way to accommodate these slow platforms. But give them the option to > move to something better if they manage to write their code to avoid sleeps. > > Would this be a suitable option to move forward? > > FWIW I did test this on pixel 6 which !fast_switch and benchmark scores are > either the same or better (less frame drops in UI particularly).
On Sun, 1 Sept 2024 at 20:01, Qais Yousef <qyousef@layalina.io> wrote: > > On 08/13/24 12:02, Vincent Guittot wrote: > > > void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags) > > > @@ -4913,6 +4923,93 @@ static inline void __balance_callbacks(struct rq *rq) > > > > > > #endif > > > > > > +static __always_inline void > > > +__update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev) > > > +{ > > > +#ifdef CONFIG_CPU_FREQ > > > + if (prev && prev->dl.flags & SCHED_FLAG_SUGOV) { > > > + /* Sugov just did an update, don't be too aggressive */ > > > + return; > > > + } > > > + > > > + /* > > > + * RT and DL should always send a freq update. But we can do some > > > + * simple checks to avoid it when we know it's not necessary. > > > + * > > > + * iowait_boost will always trigger a freq update too. > > > + * > > > + * Fair tasks will only trigger an update if the root cfs_rq has > > > + * decayed. > > > + * > > > + * Everything else should do nothing. > > > + */ > > > + switch (current->policy) { > > > + case SCHED_NORMAL: > > > + case SCHED_BATCH: > > > + case SCHED_IDLE: > > > + if (unlikely(current->in_iowait)) { > > > + cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE); > > > + return; > > > + } > > > + > > > +#ifdef CONFIG_SMP > > > + /* > > > + * Send an update if we switched from RT or DL as they tend to > > > + * boost the CPU and we are likely able to reduce the freq now. > > > + */ > > > + rq->cfs.decayed |= prev && (rt_policy(prev->policy) || dl_policy(prev->policy)); > > > + > > > + if (unlikely(rq->cfs.decayed)) { > > > > My previous use case of a task non preempting current with large > > util_est is fixed with this version but I'm facing a new one a bit > > similar because of waiting for the context switch and the decay to try > > to update the frequency. > > > > When the task wakes up on an idle cpu, you wait for the decay to > > update the freq but if the freq is low and the pelt has been updated > > recently (less than 1024us) you can wait a long time before the next > > decay and the freq update. This is a problem if the task's util_est is > > large because you can stay several ms at low frequency before taking > > into account task's util_est > > It is a symptom of the same problem. It seems we don't decay and we omit the > cpufreq update. > > Why this was not a problem before? AFAICT we only send an update before my > patch if we had a decay and I didn't change this condition. Were we just > getting more lucky or did I change some behavior unwittingly? I'm not able to reproduce this behavior with current code but I don't know why the behavior is different to be honest > > The problem with my patch is that I do this unconditional only if we failed > preemption check. But looks like I must enforce a cpufreq update after every > enqueue. I think the overhead of not checking rq->cfs.decayed would be high if > we always call a cpufreq update. > > I'll just set rq->cfs.decayaed in util_est_enqueue() which should address both > use cases. > > Thanks!
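A small userspace model of the failure mode described above (illustrative only, not the scheduler code): the context-switch update is gated on the root cfs_rq "decayed" flag, so a waking task with a large util_est whose enqueue does not decay the root cfs_rq can leave the CPU at a low OPP until the next decay or tick; setting the flag from util_est_enqueue(), as proposed, makes the very next context switch send the request.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-in for the state the governor reacts to. */
struct cpu_state {
	bool decayed;           /* root cfs_rq "decayed" flag */
	unsigned int cur_opp;   /* frequency the CPU currently runs at */
	unsigned int wanted;    /* what utilization (incl. util_est) asks for */
};

/* Context-switch hook: only forwards a request when the flag is set. */
static void ctx_switch_update(struct cpu_state *c)
{
	if (!c->decayed)
		return;
	c->decayed = false;
	c->cur_opp = c->wanted;         /* cpufreq_update_util() */
}

static void enqueue_big_util_est_task(struct cpu_state *c, bool flag_on_enqueue)
{
	c->wanted = 1800;               /* large util_est wants a high OPP */
	if (flag_on_enqueue)
		c->decayed = true;      /* proposed: set from util_est_enqueue() */
}

int main(void)
{
	struct cpu_state c = { .decayed = false, .cur_opp = 500, .wanted = 500 };

	enqueue_big_util_est_task(&c, false);
	ctx_switch_update(&c);
	printf("without the fix: still at %u (waits for the next decay/tick)\n", c.cur_opp);

	enqueue_big_util_est_task(&c, true);
	ctx_switch_update(&c);
	printf("with the fix   : now at %u\n", c.cur_opp);
	return 0;
}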
On 7/28/24 19:45, Qais Yousef wrote: > Improve the interaction with cpufreq governors by making the > cpufreq_update_util() calls more intentional. > > At the moment we send them when load is updated for CFS, bandwidth for > DL and at enqueue/dequeue for RT. But this can lead to too many updates > sent in a short period of time and potentially be ignored at a critical > moment due to the rate_limit_us in schedutil. > > For example, simultaneous task enqueue on the CPU where 2nd task is > bigger and requires higher freq. The trigger to cpufreq_update_util() by > the first task will lead to dropping the 2nd request until tick. Or > another CPU in the same policy triggers a freq update shortly after. > > Updates at enqueue for RT are not strictly required. Though they do help > to reduce the delay for switching the frequency and the potential > observation of lower frequency during this delay. But current logic > doesn't intentionally (at least to my understanding) try to speed up the > request. > > To help reduce the amount of cpufreq updates and make them more > purposeful, consolidate them into these locations: > > 1. context_switch() > 2. task_tick_fair() > 3. sched_balance_update_blocked_averages() > 4. on sched_setscheduler() syscall that changes policy or uclamp values > 5. on check_preempt_wakeup_fair() if wakeup preemption failed > 6. on __add_running_bw() to guarantee DL bandwidth requirements. > Actually now reading that code again reminded me, there is another iowait boost change for intel_pstate. intel_pstate has either intel_pstate_update_util() or intel_pstate_update_util_hwp(). Both have if (smp_processor_id() != cpu->cpu) return; Now since we move that update from enqueue to context_switch() that will always be false. I don't think that was deliberate but rather to simplify intel_pstate synchronization, although !mcq device IO won't be boosted which you could argue is good. Just wanted to mention that, doesn't have to be a bad, but surely some behavior change.
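A minimal model of the behaviour change pointed out above (illustrative, not the intel_pstate code): the callback bails out when it runs on a CPU other than the one it was registered for. Issued at enqueue time the callback can run on the waker's CPU and be filtered; issued at context switch it always runs on the local (target) CPU, so the check never filters anything.

#include <stdbool.h>
#include <stdio.h>

/* Mirrors the "smp_processor_id() != cpu->cpu" early return quoted above. */
static bool update_considered(int running_cpu, int registered_cpu)
{
	if (running_cpu != registered_cpu)
		return false;   /* filtered out */
	return true;
}

int main(void)
{
	/* Enqueue call site: a remote wakeup runs the hook on the waker's CPU. */
	printf("enqueue for CPU 3, hook runs on CPU 1: %s\n",
	       update_considered(1, 3) ? "considered" : "filtered");

	/* Context-switch call site: always the CPU doing the switch. */
	printf("context switch on CPU 3:               %s\n",
	       update_considered(3, 3) ? "considered" : "filtered");
	return 0;
}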
On Wed, Sep 11, 2024 at 10:34 PM Christian Loehle <christian.loehle@arm.com> wrote: > > On 7/28/24 19:45, Qais Yousef wrote: > > Improve the interaction with cpufreq governors by making the > > cpufreq_update_util() calls more intentional. > > > > At the moment we send them when load is updated for CFS, bandwidth for > > DL and at enqueue/dequeue for RT. But this can lead to too many updates > > sent in a short period of time and potentially be ignored at a critical > > moment due to the rate_limit_us in schedutil. > > > > For example, simultaneous task enqueue on the CPU where 2nd task is > > bigger and requires higher freq. The trigger to cpufreq_update_util() by > > the first task will lead to dropping the 2nd request until tick. Or > > another CPU in the same policy triggers a freq update shortly after. > > > > Updates at enqueue for RT are not strictly required. Though they do help > > to reduce the delay for switching the frequency and the potential > > observation of lower frequency during this delay. But current logic > > doesn't intentionally (at least to my understanding) try to speed up the > > request. > > > > To help reduce the amount of cpufreq updates and make them more > > purposeful, consolidate them into these locations: > > > > 1. context_switch() > > 2. task_tick_fair() > > 3. sched_balance_update_blocked_averages() > > 4. on sched_setscheduler() syscall that changes policy or uclamp values > > 5. on check_preempt_wakeup_fair() if wakeup preemption failed > > 6. on __add_running_bw() to guarantee DL bandwidth requirements. > > > > Actually now reading that code again reminded me, there is another > iowait boost change for intel_pstate. > intel_pstate has either intel_pstate_update_util() or > intel_pstate_update_util_hwp(). > Both have > if (smp_processor_id() != cpu->cpu) > return; > Now since we move that update from enqueue to context_switch() that will > always be false. > I don't think that was deliberate but rather to simplify intel_pstate > synchronization, although !mcq device IO won't be boosted which you > could argue is good. > Just wanted to mention that, doesn't have to be a bad, but surely some > behavior change. This particular change shouldn't be problematic.
Hi, I tested this patch to see if it causes any regressions on bare-metal
power9 systems with microbenchmarks. The test system is a 2 NUMA node 128 cpu
powernv power9 system. The conservative governor is enabled. I took the
baseline as the 6.10.0-rc1 tip sched/core kernel. No regressions were found.

+------------------------------------------------------+--------------------+----------+
| Benchmark                                            | Baseline           | Baseline |
|                                                      | (6.10.0-rc1 tip    | + patch  |
|                                                      |  sched/core)       |          |
+------------------------------------------------------+--------------------+----------+
|Hackbench run duration (sec)                          | 1                  | 1.01     |
|Lmbench simple fstat (usec)                           | 1                  | 0.99     |
|Lmbench simple open/close (usec)                      | 1                  | 1.02     |
|Lmbench simple read (usec)                            | 1                  | 1        |
|Lmbench simple stat (usec)                            | 1                  | 1.01     |
|Lmbench simple syscall (usec)                         | 1                  | 1.01     |
|Lmbench simple write (usec)                           | 1                  | 1        |
|stressng (bogo ops)                                   | 1                  | 0.94     |
|Unixbench execl throughput (lps)                      | 1                  | 0.97     |
|Unixbench Pipebased Context Switching throughput (lps)| 1                  | 0.94     |
|Unixbench Process Creation (lps)                      | 1                  | 1        |
|Unixbench Shell Scripts (1 concurrent) (lpm)          | 1                  | 1        |
|Unixbench Shell Scripts (8 concurrent) (lpm)          | 1                  | 1.01     |
+------------------------------------------------------+--------------------+----------+

Thank you,
Anjali K
On 10/7/24 18:20, Anjali K wrote: > Hi, I tested this patch to see if it causes any regressions on bare-metal power9 systems with microbenchmarks. > The test system is a 2 NUMA node 128 cpu powernv power9 system. The conservative governor is enabled. > I took the baseline as the 6.10.0-rc1 tip sched/core kernel. > No regressions were found. > > +------------------------------------------------------+--------------------+----------+ > | Benchmark | Baseline | Baseline | > | | (6.10.0-rc1 tip | + patch | > | | sched/core) | | > +------------------------------------------------------+--------------------+----------+ > |Hackbench run duration (sec) | 1 | 1.01 | > |Lmbench simple fstat (usec) | 1 | 0.99 | > |Lmbench simple open/close (usec) | 1 | 1.02 | > |Lmbench simple read (usec) | 1 | 1 | > |Lmbench simple stat (usec) | 1 | 1.01 | > |Lmbench simple syscall (usec) | 1 | 1.01 | > |Lmbench simple write (usec) | 1 | 1 | > |stressng (bogo ops) | 1 | 0.94 | > |Unixbench execl throughput (lps) | 1 | 0.97 | > |Unixbench Pipebased Context Switching throughput (lps)| 1 | 0.94 | > |Unixbench Process Creation (lps) | 1 | 1 | > |Unixbench Shell Scripts (1 concurrent) (lpm) | 1 | 1 | > |Unixbench Shell Scripts (8 concurrent) (lpm) | 1 | 1.01 | > +------------------------------------------------------+--------------------+----------+ > > Thank you, > Anjali K > The default CPUFREQ_DBS_MIN_SAMPLING_INTERVAL is still to have 2 ticks between cpufreq updates on conservative/ondemand. What is your sampling_rate setting? What's your HZ? Interestingly the context switch heavy benchmarks still show -6% don't they? Do you mind trying schedutil with a reasonable rate_limit_us, too? Regards, Christian
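For context on the HZ question: the floor being referred to is two scheduler ticks, so its value in microseconds depends on the kernel's HZ. A quick back-of-the-envelope calculation is below; the exact constant and clamping in the tree may differ, so treat the numbers as illustrative.

#include <stdio.h>

/* Two-tick minimum sampling interval for ondemand/conservative, per HZ. */
int main(void)
{
	const int hz_values[] = { 100, 250, 300, 1000 };

	for (unsigned int i = 0; i < sizeof(hz_values) / sizeof(hz_values[0]); i++) {
		long tick_us = 1000000L / hz_values[i];

		printf("HZ=%4d: tick = %5ld us, 2-tick floor = %5ld us\n",
		       hz_values[i], tick_us, 2 * tick_us);
	}
	return 0;
}

So on a HZ=100 kernel these governors cannot sample more often than every 20ms, while HZ=1000 brings the floor down to 2ms, which is why the HZ and sampling_rate values matter when interpreting the table above.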
On 10/8/24 10:56, Christian Loehle wrote: > On 10/7/24 18:20, Anjali K wrote: >> Hi, I tested this patch to see if it causes any regressions on bare-metal power9 systems with microbenchmarks. >> The test system is a 2 NUMA node 128 cpu powernv power9 system. The conservative governor is enabled. >> I took the baseline as the 6.10.0-rc1 tip sched/core kernel. >> No regressions were found. >> >> +------------------------------------------------------+--------------------+----------+ >> | Benchmark | Baseline | Baseline | >> | | (6.10.0-rc1 tip | + patch | >> | | sched/core) | | >> +------------------------------------------------------+--------------------+----------+ >> |Hackbench run duration (sec) | 1 | 1.01 | >> |Lmbench simple fstat (usec) | 1 | 0.99 | >> |Lmbench simple open/close (usec) | 1 | 1.02 | >> |Lmbench simple read (usec) | 1 | 1 | >> |Lmbench simple stat (usec) | 1 | 1.01 | >> |Lmbench simple syscall (usec) | 1 | 1.01 | >> |Lmbench simple write (usec) | 1 | 1 | >> |stressng (bogo ops) | 1 | 0.94 | >> |Unixbench execl throughput (lps) | 1 | 0.97 | >> |Unixbench Pipebased Context Switching throughput (lps)| 1 | 0.94 | >> |Unixbench Process Creation (lps) | 1 | 1 | >> |Unixbench Shell Scripts (1 concurrent) (lpm) | 1 | 1 | >> |Unixbench Shell Scripts (8 concurrent) (lpm) | 1 | 1.01 | >> +------------------------------------------------------+--------------------+----------+ >> >> Thank you, >> Anjali K >> > > The default CPUFREQ_DBS_MIN_SAMPLING_INTERVAL is still to have 2 ticks between > cpufreq updates on conservative/ondemand. > What is your sampling_rate setting? What's your HZ? > Interestingly the context switch heavy benchmarks still show -6% don't they? > Do you mind trying schedutil with a reasonable rate_limit_us, too? After playing with this a bit more I can see a ~6% regression on workloads like hackbench too. That is to around 80% because of the update in check_preempt_wakeup_fair(), the rest because of the context-switch. Overall the number of cpufreq_update_util() calls for hackbench -pTl 20000 increased by a factor of 20-25x, removing the one in check_preempt_wakeup_fair() brings this down to 10x. For other workloads the amount of cpufreq_update_util() calls is in the same ballpark as mainline. I did also look into the forced_update mechanism because that still bugged me and have to say, I'd prefer removing rate_limit_us, last_freq_update_time and freq_update_delay_ns altogether. The number of updates blocked by the rate_limit was already pretty low and have become negligible now for most workloads/platforms. commit 37c6dccd6837 ("cpufreq: Remove LATENCY_MULTIPLIER") put the rate_limit_us in the microseconds but even for rate_limit_us==2000 I get on a rk3588 ([LLLL][bb][bb]), 250HZ: mainline: update_util update_util dropped by rate_limit_us actual freq changes 60s idle: 932 48 12 fio --name=test --rw=randread --bs=4k --runtime=30 --time_based --filename=/dev/nullb0 --thinktime=1ms 40274 129 36 hackbench -pTl 20000 319331 523 41 with $SUBJECT and rate_limit_us==93: 60s idle: 1031 5 11 fio --name=test --rw=randread --bs=4k --runtime=30 --time_based --filename=/dev/nullb0 --thinktime=1ms 40297 17 32 hackbench -pTl 20000 7252343 600 60 just to mention a few. This obviously depends on the OPPs, workload, and HZ though. 
Overall I find the update (mostly) coming from the perf-domain (and thus the
sugov update_lock also mostly being contended there) quite appealing, but given
that we update more often in terms of frequency and arguably have more code
locations calling the update (reintroduction of the update at enqueue), what
exactly are we still consolidating here?

Regards,
Christian
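Since the numbers above hinge on how many of the extra cpufreq_update_util() calls actually survive the gating, here is a condensed standalone model of the check in sugov_should_update_freq() as changed by this patch (flag values copied from the hunks quoted further down; the limits_changed/work_in_progress handling is omitted): requests flagged as coming from a new enqueue are dropped if the last update was less than 1ms ago, forced requests bypass rate_limit_us, and everything else honours it.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SCHED_CPUFREQ_FORCE_UPDATE	(1U << 1)	/* ignore transition_delay_us */
#define SCHED_CPUFREQ_TASK_ENQUEUED	(1U << 2)	/* new fair task was enqueued */
#define NSEC_PER_MSEC			1000000LL

/* Does a request at 'now' pass, given when the last update was granted? */
static bool should_update(int64_t now, int64_t last_update,
			  int64_t freq_update_delay_ns, unsigned int flags)
{
	int64_t delta_ns = now - last_update;

	/* Enqueue-triggered requests: at most one per millisecond. */
	if ((flags & SCHED_CPUFREQ_TASK_ENQUEUED) && delta_ns < NSEC_PER_MSEC)
		return false;

	/* Forced requests (RT/DL/iowait) ignore the rate limit. */
	if (flags & SCHED_CPUFREQ_FORCE_UPDATE)
		return true;

	return delta_ns >= freq_update_delay_ns;
}

int main(void)
{
	int64_t rate_limit_ns = 2000 * 1000LL;	/* rate_limit_us = 2000 */

	printf("enqueue after 0.5ms: %d\n",
	       should_update(500000, 0, rate_limit_ns, SCHED_CPUFREQ_TASK_ENQUEUED));
	printf("forced after 0.1ms : %d\n",
	       should_update(100000, 0, rate_limit_ns, SCHED_CPUFREQ_FORCE_UPDATE));
	printf("plain after 1.5ms  : %d\n",
	       should_update(1500000, 0, rate_limit_ns, 0));
	printf("plain after 2.5ms  : %d\n",
	       should_update(2500000, 0, rate_limit_ns, 0));
	return 0;
}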
On 7/28/24 19:45, Qais Yousef wrote: > Improve the interaction with cpufreq governors by making the > cpufreq_update_util() calls more intentional. > > At the moment we send them when load is updated for CFS, bandwidth for > DL and at enqueue/dequeue for RT. But this can lead to too many updates > sent in a short period of time and potentially be ignored at a critical > moment due to the rate_limit_us in schedutil. > > For example, simultaneous task enqueue on the CPU where 2nd task is > bigger and requires higher freq. The trigger to cpufreq_update_util() by > the first task will lead to dropping the 2nd request until tick. Or > another CPU in the same policy triggers a freq update shortly after. > > Updates at enqueue for RT are not strictly required. Though they do help > to reduce the delay for switching the frequency and the potential > observation of lower frequency during this delay. But current logic > doesn't intentionally (at least to my understanding) try to speed up the > request. > > To help reduce the amount of cpufreq updates and make them more > purposeful, consolidate them into these locations: > > 1. context_switch() > 2. task_tick_fair() > 3. sched_balance_update_blocked_averages() > 4. on sched_setscheduler() syscall that changes policy or uclamp values > 5. on check_preempt_wakeup_fair() if wakeup preemption failed > 6. on __add_running_bw() to guarantee DL bandwidth requirements. > > The update at context switch should help guarantee that RT get the right > frequency straightaway when they're RUNNING. As mentioned though the > update will happen slightly after enqueue_task(); though in an ideal > world these tasks should be RUNNING ASAP and this additional delay > should be negligible. For fair tasks we need to make sure we send > a single update for every decay for the root cfs_rq. Any changes to the > rq will be deferred until the next task is ready to run, or we hit TICK. > But we are guaranteed the task is running at a level that meets its > requirements after enqueue. > > To guarantee RT and DL tasks updates are never missed, we add a new > SCHED_CPUFREQ_FORCE_UPDATE to ignore the rate_limit_us. If we are > already running at the right freq, the governor will end up doing > nothing, but we eliminate the risk of the task ending up accidentally > running at the wrong freq due to rate_limit_us. > > Similarly for iowait boost, we ignore rate limits. We also handle a case > of a boost reset prematurely by adding a guard in sugov_iowait_apply() > to reduce the boost after 1ms which seems iowait boost mechanism relied > on rate_limit_us and cfs_rq.decayed preventing any updates to happen > soon after iowait boost. > > The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit > time stamps otherwise we can end up delaying updates for normal > requests. > > As a simple optimization, we avoid sending cpufreq updates when > switching from RT to another RT as RT tasks run at max freq by default. > If CONFIG_UCLAMP_TASK is enabled, we can do a simple check to see if > uclamp_min is different to avoid unnecessary cpufreq update as most RT > tasks are likely to be running at the same performance level, so we can > avoid unnecessary overhead of forced updates when there's nothing to do. > > We also ensure to ignore cpufreq udpates for sugov workers at context > switch if it was prev task. > > The update at task_tick_fair() will guarantee that the governor will > follow any updates to load for tasks/CPU or due to new enqueues/dequeues > to the rq. 
Since DL and RT always run at constant frequencies and have > no load tracking, this is only required for fair tasks. > > The update at update_blocked_averages() will ensure we decay frequency > as the CPU becomes idle for long enough. > > If the currently running task changes its policy or uclamp values, we > ensure we follow up with cpufreq update to ensure we follow up with any > potential new perf requirements based on the new change. > > To handle systems with long TICK where tasks could end up enqueued but > no preemption happens until TICK, we add an update in > check_preempt_wakeup_fair() if wake up preemption fails. This will send > special SCHED_CPUFREQ_TASK_ENQUEUED cpufreq update to tell the governor > that the state of the CPU has changed and it can consider an update if > it deems worthwhile. In schedutil this will do an update if no update > was done since 1ms which is how often util_avg changes roughly. > > To ensure DL tasks bandwidth are respected, we do the update on > __add_running_bw() instead of context switch as the delay could result > in missing a deadline when multiple DL tasks are RUNNING. > > Since now DL tasks always ignore rate limit, remove > ignore_dl_rate_limit() function as it's no longer necessary. > > Also move updating sg_cpu->last_update inside sugov_iowait_boost() where > this variable is associated and rename it to last_iowait_update to > better reflect it is iowait boost specific. > > Note worthy that we still have the following race condition on systems > that have shared policy: > > * CPUs with shared policy can end up sending simultaneous cpufreq > updates requests where the 2nd one will be unlucky and get blocked by > the rate_limit_us (schedutil). > > We can potentially address this limitation later, but it is out of the > scope of this patch. > > Signed-off-by: Qais Yousef <qyousef@layalina.io> > --- > > Changes since v6: > > * Fix typos in commit message > * Move DL to enqueue to address breaking bandwidth rules for DL > * Do freq updates for SCHED_IDLE too > * Ensure wakeup preemption will cause cpufreq updates even if > cfs_rq.decayed was false as util_est could be high and cfs_rq.decayed > wouldn't reflect that. > * Ensure we send an update if we switch to fair from RT or DL as this > is an opportunity to reduce freq even if cfs_rq.decayed is false. > * If sched_setsched() syscall for a queued task requires cpufreq > update, handle it like we do for wakeup_preemption_check() > * Use 1ms instead of base_slice to send an update if wakeup preemption > fails > * Fix a bug in setting sg_cpu->last_update being updated too early > causing some systems to always request 1024 io boost. > * Change delta_ns <= NSEC_PER_MSEC to be strictly less than > delta_ns < NSEC_PER_MSEC for iowait boost to match the condition for > when a task was enqueued. 
> * Moved the results of context switch test out of the commit messages > as I am seeing some variations that I am not sure are due to binary > differences causing weird caching effect or true overhead > > Results of > > taskset 1 perf record perf stat --repeat 10 -e cycles,instructions,task-clock perf bench sched pipe > > on AMD 3900X to verify any potential overhead because of the addition at > context switch against sched-core-2024-07-16 tip/sched/core > > tip sched-core-2024-07-16 schedutil: > ------------------------------------ > > Performance counter stats for '/home/qyousef/utils/perf bench sched pipe' (10 runs): > > 39,296,424,438 cycles # 3.208 GHz ( +- 0.05% ) > 20,350,055,343 instructions # 0.52 insn per cycle ( +- 0.03% ) > 12,274.17 msec task-clock # 1.002 CPUs utilized ( +- 0.06% ) > > 12.24917 +- 0.00783 seconds time elapsed ( +- 0.06% ) > > tip sched-core-2024-07-16 performance: > -------------------------------------- > > Performance counter stats for '/home/qyousef/utils/perf bench sched pipe' (10 runs): > > 40,610,243,585 cycles # 3.268 GHz ( +- 0.15% ) > 21,252,175,791 instructions # 0.53 insn per cycle ( +- 0.05% ) > 12,443.34 msec task-clock # 1.001 CPUs utilized ( +- 0.06% ) > > 12.42761 +- 0.00672 seconds time elapsed ( +- 0.05% ) > > patch: tip sched-core-2024-07-16 schedutil: > ------------------------------------------- > > Performance counter stats for '/home/qyousef/utils/perf bench sched pipe' (10 runs): > > 40,706,113,323 cycles # 3.253 GHz ( +- 0.07% ) > 21,163,304,319 instructions # 0.52 insn per cycle ( +- 0.04% ) > 12,494.93 msec task-clock # 0.998 CPUs utilized ( +- 0.04% ) > > 12.51557 +- 0.00486 seconds time elapsed ( +- 0.04% ) > > patch: tip sched-core-2024-07-16 performance: > --------------------------------------------- > > Performance counter stats for '/home/qyousef/utils/perf bench sched pipe' (10 runs): > > 39,654,998,545 cycles # 3.220 GHz ( +- 0.12% ) > 20,554,376,621 instructions # 0.52 insn per cycle ( +- 0.12% ) > 12,317.02 msec task-clock # 1.000 CPUs utilized ( +- 0.16% ) > > 12.3166 +- 0.0193 seconds time elapsed ( +- 0.16% ) > > We do better in performance governor than tip/sched/core. But schedutil looks > worse. Looking at perf diff I can see update_load_avg() and > sugov_update_single_freq() but not sure if this is due to this patch per se > rather than strange binary difference creating unexpected effect. The hot > instructions in update_load_avg() are not related to the new code added there. > Similarly for check_preempt_wakeup_fair(). > > For sugov_update_single_freq() this hasn't shown up in previous versions. > Removing the new cpufreq update in check_preempt_wakeup_fair() didn't help. > > Note that in v6 same test showed that schedutil was on par but performance was > slightly worse. Though the test was against 6.8.7 stable kernel then. 
> > perf diff schedutil: > -------------------- > > 10.56% -2.56% [kernel.kallsyms] [k] delay_halt_mwaitx > 14.56% -1.46% [kernel.kallsyms] [k] native_read_msr > 14.19% -1.40% [kernel.kallsyms] [k] native_write_msr > 0.63% +0.54% [kernel.kallsyms] [k] restore_fpregs_from_fpstate > 1.52% +0.52% [kernel.kallsyms] [k] update_load_avg > 0.01% +0.47% [kernel.kallsyms] [k] sugov_update_single_freq > 3.44% -0.35% [kernel.kallsyms] [k] amd_pmu_addr_offset > 4.67% -0.31% [kernel.kallsyms] [k] x86_pmu_disable_all > 0.35% +0.29% [kernel.kallsyms] [k] check_preempt_wakeup_fair > 1.81% -0.28% [kernel.kallsyms] [k] amd_pmu_check_overflow > 1.81% -0.27% [kernel.kallsyms] [k] amd_pmu_test_overflow_topbit > 1.20% +0.26% [kernel.kallsyms] [k] pick_next_task_fair > 0.01% +0.22% [kernel.kallsyms] [k] __get_user_8 > 1.41% +0.21% [kernel.kallsyms] [k] update_curr > 1.18% -0.21% [kernel.kallsyms] [k] delay_halt > 0.50% +0.21% [kernel.kallsyms] [k] pick_eevdf > 3.13% +0.20% [kernel.kallsyms] [k] srso_safe_ret > 0.00% +0.18% [kernel.kallsyms] [k] sugov_get_util > 1.23% +0.17% [kernel.kallsyms] [k] __schedule > 0.50% +0.16% [kernel.kallsyms] [k] enqueue_entity > 0.57% +0.16% [kernel.kallsyms] [k] psi_task_change > 0.57% +0.15% [kernel.kallsyms] [k] enqueue_task_fair > 1.06% -0.15% [kernel.kallsyms] [k] apparmor_file_permission > 0.80% +0.15% [kernel.kallsyms] [k] try_to_wake_up > 1.07% +0.14% [kernel.kallsyms] [k] psi_task_switch > 1.58% +0.14% [kernel.kallsyms] [k] pipe_write > 0.86% +0.14% [kernel.kallsyms] [k] syscall_exit_to_user_mode > 1.02% +0.13% [kernel.kallsyms] [k] native_sched_clock > 0.46% +0.11% [kernel.kallsyms] [k] __update_load_avg_se > > perf diff performance: > ---------------------- > > 13.09% +3.06% [kernel.kallsyms] [k] native_read_msr > 13.12% +2.84% [kernel.kallsyms] [k] native_write_msr > 7.94% +2.34% [kernel.kallsyms] [k] delay_halt_mwaitx > 2.15% -0.93% [kernel.kallsyms] [k] update_curr > 4.42% +0.87% [kernel.kallsyms] [k] x86_pmu_disable_all > 3.12% +0.74% [kernel.kallsyms] [k] amd_pmu_addr_offset > 2.84% -0.59% [kernel.kallsyms] [k] psi_group_change > 1.44% +0.53% [kernel.kallsyms] [k] amd_pmu_check_overflow > 1.45% +0.50% [kernel.kallsyms] [k] amd_pmu_test_overflow_topbit > 0.47% -0.47% [kernel.kallsyms] [k] __calc_delta.constprop.0 > 1.60% -0.40% [kernel.kallsyms] [k] pick_next_task_fair > 1.97% -0.37% [kernel.kallsyms] [k] update_load_avg > 0.57% -0.37% [kernel.kallsyms] [k] avg_vruntime > 0.82% -0.37% [kernel.kallsyms] [k] enqueue_task_fair > 1.54% -0.34% [kernel.kallsyms] [k] __schedule > 0.79% -0.32% [kernel.kallsyms] [k] pick_eevdf > 0.88% +0.32% [kernel.kallsyms] [k] delay_halt > 0.59% -0.28% [kernel.kallsyms] [k] update_cfs_group > 0.86% -0.25% [kernel.kallsyms] [k] try_to_wake_up > 1.18% -0.25% [kernel.kallsyms] [k] native_sched_clock > 0.45% -0.24% [kernel.kallsyms] [k] put_prev_entity > 0.49% -0.24% [kernel.kallsyms] [k] ttwu_do_activate > 0.64% -0.23% [kernel.kallsyms] [k] enqueue_entity > 0.72% -0.22% [kernel.kallsyms] [k] __update_load_avg_cfs_rq > 1.57% -0.22% [kernel.kallsyms] [k] pipe_write > 0.50% -0.20% [kernel.kallsyms] [k] update_min_vruntime > 3.31% -0.19% [kernel.kallsyms] [k] srso_safe_ret > 1.31% -0.18% [kernel.kallsyms] [k] psi_task_switch > 0.52% -0.18% [kernel.kallsyms] [k] check_preempt_wakeup_fair > 0.32% -0.16% [kernel.kallsyms] [k] __enqueue_entity > 0.87% -0.16% [kernel.kallsyms] [k] dequeue_task_fair > 0.44% -0.14% [kernel.kallsyms] [k] pick_next_entity > 0.63% -0.13% [kernel.kallsyms] [k] psi_task_change > 0.62% -0.13% [kernel.kallsyms] [k] 
sched_clock_cpu > > Changes since v5: > > * Fix a bug where switching between RT and sugov tasks triggered an > endless cycle of cpufreq updates. > * Only do cpufreq updates at tick for fair after verifying > rq->cfs.decayed > * Remove optimization in update_load_avg() to avoid sending an update > if util hasn't changed that caused a bug when switching from Idle > * Handle systems with long ticks by adding extra update on > check_preempt_wakeup_fair(). The idea is to rely on context switch > but still consider an update if wakeup preemption failed and no > update was sent since sysctl_sched_base_slice > * Remove ignore_dl_rate_limit() as this function is now redundant > * move sg_cpu->last_update = time inside sugov_iowait_boost() > * Update commit message with new details and with perf diff output > > Changes since v4: > > * Fix updating freq when uclamp changes before the dequeue/enqueue > dance. (Hongyan) > * Rebased on top of tip/sched/core 6.10-rc1 and resolve some conflicts > due to code shuffling to syscalls.c. Added new function > update_cpufreq_current() to be used outside core.c when > task_current() requires cpufreq update. > > Changes since v3: > > * Omit cpufreq updates at attach/detach_entity_load_avg(). They share > the update path from enqueue/dequeue which is not intended to trigger > an update. And task_change_group_fair() is not expected to cause the > root cfs_rq util to change significantly to warrant an immediate > update for enqueued tasks. Better defer for next context switch to > sample the state of the cpu taking all changes into account before > the next task is due to run. > Dietmar also pointed out a bug where we could send more updates vs > without the patch in this path as I wasn't sending the update for > cfs_rq == &rq->cfs. > > Changes since v2: > > * Clean up update_cpufreq_ctx_switch() to reduce branches (Peter) > * Fix issue with cpufreq updates missed on switching from idle (Vincent) > * perf bench sched pipe regressed after fixing the switch from idle, > detect when util_avg has changed when cfs_rq->decayed to fix it > * Ensure to issue cpufreq updates when task_current() switches > policy/uclamp values > > Changes since v1: > > * Use taskset and measure with performance governor as Ingo suggested > * Remove the static key as I found out we always register a function > for cpu_dbs in cpufreq_governor.c; and as Christian pointed out it > trigger a lock debug warning. > * Improve detection of sugov workers by using SCHED_FLAG_SUGOV > * Guard against NSEC_PER_MSEC instead of TICK_USEC to avoid prematurely > reducing iowait boost as the latter was a NOP and like > sugov_iowait_reset() like Christian pointed out. 
> > v1 discussion: https://lore.kernel.org/all/20240324020139.1032473-1-qyousef@layalina.io/ > v2 discussion: https://lore.kernel.org/lkml/20240505233103.168766-1-qyousef@layalina.io/ > v3 discussion: https://lore.kernel.org/lkml/20240512190018.531820-1-qyousef@layalina.io/ > v4 discussion: https://lore.kernel.org/lkml/20240516204802.846520-1-qyousef@layalina.io/ > v5 discussion: https://lore.kernel.org/lkml/20240530104653.1234004-1-qyousef@layalina.io/ > v6 discussion: https://lore.kernel.org/lkml/20240619201409.2071728-1-qyousef@layalina.io/ > > include/linux/sched/cpufreq.h | 4 +- > kernel/sched/core.c | 116 +++++++++++++++++++++++++++-- > kernel/sched/cpufreq_schedutil.c | 122 +++++++++++++++++++------------ > kernel/sched/deadline.c | 10 ++- > kernel/sched/fair.c | 91 +++++++++++------------ > kernel/sched/rt.c | 8 +- > kernel/sched/sched.h | 9 ++- > kernel/sched/syscalls.c | 30 ++++++-- > 8 files changed, 271 insertions(+), 119 deletions(-) > > diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h > index bdd31ab93bc5..5409a9f79cc0 100644 > --- a/include/linux/sched/cpufreq.h > +++ b/include/linux/sched/cpufreq.h > @@ -8,7 +8,9 @@ > * Interface between cpufreq drivers and the scheduler: > */ > > -#define SCHED_CPUFREQ_IOWAIT (1U << 0) > +#define SCHED_CPUFREQ_IOWAIT (1U << 0) > +#define SCHED_CPUFREQ_FORCE_UPDATE (1U << 1) /* ignore transition_delay_us */ > +#define SCHED_CPUFREQ_TASK_ENQUEUED (1U << 2) /* new fair task was enqueued */ > > #ifdef CONFIG_CPU_FREQ > struct cpufreq_policy; > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 6d35c48239be..a31d91a224d0 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -153,6 +153,9 @@ const_debug unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK; > > __read_mostly int scheduler_running; > > +static __always_inline void > +__update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev); > + > #ifdef CONFIG_SCHED_CORE > > DEFINE_STATIC_KEY_FALSE(__sched_core_enabled); > @@ -2038,17 +2041,24 @@ inline int task_curr(const struct task_struct *p) > * this means any call to check_class_changed() must be followed by a call to > * balance_callback(). > */ > -void check_class_changed(struct rq *rq, struct task_struct *p, > +bool check_class_changed(struct rq *rq, struct task_struct *p, > const struct sched_class *prev_class, > int oldprio) > { > + bool class_changed = false; > + > if (prev_class != p->sched_class) { > if (prev_class->switched_from) > prev_class->switched_from(rq, p); > > p->sched_class->switched_to(rq, p); > - } else if (oldprio != p->prio || dl_task(p)) > + > + class_changed = true; > + } else if (oldprio != p->prio || dl_task(p)) { > p->sched_class->prio_changed(rq, p, oldprio); > + } > + > + return class_changed; > } > > void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags) > @@ -4913,6 +4923,93 @@ static inline void __balance_callbacks(struct rq *rq) > > #endif > > +static __always_inline void > +__update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev) > +{ > +#ifdef CONFIG_CPU_FREQ > + if (prev && prev->dl.flags & SCHED_FLAG_SUGOV) { > + /* Sugov just did an update, don't be too aggressive */ > + return; > + } > + > + /* > + * RT and DL should always send a freq update. But we can do some > + * simple checks to avoid it when we know it's not necessary. > + * > + * iowait_boost will always trigger a freq update too. > + * > + * Fair tasks will only trigger an update if the root cfs_rq has > + * decayed. 
> + * > + * Everything else should do nothing. > + */ > + switch (current->policy) { > + case SCHED_NORMAL: > + case SCHED_BATCH: > + case SCHED_IDLE: > + if (unlikely(current->in_iowait)) { > + cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE); > + return; > + } > + > +#ifdef CONFIG_SMP > + /* > + * Send an update if we switched from RT or DL as they tend to > + * boost the CPU and we are likely able to reduce the freq now. > + */ > + rq->cfs.decayed |= prev && (rt_policy(prev->policy) || dl_policy(prev->policy)); > + > + if (unlikely(rq->cfs.decayed)) { > + rq->cfs.decayed = false; > + cpufreq_update_util(rq, 0); > + return; > + } > +#else > + cpufreq_update_util(rq, 0); > +#endif > + return; > + case SCHED_FIFO: > + case SCHED_RR: > + if (prev && rt_policy(prev->policy)) { > +#ifdef CONFIG_UCLAMP_TASK > + unsigned long curr_uclamp_min = uclamp_eff_value(current, UCLAMP_MIN); > + unsigned long prev_uclamp_min = uclamp_eff_value(prev, UCLAMP_MIN); > + > + if (curr_uclamp_min == prev_uclamp_min) > +#endif > + return; > + } > +#ifdef CONFIG_SMP > + /* Stopper task masquerades as RT */ > + if (unlikely(current->sched_class == &stop_sched_class)) > + return; > +#endif > + cpufreq_update_util(rq, SCHED_CPUFREQ_FORCE_UPDATE); > + return; > + case SCHED_DEADLINE: > + /* > + * This is handled at enqueue to avoid breaking DL bandwidth > + * rules when multiple DL tasks are running on the same CPU. > + * Deferring till context switch here could mean the bandwidth > + * calculations would be broken to ensure all the DL tasks meet > + * their deadlines. > + */ > + return; > + default: > + return; > + } > +#endif > +} > + > +/* > + * Call when currently running task had an attribute change that requires > + * an immediate cpufreq update. > + */ > +void update_cpufreq_current(struct rq *rq) > +{ > + __update_cpufreq_ctx_switch(rq, NULL); > +} > + > static inline void > prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf) > { > @@ -4930,7 +5027,7 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf > #endif > } > > -static inline void finish_lock_switch(struct rq *rq) > +static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev) > { > /* > * If we are tracking spinlock dependencies then we have to > @@ -4939,6 +5036,11 @@ static inline void finish_lock_switch(struct rq *rq) > */ > spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_); > __balance_callbacks(rq); > + /* > + * Request freq update after __balance_callbacks to take into account > + * any changes to rq. 
> + */ > + __update_cpufreq_ctx_switch(rq, prev); > raw_spin_rq_unlock_irq(rq); > } > > @@ -5057,7 +5159,7 @@ static struct rq *finish_task_switch(struct task_struct *prev) > perf_event_task_sched_in(prev, current); > finish_task(prev); > tick_nohz_task_switch(); > - finish_lock_switch(rq); > + finish_lock_switch(rq, prev); > finish_arch_post_lock_switch(); > kcov_finish_switch(current); > /* > @@ -6920,6 +7022,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task) > int prio, oldprio, queued, running, queue_flag = > DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; > const struct sched_class *prev_class; > + bool class_changed; > struct rq_flags rf; > struct rq *rq; > > @@ -7021,7 +7124,10 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task) > if (running) > set_next_task(rq, p); > > - check_class_changed(rq, p, prev_class, oldprio); > + class_changed = check_class_changed(rq, p, prev_class, oldprio); > + if (class_changed && running) > + update_cpufreq_current(rq); > + > out_unlock: > /* Avoid rq from going away on us: */ > preempt_disable(); > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > index eece6244f9d2..64f614b3db20 100644 > --- a/kernel/sched/cpufreq_schedutil.c > +++ b/kernel/sched/cpufreq_schedutil.c > @@ -44,7 +44,7 @@ struct sugov_cpu { > > bool iowait_boost_pending; > unsigned int iowait_boost; > - u64 last_update; > + u64 last_iowait_update; > > unsigned long util; > unsigned long bw_min; > @@ -59,10 +59,31 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); > > /************************ Governor internals ***********************/ > > -static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time) > +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time, > + unsigned int flags) > { > s64 delta_ns; > > + delta_ns = time - sg_policy->last_freq_update_time; > + > + /* > + * We want to update cpufreq at context switch, but on systems with > + * long TICK values, this can happen after a long time while more tasks > + * would have been added meanwhile leaving us potentially running at > + * inadequate frequency for extended period of time. > + * > + * This logic should only apply when new fair task was added to the > + * CPU, we'd want to defer to context switch as much as possible, but > + * to avoid the potential delays mentioned above, let's check if this > + * additional tasks warrants sending an update sooner. > + * > + * We want to ensure there's at least an update every 1ms. > + */ > + if (likely(flags & SCHED_CPUFREQ_TASK_ENQUEUED)) { > + if (delta_ns < NSEC_PER_MSEC) > + return false; > + } > + > /* > * Since cpufreq_update_util() is called with rq->lock held for > * the @target_cpu, our per-CPU data is fully serialized. 
> @@ -87,13 +108,14 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time) > return true; > } > > - delta_ns = time - sg_policy->last_freq_update_time; > + if (unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE)) > + return true; > > return delta_ns >= sg_policy->freq_update_delay_ns; > } > > static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time, > - unsigned int next_freq) > + unsigned int next_freq, unsigned int flags) > { > if (sg_policy->need_freq_update) > sg_policy->need_freq_update = cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS); > @@ -101,7 +123,9 @@ static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time, > return false; > > sg_policy->next_freq = next_freq; > - sg_policy->last_freq_update_time = time; > + > + if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE)) > + sg_policy->last_freq_update_time = time; > > return true; > } > @@ -219,7 +243,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost) > static bool sugov_iowait_reset(struct sugov_cpu *sg_cpu, u64 time, > bool set_iowait_boost) > { > - s64 delta_ns = time - sg_cpu->last_update; > + s64 delta_ns = time - sg_cpu->last_iowait_update; > > /* Reset boost only if a tick has elapsed since last request */ > if (delta_ns <= TICK_NSEC) > @@ -249,30 +273,33 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, > unsigned int flags) > { > bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT; > + bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE; > > /* Reset boost if the CPU appears to have been idle enough */ > - if (sg_cpu->iowait_boost && > + if (sg_cpu->iowait_boost && !forced_update && > sugov_iowait_reset(sg_cpu, time, set_iowait_boost)) > - return; > + goto done; > > /* Boost only tasks waking up after IO */ > if (!set_iowait_boost) > - return; > + goto done; > > /* Ensure boost doubles only one time at each request */ > if (sg_cpu->iowait_boost_pending) > - return; > + goto done; > sg_cpu->iowait_boost_pending = true; > > /* Double the boost at each request */ > if (sg_cpu->iowait_boost) { > sg_cpu->iowait_boost = > min_t(unsigned int, sg_cpu->iowait_boost << 1, SCHED_CAPACITY_SCALE); > - return; > + goto done; > } > > /* First wakeup after IO: start with minimum boost */ > sg_cpu->iowait_boost = IOWAIT_BOOST_MIN; > +done: > + sg_cpu->last_iowait_update = time; > } > > /** > @@ -294,17 +321,34 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, > * being more conservative on tasks which does sporadic IO operations. > */ > static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time, > - unsigned long max_cap) > + unsigned long max_cap, unsigned int flags) > { > + bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE; > + s64 delta_ns = time - sg_cpu->last_iowait_update; > + > /* No boost currently required */ > if (!sg_cpu->iowait_boost) > return 0; > > + if (forced_update) > + goto apply_boost; > + > /* Reset boost if the CPU appears to have been idle enough */ > if (sugov_iowait_reset(sg_cpu, time, false)) > return 0; > > if (!sg_cpu->iowait_boost_pending) { > + /* > + * This logic relied on PELT signal decays happening once every > + * 1ms. But due to changes to how updates are done now, we can > + * end up with more request coming up leading to iowait boost > + * to be prematurely reduced. Make the assumption explicit > + * until we improve the iowait boost logic to be better in > + * general as it is due for an overhaul. 
> + */ > + if (delta_ns < NSEC_PER_MSEC) > + goto apply_boost; > + > /* > * No boost pending; reduce the boost value. > */ > @@ -315,6 +359,7 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time, > } > } > > +apply_boost: > sg_cpu->iowait_boost_pending = false; > > /* > @@ -337,31 +382,18 @@ static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) > static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; } > #endif /* CONFIG_NO_HZ_COMMON */ > > -/* > - * Make sugov_should_update_freq() ignore the rate limit when DL > - * has increased the utilization. > - */ > -static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu) > -{ > - if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_min) > - sg_cpu->sg_policy->limits_changed = true; > -} > - > static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu, > u64 time, unsigned long max_cap, > unsigned int flags) > { > unsigned long boost; > > - sugov_iowait_boost(sg_cpu, time, flags); > - sg_cpu->last_update = time; > - > - ignore_dl_rate_limit(sg_cpu); > - > - if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) > + if (!sugov_should_update_freq(sg_cpu->sg_policy, time, flags)) > return false; > > - boost = sugov_iowait_apply(sg_cpu, time, max_cap); > + sugov_iowait_boost(sg_cpu, time, flags); > + > + boost = sugov_iowait_apply(sg_cpu, time, max_cap, flags); > sugov_get_util(sg_cpu, boost); > > return true; > @@ -397,7 +429,7 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time, > sg_policy->cached_raw_freq = cached_freq; > } > > - if (!sugov_update_next_freq(sg_policy, time, next_f)) > + if (!sugov_update_next_freq(sg_policy, time, next_f, flags)) > return; > > /* > @@ -449,10 +481,12 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time, > cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min, > sg_cpu->util, max_cap); > > - sg_cpu->sg_policy->last_freq_update_time = time; > + if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE)) > + sg_cpu->sg_policy->last_freq_update_time = time; > } > > -static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time) > +static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time, > + unsigned int flags) > { > struct sugov_policy *sg_policy = sg_cpu->sg_policy; > struct cpufreq_policy *policy = sg_policy->policy; > @@ -465,7 +499,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time) > struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j); > unsigned long boost; > > - boost = sugov_iowait_apply(j_sg_cpu, time, max_cap); > + boost = sugov_iowait_apply(j_sg_cpu, time, max_cap, flags); > sugov_get_util(j_sg_cpu, boost); > > util = max(j_sg_cpu->util, util); > @@ -483,22 +517,20 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags) > > raw_spin_lock(&sg_policy->update_lock); > > - sugov_iowait_boost(sg_cpu, time, flags); > - sg_cpu->last_update = time; > + if (!sugov_should_update_freq(sg_policy, time, flags)) > + goto unlock; > > - ignore_dl_rate_limit(sg_cpu); > + sugov_iowait_boost(sg_cpu, time, flags); > > - if (sugov_should_update_freq(sg_policy, time)) { > - next_f = sugov_next_freq_shared(sg_cpu, time); > + next_f = sugov_next_freq_shared(sg_cpu, time, flags); > > - if (!sugov_update_next_freq(sg_policy, time, next_f)) > - goto unlock; > + if (!sugov_update_next_freq(sg_policy, time, next_f, flags)) > + goto unlock; > > - if (sg_policy->policy->fast_switch_enabled) > - 
cpufreq_driver_fast_switch(sg_policy->policy, next_f); > - else > - sugov_deferred_update(sg_policy); > - } > + if (sg_policy->policy->fast_switch_enabled) > + cpufreq_driver_fast_switch(sg_policy->policy, next_f); > + else > + sugov_deferred_update(sg_policy); > unlock: > raw_spin_unlock(&sg_policy->update_lock); > } > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c > index f59e5c19d944..8a4ccf532a7b 100644 > --- a/kernel/sched/deadline.c > +++ b/kernel/sched/deadline.c > @@ -251,8 +251,12 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq) > dl_rq->running_bw += dl_bw; > SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */ > SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw); > - /* kick cpufreq (see the comment in kernel/sched/sched.h). */ > - cpufreq_update_util(rq_of_dl_rq(dl_rq), 0); > + /* > + * Context switch handles updates, but this is an exception to ensure > + * multiple DL tasks run at the correct frequencies. We don't need > + * a cpufreq update on dequeue, context switch will handle that. > + */ > + cpufreq_update_util(rq_of_dl_rq(dl_rq), SCHED_CPUFREQ_FORCE_UPDATE); > } > > static inline > @@ -265,8 +269,6 @@ void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq) > SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */ > if (dl_rq->running_bw > old) > dl_rq->running_bw = 0; > - /* kick cpufreq (see the comment in kernel/sched/sched.h). */ > - cpufreq_update_util(rq_of_dl_rq(dl_rq), 0); > } > > static inline > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 9057584ec06d..8fe7a7124c70 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -3987,29 +3987,6 @@ static inline void update_cfs_group(struct sched_entity *se) > } > #endif /* CONFIG_FAIR_GROUP_SCHED */ > > -static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags) > -{ > - struct rq *rq = rq_of(cfs_rq); > - > - if (&rq->cfs == cfs_rq) { > - /* > - * There are a few boundary cases this might miss but it should > - * get called often enough that that should (hopefully) not be > - * a real problem. > - * > - * It will not get called when we go idle, because the idle > - * thread is a different class (!fair), nor will the utilization > - * number include things like RT tasks. > - * > - * As is, the util number is not freq-invariant (we'd have to > - * implement arch_scale_freq_capacity() for that). > - * > - * See cpu_util_cfs(). > - */ > - cpufreq_update_util(rq, flags); > - } > -} > - > #ifdef CONFIG_SMP > static inline bool load_avg_is_decayed(struct sched_avg *sa) > { > @@ -4687,8 +4664,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > add_tg_cfs_propagate(cfs_rq, se->avg.load_sum); > > - cfs_rq_util_change(cfs_rq, 0); > - > trace_pelt_cfs_tp(cfs_rq); > } > > @@ -4717,8 +4692,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > > add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum); > > - cfs_rq_util_change(cfs_rq, 0); > - > trace_pelt_cfs_tp(cfs_rq); > } > > @@ -4765,12 +4738,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s > */ > detach_entity_load_avg(cfs_rq, se); > update_tg_load_avg(cfs_rq); > - } else if (decayed) { > - cfs_rq_util_change(cfs_rq, 0); > - > - if (flags & UPDATE_TG) > - update_tg_load_avg(cfs_rq); > + } else if (decayed && (flags & UPDATE_TG)) { > + update_tg_load_avg(cfs_rq); > } > + > + /* > + * If this is the root cfs_rq, set the decayed flag to let the world > + * know a cpufreq update is required. 
> + */ > + if (cfs_rq == &rq_of(cfs_rq)->cfs) > + cfs_rq->decayed |= decayed; > } > > /* > @@ -5144,7 +5121,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) > > static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1) > { > - cfs_rq_util_change(cfs_rq, 0); > } > > static inline void remove_entity_load_avg(struct sched_entity *se) {} > @@ -6759,14 +6735,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) > */ > util_est_enqueue(&rq->cfs, p); > > - /* > - * If in_iowait is set, the code below may not trigger any cpufreq > - * utilization updates, so do it here explicitly with the IOWAIT flag > - * passed. > - */ > - if (p->in_iowait) > - cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT); > - > for_each_sched_entity(se) { > if (se->on_rq) > break; > @@ -8353,7 +8321,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > int cse_is_idle, pse_is_idle; > > if (unlikely(se == pse)) > - return; > + goto nopreempt; > > /* > * This is possible from callers such as attach_tasks(), in which we > @@ -8362,7 +8330,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * next-buddy nomination below. > */ > if (unlikely(throttled_hierarchy(cfs_rq_of(pse)))) > - return; > + goto nopreempt; > > if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK)) { > set_next_buddy(pse); > @@ -8379,7 +8347,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * below. > */ > if (test_tsk_need_resched(curr)) > - return; > + goto nopreempt; > > /* Idle tasks are by definition preempted by non-idle tasks. */ > if (unlikely(task_has_idle_policy(curr)) && > @@ -8391,7 +8359,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > * is driven by the tick): > */ > if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION)) > - return; > + goto nopreempt; > > find_matching_se(&se, &pse); > WARN_ON_ONCE(!pse); > @@ -8406,7 +8374,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > if (cse_is_idle && !pse_is_idle) > goto preempt; > if (cse_is_idle != pse_is_idle) > - return; > + goto nopreempt; > > cfs_rq = cfs_rq_of(se); > update_curr(cfs_rq); > @@ -8417,6 +8385,24 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int > if (pick_eevdf(cfs_rq) == pse) > goto preempt; > > +nopreempt: > + if (rq->cfs.h_nr_running > 1) { > +#ifdef CONFIG_SMP > + /* > + * When a task is added, its util_est could be high but the > + * enqueue might not have caused rq->cfs.decayed to be updated > + * as it is small after a long sleep. So set it to ensure next > + * context switch will definitely trigger an update after the > + * new enqueue. > + * > + * TODO: we need to make cpufreq_update_util() return true if > + * the operation was successful or false if it failed and use > + * that to reset rq->cfs.decayed. > + */ > + rq->cfs.decayed = true; > +#endif > + cpufreq_update_util(rq, SCHED_CPUFREQ_TASK_ENQUEUED); > + } > return; > > preempt: > @@ -9352,10 +9338,6 @@ static bool __update_blocked_others(struct rq *rq, bool *done) > unsigned long hw_pressure; > bool decayed; > > - /* > - * update_load_avg() can call cpufreq_update_util(). Make sure that RT, > - * DL and IRQ signals have been updated before updating CFS. 
> - */ > curr_class = rq->curr->sched_class; > > hw_pressure = arch_scale_hw_pressure(cpu_of(rq)); > @@ -12692,6 +12674,15 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued) > update_misfit_status(curr, rq); > check_update_overutilized_status(task_rq(curr)); > > +#ifdef CONFIG_SMP > + if (rq->cfs.decayed) { > + rq->cfs.decayed = false; > + cpufreq_update_util(rq, 0); > + } > +#else > + cpufreq_update_util(rq, 0); > +#endif > + > task_tick_core(rq, curr); > } > > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c > index 63e49c8ffc4d..92ed373e5b90 100644 > --- a/kernel/sched/rt.c > +++ b/kernel/sched/rt.c > @@ -555,11 +555,8 @@ static void sched_rt_rq_dequeue(struct rt_rq *rt_rq) > > rt_se = rt_rq->tg->rt_se[cpu]; > > - if (!rt_se) { > + if (!rt_se) > dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running); > - /* Kick cpufreq (see the comment in kernel/sched/sched.h). */ > - cpufreq_update_util(rq_of_rt_rq(rt_rq), 0); > - } > else if (on_rt_rq(rt_se)) > dequeue_rt_entity(rt_se, 0); > } > @@ -1064,9 +1061,6 @@ enqueue_top_rt_rq(struct rt_rq *rt_rq) > add_nr_running(rq, rt_rq->rt_nr_running); > rt_rq->rt_queued = 1; > } > - > - /* Kick cpufreq (see the comment in kernel/sched/sched.h). */ > - cpufreq_update_util(rq, 0); > } > > #if defined CONFIG_SMP > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 4c36cc680361..1fc9339dd5c7 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -639,6 +639,11 @@ struct cfs_rq { > unsigned long runnable_avg; > } removed; > > + /* > + * Store whether last update_load_avg() has decayed > + */ > + bool decayed; > + > #ifdef CONFIG_FAIR_GROUP_SCHED > u64 last_update_tg_load_avg; > unsigned long tg_load_avg_contrib; > @@ -3609,10 +3614,12 @@ extern void set_load_weight(struct task_struct *p, bool update_load); > extern void enqueue_task(struct rq *rq, struct task_struct *p, int flags); > extern void dequeue_task(struct rq *rq, struct task_struct *p, int flags); > > -extern void check_class_changed(struct rq *rq, struct task_struct *p, > +extern bool check_class_changed(struct rq *rq, struct task_struct *p, > const struct sched_class *prev_class, > int oldprio); > > +extern void update_cpufreq_current(struct rq *rq); > + > #ifdef CONFIG_SMP > extern struct balance_callback *splice_balance_callbacks(struct rq *rq); > extern void balance_callbacks(struct rq *rq, struct balance_callback *head); > diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c > index ae1b42775ef9..c24769cf1a4f 100644 > --- a/kernel/sched/syscalls.c > +++ b/kernel/sched/syscalls.c > @@ -491,7 +491,7 @@ static bool uclamp_reset(const struct sched_attr *attr, > return false; > } > > -static void __setscheduler_uclamp(struct task_struct *p, > +static bool __setscheduler_uclamp(struct task_struct *p, > const struct sched_attr *attr) > { > enum uclamp_id clamp_id; > @@ -517,7 +517,7 @@ static void __setscheduler_uclamp(struct task_struct *p, > } > > if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP))) > - return; > + return false; > > if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN && > attr->sched_util_min != -1) { > @@ -530,6 +530,8 @@ static void __setscheduler_uclamp(struct task_struct *p, > uclamp_se_set(&p->uclamp_req[UCLAMP_MAX], > attr->sched_util_max, true); > } > + > + return true; > } > > #else /* !CONFIG_UCLAMP_TASK: */ > @@ -539,8 +541,11 @@ static inline int uclamp_validate(struct task_struct *p, > { > return -EOPNOTSUPP; > } > -static void __setscheduler_uclamp(struct task_struct *p, > - const struct sched_attr 
*attr) { } > +static bool __setscheduler_uclamp(struct task_struct *p, > + const struct sched_attr *attr) > +{ > + return false; > +} > #endif > > /* > @@ -614,6 +619,7 @@ int __sched_setscheduler(struct task_struct *p, > int retval, oldprio, newprio, queued, running; > const struct sched_class *prev_class; > struct balance_callback *head; > + bool update_cpufreq; > struct rq_flags rf; > int reset_on_fork; > int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; > @@ -796,7 +802,8 @@ int __sched_setscheduler(struct task_struct *p, > __setscheduler_params(p, attr); > __setscheduler_prio(p, newprio); > } > - __setscheduler_uclamp(p, attr); > + > + update_cpufreq = __setscheduler_uclamp(p, attr); > > if (queued) { > /* > @@ -811,7 +818,18 @@ int __sched_setscheduler(struct task_struct *p, > if (running) > set_next_task(rq, p); > > - check_class_changed(rq, p, prev_class, oldprio); > + update_cpufreq |= check_class_changed(rq, p, prev_class, oldprio); > + > + /* > + * Changing class or uclamp value implies requiring to send cpufreq > + * update. > + */ > + if (update_cpufreq) { > + if (running) > + update_cpufreq_current(rq); > + else if (queued) > + cpufreq_update_util(rq, SCHED_CPUFREQ_TASK_ENQUEUED); > + } cpufreq_update_util() -> sugov_should_update_freq() -> cpufreq_this_cpu_can_update() relies on smp_processor_id(), should this move this below the preempt_disable() to avoid sending an update from an illegal CPU? > > /* Avoid rq from going away on us: */ > preempt_disable();
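To make the concern above concrete, here is a rough sketch of the reordering being suggested (my illustration only, untested, and based solely on the __sched_setscheduler() hunk quoted above): issue the cpufreq update after preempt_disable(), so that the smp_processor_id()-based check inside cpufreq_this_cpu_can_update() runs with preemption already disabled, as the question suggests:

        update_cpufreq |= check_class_changed(rq, p, prev_class, oldprio);

        /* Avoid rq from going away on us: */
        preempt_disable();

        /*
         * Changing class or uclamp value implies requiring to send cpufreq
         * update. Done after preempt_disable() per the review comment above.
         */
        if (update_cpufreq) {
                if (running)
                        update_cpufreq_current(rq);
                else if (queued)
                        cpufreq_update_util(rq, SCHED_CPUFREQ_TASK_ENQUEUED);
        }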
On 08/10/24 15:26, Christian Loehle wrote:
> The default CPUFREQ_DBS_MIN_SAMPLING_INTERVAL is still to have 2 ticks between
> cpufreq updates on conservative/ondemand.
> What is your sampling_rate setting? What's your HZ?

The sampling_rate setting is 8000 us. CONFIG_HZ is set to 250 Hz.

> Interestingly the context switch heavy benchmarks still show -6% don't they?

Yes, stress-ng and Unixbench Pipebased Context Switching benchmarks showed 6%
regression. There was a high run-to-run variation in stress-ng and the
Unixbench Pipebased Context Switching benchmarks of 15% and 5% respectively.
This led me to doubt those results and so I re-ran these two benchmarks. Each
run below is an average of 10 iterations of the benchmarks. The results are as
follows:

+--------------------------------------------------------+-----------------+------------------+----------+------------+--------------+
| Benchmark                                               | Baseline        | Baseline + patch | Baseline | Baseline + | Throughput   |
|                                                         | (6.10.0-rc1 tip | avg throughput   | stdev %  | patch      | Difference % |
|                                                         | sched/core)     |                  |          | stdev %    |              |
|                                                         | avg throughput  |                  |          |            |              |
+--------------------------------------------------------+-----------------+------------------+----------+------------+--------------+
| Unixbench Pipebased Context Switching throughput (lps)  | 1               | 1.02             | 6.48     | 10.29      | 2.18         |
|                                                         | 1               | 1.19             | 13.74    | 8.22       | 19.20        |
|                                                         | 1               | 0.87             | 11.27    | 8.12       | -13.24       |
|                                                         |                 |                  |          |            |              |
| stressng (bogo ops)                                     | 1               | 1.01             | 2.68     | 1.90       | 1.35         |
|                                                         | 1               | 0.98             | 2.29     | 4.26       | -2.03        |
|                                                         | 1               | 0.99             | 2.01     | 2.24       | -0.56        |
+--------------------------------------------------------+-----------------+------------------+----------+------------+--------------+

There is a very high run-to-run variation in the Unixbench Pipebased Context
Switching benchmark and we can't conclude anything from this benchmark. There
is no regression in stress-ng on applying this patch on this system.

> Do you mind trying schedutil with a reasonable rate_limit_us, too?

I think the schedutil governor is not working on my system because the cpu
frequency shoots to the maximum (3.9GHz) even when the system is only 10%
loaded. I ran stress-ng --cpu `nproc` --cpu-load 10. The mpstat command shows
that the system is 10% loaded:

10:55:25 AM  CPU   %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
10:56:50 AM  all  10.03   0.00  0.02     0.00  0.18   0.00    0.00    0.00    0.00  89.76

But cpupower frequency-info showed that the system is at max frequency:

root@ltczz10:~# cpupower frequency-info
<snipped>
  available cpufreq governors: conservative ondemand performance schedutil
  current policy: frequency should be within 2.30 GHz and 3.90 GHz.
                  The governor "schedutil" may decide which speed to use
                  within this range.
  current CPU frequency: 3.90 GHz (asserted by call to hardware)
<snipped>

This is not expected, right? I will work on finding out why the schedutil
governor is not working on this system and get back.

Thank you for your response,
Anjali K
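As an aside (my addition, not part of the thread): a minimal poller like the one below, assuming the standard cpufreq sysfs layout and that policy0 is the policy being watched, can help confirm which governor is active and what rate_limit_us is in effect while the 10% load runs. The schedutil/rate_limit_us file is only present while schedutil is the active governor.

/* gcc -O2 -o freqwatch freqwatch.c */
#include <stdio.h>
#include <unistd.h>

static void show(const char *path)
{
        char buf[128];
        FILE *f = fopen(path, "r");

        if (!f) {
                printf("%-60s <unavailable>\n", path);
                return;
        }
        if (fgets(buf, sizeof(buf), f))
                printf("%-60s %s", path, buf);
        fclose(f);
}

int main(void)
{
        const char *base = "/sys/devices/system/cpu/cpufreq/policy0";
        char path[256];
        int i;

        for (i = 0; i < 10; i++) {
                snprintf(path, sizeof(path), "%s/scaling_governor", base);
                show(path);
                snprintf(path, sizeof(path), "%s/scaling_cur_freq", base);
                show(path);
                snprintf(path, sizeof(path), "%s/schedutil/rate_limit_us", base);
                show(path);
                sleep(1);
        }
        return 0;
}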
diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h index bdd31ab93bc5..5409a9f79cc0 100644 --- a/include/linux/sched/cpufreq.h +++ b/include/linux/sched/cpufreq.h @@ -8,7 +8,9 @@ * Interface between cpufreq drivers and the scheduler: */ -#define SCHED_CPUFREQ_IOWAIT (1U << 0) +#define SCHED_CPUFREQ_IOWAIT (1U << 0) +#define SCHED_CPUFREQ_FORCE_UPDATE (1U << 1) /* ignore transition_delay_us */ +#define SCHED_CPUFREQ_TASK_ENQUEUED (1U << 2) /* new fair task was enqueued */ #ifdef CONFIG_CPU_FREQ struct cpufreq_policy; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 6d35c48239be..a31d91a224d0 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -153,6 +153,9 @@ const_debug unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK; __read_mostly int scheduler_running; +static __always_inline void +__update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev); + #ifdef CONFIG_SCHED_CORE DEFINE_STATIC_KEY_FALSE(__sched_core_enabled); @@ -2038,17 +2041,24 @@ inline int task_curr(const struct task_struct *p) * this means any call to check_class_changed() must be followed by a call to * balance_callback(). */ -void check_class_changed(struct rq *rq, struct task_struct *p, +bool check_class_changed(struct rq *rq, struct task_struct *p, const struct sched_class *prev_class, int oldprio) { + bool class_changed = false; + if (prev_class != p->sched_class) { if (prev_class->switched_from) prev_class->switched_from(rq, p); p->sched_class->switched_to(rq, p); - } else if (oldprio != p->prio || dl_task(p)) + + class_changed = true; + } else if (oldprio != p->prio || dl_task(p)) { p->sched_class->prio_changed(rq, p, oldprio); + } + + return class_changed; } void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags) @@ -4913,6 +4923,93 @@ static inline void __balance_callbacks(struct rq *rq) #endif +static __always_inline void +__update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev) +{ +#ifdef CONFIG_CPU_FREQ + if (prev && prev->dl.flags & SCHED_FLAG_SUGOV) { + /* Sugov just did an update, don't be too aggressive */ + return; + } + + /* + * RT and DL should always send a freq update. But we can do some + * simple checks to avoid it when we know it's not necessary. + * + * iowait_boost will always trigger a freq update too. + * + * Fair tasks will only trigger an update if the root cfs_rq has + * decayed. + * + * Everything else should do nothing. + */ + switch (current->policy) { + case SCHED_NORMAL: + case SCHED_BATCH: + case SCHED_IDLE: + if (unlikely(current->in_iowait)) { + cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE); + return; + } + +#ifdef CONFIG_SMP + /* + * Send an update if we switched from RT or DL as they tend to + * boost the CPU and we are likely able to reduce the freq now. 
+ */ + rq->cfs.decayed |= prev && (rt_policy(prev->policy) || dl_policy(prev->policy)); + + if (unlikely(rq->cfs.decayed)) { + rq->cfs.decayed = false; + cpufreq_update_util(rq, 0); + return; + } +#else + cpufreq_update_util(rq, 0); +#endif + return; + case SCHED_FIFO: + case SCHED_RR: + if (prev && rt_policy(prev->policy)) { +#ifdef CONFIG_UCLAMP_TASK + unsigned long curr_uclamp_min = uclamp_eff_value(current, UCLAMP_MIN); + unsigned long prev_uclamp_min = uclamp_eff_value(prev, UCLAMP_MIN); + + if (curr_uclamp_min == prev_uclamp_min) +#endif + return; + } +#ifdef CONFIG_SMP + /* Stopper task masquerades as RT */ + if (unlikely(current->sched_class == &stop_sched_class)) + return; +#endif + cpufreq_update_util(rq, SCHED_CPUFREQ_FORCE_UPDATE); + return; + case SCHED_DEADLINE: + /* + * This is handled at enqueue to avoid breaking DL bandwidth + * rules when multiple DL tasks are running on the same CPU. + * Deferring till context switch here could mean the bandwidth + * calculations would be broken to ensure all the DL tasks meet + * their deadlines. + */ + return; + default: + return; + } +#endif +} + +/* + * Call when currently running task had an attribute change that requires + * an immediate cpufreq update. + */ +void update_cpufreq_current(struct rq *rq) +{ + __update_cpufreq_ctx_switch(rq, NULL); +} + static inline void prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf) { @@ -4930,7 +5027,7 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf #endif } -static inline void finish_lock_switch(struct rq *rq) +static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev) { /* * If we are tracking spinlock dependencies then we have to @@ -4939,6 +5036,11 @@ static inline void finish_lock_switch(struct rq *rq) */ spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_); __balance_callbacks(rq); + /* + * Request freq update after __balance_callbacks to take into account + * any changes to rq. 
+ */ + __update_cpufreq_ctx_switch(rq, prev); raw_spin_rq_unlock_irq(rq); } @@ -5057,7 +5159,7 @@ static struct rq *finish_task_switch(struct task_struct *prev) perf_event_task_sched_in(prev, current); finish_task(prev); tick_nohz_task_switch(); - finish_lock_switch(rq); + finish_lock_switch(rq, prev); finish_arch_post_lock_switch(); kcov_finish_switch(current); /* @@ -6920,6 +7022,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task) int prio, oldprio, queued, running, queue_flag = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; const struct sched_class *prev_class; + bool class_changed; struct rq_flags rf; struct rq *rq; @@ -7021,7 +7124,10 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task) if (running) set_next_task(rq, p); - check_class_changed(rq, p, prev_class, oldprio); + class_changed = check_class_changed(rq, p, prev_class, oldprio); + if (class_changed && running) + update_cpufreq_current(rq); + out_unlock: /* Avoid rq from going away on us: */ preempt_disable(); diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index eece6244f9d2..64f614b3db20 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -44,7 +44,7 @@ struct sugov_cpu { bool iowait_boost_pending; unsigned int iowait_boost; - u64 last_update; + u64 last_iowait_update; unsigned long util; unsigned long bw_min; @@ -59,10 +59,31 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); /************************ Governor internals ***********************/ -static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time) +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time, + unsigned int flags) { s64 delta_ns; + delta_ns = time - sg_policy->last_freq_update_time; + + /* + * We want to update cpufreq at context switch, but on systems with + * long TICK values, this can happen after a long time while more tasks + * would have been added meanwhile leaving us potentially running at + * inadequate frequency for extended period of time. + * + * This logic should only apply when new fair task was added to the + * CPU, we'd want to defer to context switch as much as possible, but + * to avoid the potential delays mentioned above, let's check if this + * additional tasks warrants sending an update sooner. + * + * We want to ensure there's at least an update every 1ms. + */ + if (likely(flags & SCHED_CPUFREQ_TASK_ENQUEUED)) { + if (delta_ns < NSEC_PER_MSEC) + return false; + } + /* * Since cpufreq_update_util() is called with rq->lock held for * the @target_cpu, our per-CPU data is fully serialized. 
@@ -87,13 +108,14 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time) return true; } - delta_ns = time - sg_policy->last_freq_update_time; + if (unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE)) + return true; return delta_ns >= sg_policy->freq_update_delay_ns; } static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time, - unsigned int next_freq) + unsigned int next_freq, unsigned int flags) { if (sg_policy->need_freq_update) sg_policy->need_freq_update = cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS); @@ -101,7 +123,9 @@ static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time, return false; sg_policy->next_freq = next_freq; - sg_policy->last_freq_update_time = time; + + if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE)) + sg_policy->last_freq_update_time = time; return true; } @@ -219,7 +243,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost) static bool sugov_iowait_reset(struct sugov_cpu *sg_cpu, u64 time, bool set_iowait_boost) { - s64 delta_ns = time - sg_cpu->last_update; + s64 delta_ns = time - sg_cpu->last_iowait_update; /* Reset boost only if a tick has elapsed since last request */ if (delta_ns <= TICK_NSEC) @@ -249,30 +273,33 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, unsigned int flags) { bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT; + bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE; /* Reset boost if the CPU appears to have been idle enough */ - if (sg_cpu->iowait_boost && + if (sg_cpu->iowait_boost && !forced_update && sugov_iowait_reset(sg_cpu, time, set_iowait_boost)) - return; + goto done; /* Boost only tasks waking up after IO */ if (!set_iowait_boost) - return; + goto done; /* Ensure boost doubles only one time at each request */ if (sg_cpu->iowait_boost_pending) - return; + goto done; sg_cpu->iowait_boost_pending = true; /* Double the boost at each request */ if (sg_cpu->iowait_boost) { sg_cpu->iowait_boost = min_t(unsigned int, sg_cpu->iowait_boost << 1, SCHED_CAPACITY_SCALE); - return; + goto done; } /* First wakeup after IO: start with minimum boost */ sg_cpu->iowait_boost = IOWAIT_BOOST_MIN; +done: + sg_cpu->last_iowait_update = time; } /** @@ -294,17 +321,34 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, * being more conservative on tasks which does sporadic IO operations. */ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time, - unsigned long max_cap) + unsigned long max_cap, unsigned int flags) { + bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE; + s64 delta_ns = time - sg_cpu->last_iowait_update; + /* No boost currently required */ if (!sg_cpu->iowait_boost) return 0; + if (forced_update) + goto apply_boost; + /* Reset boost if the CPU appears to have been idle enough */ if (sugov_iowait_reset(sg_cpu, time, false)) return 0; if (!sg_cpu->iowait_boost_pending) { + /* + * This logic relied on PELT signal decays happening once every + * 1ms. But due to changes to how updates are done now, we can + * end up with more request coming up leading to iowait boost + * to be prematurely reduced. Make the assumption explicit + * until we improve the iowait boost logic to be better in + * general as it is due for an overhaul. + */ + if (delta_ns < NSEC_PER_MSEC) + goto apply_boost; + /* * No boost pending; reduce the boost value. 
*/ @@ -315,6 +359,7 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time, } } +apply_boost: sg_cpu->iowait_boost_pending = false; /* @@ -337,31 +382,18 @@ static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; } #endif /* CONFIG_NO_HZ_COMMON */ -/* - * Make sugov_should_update_freq() ignore the rate limit when DL - * has increased the utilization. - */ -static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu) -{ - if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_min) - sg_cpu->sg_policy->limits_changed = true; -} - static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu, u64 time, unsigned long max_cap, unsigned int flags) { unsigned long boost; - sugov_iowait_boost(sg_cpu, time, flags); - sg_cpu->last_update = time; - - ignore_dl_rate_limit(sg_cpu); - - if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) + if (!sugov_should_update_freq(sg_cpu->sg_policy, time, flags)) return false; - boost = sugov_iowait_apply(sg_cpu, time, max_cap); + sugov_iowait_boost(sg_cpu, time, flags); + + boost = sugov_iowait_apply(sg_cpu, time, max_cap, flags); sugov_get_util(sg_cpu, boost); return true; @@ -397,7 +429,7 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time, sg_policy->cached_raw_freq = cached_freq; } - if (!sugov_update_next_freq(sg_policy, time, next_f)) + if (!sugov_update_next_freq(sg_policy, time, next_f, flags)) return; /* @@ -449,10 +481,12 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time, cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min, sg_cpu->util, max_cap); - sg_cpu->sg_policy->last_freq_update_time = time; + if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE)) + sg_cpu->sg_policy->last_freq_update_time = time; } -static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time) +static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time, + unsigned int flags) { struct sugov_policy *sg_policy = sg_cpu->sg_policy; struct cpufreq_policy *policy = sg_policy->policy; @@ -465,7 +499,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time) struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j); unsigned long boost; - boost = sugov_iowait_apply(j_sg_cpu, time, max_cap); + boost = sugov_iowait_apply(j_sg_cpu, time, max_cap, flags); sugov_get_util(j_sg_cpu, boost); util = max(j_sg_cpu->util, util); @@ -483,22 +517,20 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags) raw_spin_lock(&sg_policy->update_lock); - sugov_iowait_boost(sg_cpu, time, flags); - sg_cpu->last_update = time; + if (!sugov_should_update_freq(sg_policy, time, flags)) + goto unlock; - ignore_dl_rate_limit(sg_cpu); + sugov_iowait_boost(sg_cpu, time, flags); - if (sugov_should_update_freq(sg_policy, time)) { - next_f = sugov_next_freq_shared(sg_cpu, time); + next_f = sugov_next_freq_shared(sg_cpu, time, flags); - if (!sugov_update_next_freq(sg_policy, time, next_f)) - goto unlock; + if (!sugov_update_next_freq(sg_policy, time, next_f, flags)) + goto unlock; - if (sg_policy->policy->fast_switch_enabled) - cpufreq_driver_fast_switch(sg_policy->policy, next_f); - else - sugov_deferred_update(sg_policy); - } + if (sg_policy->policy->fast_switch_enabled) + cpufreq_driver_fast_switch(sg_policy->policy, next_f); + else + sugov_deferred_update(sg_policy); unlock: raw_spin_unlock(&sg_policy->update_lock); } diff --git a/kernel/sched/deadline.c 
b/kernel/sched/deadline.c index f59e5c19d944..8a4ccf532a7b 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -251,8 +251,12 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq) dl_rq->running_bw += dl_bw; SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */ SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw); - /* kick cpufreq (see the comment in kernel/sched/sched.h). */ - cpufreq_update_util(rq_of_dl_rq(dl_rq), 0); + /* + * Context switch handles updates, but this is an exception to ensure + * multiple DL tasks run at the correct frequencies. We don't need + * a cpufreq update on dequeue, context switch will handle that. + */ + cpufreq_update_util(rq_of_dl_rq(dl_rq), SCHED_CPUFREQ_FORCE_UPDATE); } static inline @@ -265,8 +269,6 @@ void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq) SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */ if (dl_rq->running_bw > old) dl_rq->running_bw = 0; - /* kick cpufreq (see the comment in kernel/sched/sched.h). */ - cpufreq_update_util(rq_of_dl_rq(dl_rq), 0); } static inline diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9057584ec06d..8fe7a7124c70 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3987,29 +3987,6 @@ static inline void update_cfs_group(struct sched_entity *se) } #endif /* CONFIG_FAIR_GROUP_SCHED */ -static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags) -{ - struct rq *rq = rq_of(cfs_rq); - - if (&rq->cfs == cfs_rq) { - /* - * There are a few boundary cases this might miss but it should - * get called often enough that that should (hopefully) not be - * a real problem. - * - * It will not get called when we go idle, because the idle - * thread is a different class (!fair), nor will the utilization - * number include things like RT tasks. - * - * As is, the util number is not freq-invariant (we'd have to - * implement arch_scale_freq_capacity() for that). - * - * See cpu_util_cfs(). - */ - cpufreq_update_util(rq, flags); - } -} - #ifdef CONFIG_SMP static inline bool load_avg_is_decayed(struct sched_avg *sa) { @@ -4687,8 +4664,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s add_tg_cfs_propagate(cfs_rq, se->avg.load_sum); - cfs_rq_util_change(cfs_rq, 0); - trace_pelt_cfs_tp(cfs_rq); } @@ -4717,8 +4692,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum); - cfs_rq_util_change(cfs_rq, 0); - trace_pelt_cfs_tp(cfs_rq); } @@ -4765,12 +4738,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s */ detach_entity_load_avg(cfs_rq, se); update_tg_load_avg(cfs_rq); - } else if (decayed) { - cfs_rq_util_change(cfs_rq, 0); - - if (flags & UPDATE_TG) - update_tg_load_avg(cfs_rq); + } else if (decayed && (flags & UPDATE_TG)) { + update_tg_load_avg(cfs_rq); } + + /* + * If this is the root cfs_rq, set the decayed flag to let the world + * know a cpufreq update is required. 
+ */ + if (cfs_rq == &rq_of(cfs_rq)->cfs) + cfs_rq->decayed |= decayed; } /* @@ -5144,7 +5121,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1) { - cfs_rq_util_change(cfs_rq, 0); } static inline void remove_entity_load_avg(struct sched_entity *se) {} @@ -6759,14 +6735,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) */ util_est_enqueue(&rq->cfs, p); - /* - * If in_iowait is set, the code below may not trigger any cpufreq - * utilization updates, so do it here explicitly with the IOWAIT flag - * passed. - */ - if (p->in_iowait) - cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT); - for_each_sched_entity(se) { if (se->on_rq) break; @@ -8353,7 +8321,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int int cse_is_idle, pse_is_idle; if (unlikely(se == pse)) - return; + goto nopreempt; /* * This is possible from callers such as attach_tasks(), in which we @@ -8362,7 +8330,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int * next-buddy nomination below. */ if (unlikely(throttled_hierarchy(cfs_rq_of(pse)))) - return; + goto nopreempt; if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK)) { set_next_buddy(pse); @@ -8379,7 +8347,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int * below. */ if (test_tsk_need_resched(curr)) - return; + goto nopreempt; /* Idle tasks are by definition preempted by non-idle tasks. */ if (unlikely(task_has_idle_policy(curr)) && @@ -8391,7 +8359,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int * is driven by the tick): */ if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION)) - return; + goto nopreempt; find_matching_se(&se, &pse); WARN_ON_ONCE(!pse); @@ -8406,7 +8374,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int if (cse_is_idle && !pse_is_idle) goto preempt; if (cse_is_idle != pse_is_idle) - return; + goto nopreempt; cfs_rq = cfs_rq_of(se); update_curr(cfs_rq); @@ -8417,6 +8385,24 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int if (pick_eevdf(cfs_rq) == pse) goto preempt; +nopreempt: + if (rq->cfs.h_nr_running > 1) { +#ifdef CONFIG_SMP + /* + * When a task is added, its util_est could be high but the + * enqueue might not have caused rq->cfs.decayed to be updated + * as it is small after a long sleep. So set it to ensure next + * context switch will definitely trigger an update after the + * new enqueue. + * + * TODO: we need to make cpufreq_update_util() return true if + * the operation was successful or false if it failed and use + * that to reset rq->cfs.decayed. + */ + rq->cfs.decayed = true; +#endif + cpufreq_update_util(rq, SCHED_CPUFREQ_TASK_ENQUEUED); + } return; preempt: @@ -9352,10 +9338,6 @@ static bool __update_blocked_others(struct rq *rq, bool *done) unsigned long hw_pressure; bool decayed; - /* - * update_load_avg() can call cpufreq_update_util(). Make sure that RT, - * DL and IRQ signals have been updated before updating CFS. 
- */ curr_class = rq->curr->sched_class; hw_pressure = arch_scale_hw_pressure(cpu_of(rq)); @@ -12692,6 +12674,15 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued) update_misfit_status(curr, rq); check_update_overutilized_status(task_rq(curr)); +#ifdef CONFIG_SMP + if (rq->cfs.decayed) { + rq->cfs.decayed = false; + cpufreq_update_util(rq, 0); + } +#else + cpufreq_update_util(rq, 0); +#endif + task_tick_core(rq, curr); } diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 63e49c8ffc4d..92ed373e5b90 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -555,11 +555,8 @@ static void sched_rt_rq_dequeue(struct rt_rq *rt_rq) rt_se = rt_rq->tg->rt_se[cpu]; - if (!rt_se) { + if (!rt_se) dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running); - /* Kick cpufreq (see the comment in kernel/sched/sched.h). */ - cpufreq_update_util(rq_of_rt_rq(rt_rq), 0); - } else if (on_rt_rq(rt_se)) dequeue_rt_entity(rt_se, 0); } @@ -1064,9 +1061,6 @@ enqueue_top_rt_rq(struct rt_rq *rt_rq) add_nr_running(rq, rt_rq->rt_nr_running); rt_rq->rt_queued = 1; } - - /* Kick cpufreq (see the comment in kernel/sched/sched.h). */ - cpufreq_update_util(rq, 0); } #if defined CONFIG_SMP diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 4c36cc680361..1fc9339dd5c7 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -639,6 +639,11 @@ struct cfs_rq { unsigned long runnable_avg; } removed; + /* + * Store whether last update_load_avg() has decayed + */ + bool decayed; + #ifdef CONFIG_FAIR_GROUP_SCHED u64 last_update_tg_load_avg; unsigned long tg_load_avg_contrib; @@ -3609,10 +3614,12 @@ extern void set_load_weight(struct task_struct *p, bool update_load); extern void enqueue_task(struct rq *rq, struct task_struct *p, int flags); extern void dequeue_task(struct rq *rq, struct task_struct *p, int flags); -extern void check_class_changed(struct rq *rq, struct task_struct *p, +extern bool check_class_changed(struct rq *rq, struct task_struct *p, const struct sched_class *prev_class, int oldprio); +extern void update_cpufreq_current(struct rq *rq); + #ifdef CONFIG_SMP extern struct balance_callback *splice_balance_callbacks(struct rq *rq); extern void balance_callbacks(struct rq *rq, struct balance_callback *head); diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c index ae1b42775ef9..c24769cf1a4f 100644 --- a/kernel/sched/syscalls.c +++ b/kernel/sched/syscalls.c @@ -491,7 +491,7 @@ static bool uclamp_reset(const struct sched_attr *attr, return false; } -static void __setscheduler_uclamp(struct task_struct *p, +static bool __setscheduler_uclamp(struct task_struct *p, const struct sched_attr *attr) { enum uclamp_id clamp_id; @@ -517,7 +517,7 @@ static void __setscheduler_uclamp(struct task_struct *p, } if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP))) - return; + return false; if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN && attr->sched_util_min != -1) { @@ -530,6 +530,8 @@ static void __setscheduler_uclamp(struct task_struct *p, uclamp_se_set(&p->uclamp_req[UCLAMP_MAX], attr->sched_util_max, true); } + + return true; } #else /* !CONFIG_UCLAMP_TASK: */ @@ -539,8 +541,11 @@ static inline int uclamp_validate(struct task_struct *p, { return -EOPNOTSUPP; } -static void __setscheduler_uclamp(struct task_struct *p, - const struct sched_attr *attr) { } +static bool __setscheduler_uclamp(struct task_struct *p, + const struct sched_attr *attr) +{ + return false; +} #endif /* @@ -614,6 +619,7 @@ int __sched_setscheduler(struct task_struct *p, int retval, oldprio, 
newprio, queued, running; const struct sched_class *prev_class; struct balance_callback *head; + bool update_cpufreq; struct rq_flags rf; int reset_on_fork; int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; @@ -796,7 +802,8 @@ int __sched_setscheduler(struct task_struct *p, __setscheduler_params(p, attr); __setscheduler_prio(p, newprio); } - __setscheduler_uclamp(p, attr); + + update_cpufreq = __setscheduler_uclamp(p, attr); if (queued) { /* @@ -811,7 +818,18 @@ int __sched_setscheduler(struct task_struct *p, if (running) set_next_task(rq, p); - check_class_changed(rq, p, prev_class, oldprio); + update_cpufreq |= check_class_changed(rq, p, prev_class, oldprio); + + /* + * Changing class or uclamp value implies requiring to send cpufreq + * update. + */ + if (update_cpufreq) { + if (running) + update_cpufreq_current(rq); + else if (queued) + cpufreq_update_util(rq, SCHED_CPUFREQ_TASK_ENQUEUED); + } /* Avoid rq from going away on us: */ preempt_disable();
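The kernel/sched/syscalls.c hunk above is the path taken when userspace changes a task's policy or uclamp values. As a purely illustrative exerciser (my sketch, not part of the series; the struct sched_attr layout and SCHED_FLAG_UTIL_CLAMP_* values are copied from the uapi headers, and CONFIG_UCLAMP_TASK must be enabled or the call returns EOPNOTSUPP), the following raises the calling task's uclamp_min, which with this patch should lead to an immediate cpufreq update for the running task via update_cpufreq_current() rather than waiting for the next context switch or tick:

/* gcc -O2 -o uclamp_bump uclamp_bump.c */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SCHED_FLAG_UTIL_CLAMP_MIN       0x20    /* from include/uapi/linux/sched.h */
#define SCHED_FLAG_UTIL_CLAMP_MAX       0x40

/* Mirrors include/uapi/linux/sched/types.h */
struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
        uint32_t sched_util_min;
        uint32_t sched_util_max;
};

int main(void)
{
        struct sched_attr attr;
        volatile unsigned long spin;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.sched_policy = 0;          /* SCHED_NORMAL, stay in CFS */
        attr.sched_flags = SCHED_FLAG_UTIL_CLAMP_MIN | SCHED_FLAG_UTIL_CLAMP_MAX;
        attr.sched_util_min = 512;      /* request at least ~half capacity */
        attr.sched_util_max = 1024;

        if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
                perror("sched_setattr");
                return 1;
        }

        /* Busy loop so the resulting frequency can be observed via sysfs. */
        for (spin = 0; spin < (1UL << 28); spin++)
                ;

        return 0;
}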
Improve the interaction with cpufreq governors by making the
cpufreq_update_util() calls more intentional.

At the moment we send them when load is updated for CFS, bandwidth for DL and
at enqueue/dequeue for RT. But this can lead to too many updates sent in a
short period of time and potentially be ignored at a critical moment due to
the rate_limit_us in schedutil. For example, on simultaneous task enqueue on
the CPU where the 2nd task is bigger and requires a higher freq, the trigger
to cpufreq_update_util() by the first task will lead to dropping the 2nd
request until tick. Or another CPU in the same policy triggers a freq update
shortly after.

Updates at enqueue for RT are not strictly required. Though they do help to
reduce the delay for switching the frequency and the potential observation of
a lower frequency during this delay. But the current logic doesn't
intentionally (at least to my understanding) try to speed up the request.

To help reduce the amount of cpufreq updates and make them more purposeful,
consolidate them into these locations:

1. context_switch()
2. task_tick_fair()
3. sched_balance_update_blocked_averages()
4. on sched_setscheduler() syscall that changes policy or uclamp values
5. on check_preempt_wakeup_fair() if wakeup preemption failed
6. on __add_running_bw() to guarantee DL bandwidth requirements

The update at context switch should help guarantee that RT tasks get the right
frequency straightaway when they're RUNNING. As mentioned, the update will
happen slightly after enqueue_task(); though in an ideal world these tasks
should be RUNNING ASAP and this additional delay should be negligible. For
fair tasks we need to make sure we send a single update for every decay of the
root cfs_rq. Any changes to the rq will be deferred until the next task is
ready to run, or until we hit TICK. But we are guaranteed the task is running
at a level that meets its requirements after enqueue.

To guarantee that updates for RT and DL tasks are never missed, we add a new
SCHED_CPUFREQ_FORCE_UPDATE flag to ignore rate_limit_us. If we are already
running at the right freq, the governor will end up doing nothing, but we
eliminate the risk of the task accidentally running at the wrong freq due to
rate_limit_us.

Similarly for iowait boost, we ignore rate limits. We also handle the case of
a boost being reset prematurely by adding a guard in sugov_iowait_apply() that
keeps the boost from being reduced within 1ms: the iowait boost mechanism
implicitly relied on rate_limit_us and cfs_rq.decayed preventing any updates
from happening soon after an iowait boost.

The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit time
stamps, otherwise we can end up delaying updates for normal requests.

As a simple optimization, we avoid sending cpufreq updates when switching from
RT to another RT task, as RT tasks run at max freq by default. If
CONFIG_UCLAMP_TASK is enabled, we do a simple check to see if uclamp_min
differs; most RT tasks are likely to be running at the same performance level,
so we can avoid the unnecessary overhead of forced updates when there's
nothing to do.

We also ensure to ignore cpufreq updates for sugov workers at context switch
if it was the prev task.

The update at task_tick_fair() will guarantee that the governor will follow
any updates to load for tasks/CPU or due to new enqueues/dequeues to the rq.
Since DL and RT always run at constant frequencies and have no load tracking,
this is only required for fair tasks.

The update at update_blocked_averages() will ensure we decay the frequency as
the CPU becomes idle for long enough.

If the currently running task changes its policy or uclamp values, we follow
up with a cpufreq update to account for any new perf requirements resulting
from the change.

To handle systems with a long TICK where tasks could end up enqueued but no
preemption happens until TICK, we add an update in check_preempt_wakeup_fair()
if wakeup preemption fails. This sends a special SCHED_CPUFREQ_TASK_ENQUEUED
cpufreq update to tell the governor that the state of the CPU has changed and
it can consider an update if it deems it worthwhile. In schedutil this will do
an update if no update was done within the last 1ms, which is roughly how
often util_avg changes.

To ensure DL tasks' bandwidth is respected, we do the update on
__add_running_bw() instead of at context switch, as the delay could result in
missing a deadline when multiple DL tasks are RUNNING. Since DL tasks now
always ignore the rate limit, remove the ignore_dl_rate_limit() function as
it's no longer necessary.

Also move the update of sg_cpu->last_update inside sugov_iowait_boost() where
this variable belongs, and rename it to last_iowait_update to better reflect
that it is iowait boost specific.

Noteworthy: we still have the following race condition on systems that have a
shared policy:

* CPUs with a shared policy can end up sending simultaneous cpufreq update
  requests where the 2nd one will be unlucky and get blocked by rate_limit_us
  (schedutil).

We can potentially address this limitation later, but it is out of the scope
of this patch.

Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
Changes since v6:

* Fix typos in commit message
* Move DL to enqueue to address breaking bandwidth rules for DL
* Do freq updates for SCHED_IDLE too
* Ensure wakeup preemption will cause cpufreq updates even if cfs_rq.decayed
  was false, as util_est could be high and cfs_rq.decayed wouldn't reflect
  that.
* Ensure we send an update if we switch to fair from RT or DL as this is an
  opportunity to reduce freq even if cfs_rq.decayed is false.
* If a sched_setscheduler() syscall for a queued task requires a cpufreq
  update, handle it like we do for check_preempt_wakeup_fair()
* Use 1ms instead of base_slice to send an update if wakeup preemption fails
* Fix a bug where sg_cpu->last_update was updated too early, causing some
  systems to always request 1024 io boost.
* Change delta_ns <= NSEC_PER_MSEC to the strictly-less-than form
  (delta_ns < NSEC_PER_MSEC) for iowait boost to match the condition for when
  a task was enqueued.
* Moved the results of context switch test out of the commit messages as I am seeing some variations that I am not sure are due to binary differences causing weird caching effect or true overhead Results of taskset 1 perf record perf stat --repeat 10 -e cycles,instructions,task-clock perf bench sched pipe on AMD 3900X to verify any potential overhead because of the addition at context switch against sched-core-2024-07-16 tip/sched/core tip sched-core-2024-07-16 schedutil: ------------------------------------ Performance counter stats for '/home/qyousef/utils/perf bench sched pipe' (10 runs): 39,296,424,438 cycles # 3.208 GHz ( +- 0.05% ) 20,350,055,343 instructions # 0.52 insn per cycle ( +- 0.03% ) 12,274.17 msec task-clock # 1.002 CPUs utilized ( +- 0.06% ) 12.24917 +- 0.00783 seconds time elapsed ( +- 0.06% ) tip sched-core-2024-07-16 performance: -------------------------------------- Performance counter stats for '/home/qyousef/utils/perf bench sched pipe' (10 runs): 40,610,243,585 cycles # 3.268 GHz ( +- 0.15% ) 21,252,175,791 instructions # 0.53 insn per cycle ( +- 0.05% ) 12,443.34 msec task-clock # 1.001 CPUs utilized ( +- 0.06% ) 12.42761 +- 0.00672 seconds time elapsed ( +- 0.05% ) patch: tip sched-core-2024-07-16 schedutil: ------------------------------------------- Performance counter stats for '/home/qyousef/utils/perf bench sched pipe' (10 runs): 40,706,113,323 cycles # 3.253 GHz ( +- 0.07% ) 21,163,304,319 instructions # 0.52 insn per cycle ( +- 0.04% ) 12,494.93 msec task-clock # 0.998 CPUs utilized ( +- 0.04% ) 12.51557 +- 0.00486 seconds time elapsed ( +- 0.04% ) patch: tip sched-core-2024-07-16 performance: --------------------------------------------- Performance counter stats for '/home/qyousef/utils/perf bench sched pipe' (10 runs): 39,654,998,545 cycles # 3.220 GHz ( +- 0.12% ) 20,554,376,621 instructions # 0.52 insn per cycle ( +- 0.12% ) 12,317.02 msec task-clock # 1.000 CPUs utilized ( +- 0.16% ) 12.3166 +- 0.0193 seconds time elapsed ( +- 0.16% ) We do better in performance governor than tip/sched/core. But schedutil looks worse. Looking at perf diff I can see update_load_avg() and sugov_update_single_freq() but not sure if this is due to this patch per se rather than strange binary difference creating unexpected effect. The hot instructions in update_load_avg() are not related to the new code added there. Similarly for check_preempt_wakeup_fair(). For sugov_update_single_freq() this hasn't shown up in previous versions. Removing the new cpufreq update in check_preempt_wakeup_fair() didn't help. Note that in v6 same test showed that schedutil was on par but performance was slightly worse. Though the test was against 6.8.7 stable kernel then. 
perf diff schedutil: -------------------- 10.56% -2.56% [kernel.kallsyms] [k] delay_halt_mwaitx 14.56% -1.46% [kernel.kallsyms] [k] native_read_msr 14.19% -1.40% [kernel.kallsyms] [k] native_write_msr 0.63% +0.54% [kernel.kallsyms] [k] restore_fpregs_from_fpstate 1.52% +0.52% [kernel.kallsyms] [k] update_load_avg 0.01% +0.47% [kernel.kallsyms] [k] sugov_update_single_freq 3.44% -0.35% [kernel.kallsyms] [k] amd_pmu_addr_offset 4.67% -0.31% [kernel.kallsyms] [k] x86_pmu_disable_all 0.35% +0.29% [kernel.kallsyms] [k] check_preempt_wakeup_fair 1.81% -0.28% [kernel.kallsyms] [k] amd_pmu_check_overflow 1.81% -0.27% [kernel.kallsyms] [k] amd_pmu_test_overflow_topbit 1.20% +0.26% [kernel.kallsyms] [k] pick_next_task_fair 0.01% +0.22% [kernel.kallsyms] [k] __get_user_8 1.41% +0.21% [kernel.kallsyms] [k] update_curr 1.18% -0.21% [kernel.kallsyms] [k] delay_halt 0.50% +0.21% [kernel.kallsyms] [k] pick_eevdf 3.13% +0.20% [kernel.kallsyms] [k] srso_safe_ret 0.00% +0.18% [kernel.kallsyms] [k] sugov_get_util 1.23% +0.17% [kernel.kallsyms] [k] __schedule 0.50% +0.16% [kernel.kallsyms] [k] enqueue_entity 0.57% +0.16% [kernel.kallsyms] [k] psi_task_change 0.57% +0.15% [kernel.kallsyms] [k] enqueue_task_fair 1.06% -0.15% [kernel.kallsyms] [k] apparmor_file_permission 0.80% +0.15% [kernel.kallsyms] [k] try_to_wake_up 1.07% +0.14% [kernel.kallsyms] [k] psi_task_switch 1.58% +0.14% [kernel.kallsyms] [k] pipe_write 0.86% +0.14% [kernel.kallsyms] [k] syscall_exit_to_user_mode 1.02% +0.13% [kernel.kallsyms] [k] native_sched_clock 0.46% +0.11% [kernel.kallsyms] [k] __update_load_avg_se perf diff performance: ---------------------- 13.09% +3.06% [kernel.kallsyms] [k] native_read_msr 13.12% +2.84% [kernel.kallsyms] [k] native_write_msr 7.94% +2.34% [kernel.kallsyms] [k] delay_halt_mwaitx 2.15% -0.93% [kernel.kallsyms] [k] update_curr 4.42% +0.87% [kernel.kallsyms] [k] x86_pmu_disable_all 3.12% +0.74% [kernel.kallsyms] [k] amd_pmu_addr_offset 2.84% -0.59% [kernel.kallsyms] [k] psi_group_change 1.44% +0.53% [kernel.kallsyms] [k] amd_pmu_check_overflow 1.45% +0.50% [kernel.kallsyms] [k] amd_pmu_test_overflow_topbit 0.47% -0.47% [kernel.kallsyms] [k] __calc_delta.constprop.0 1.60% -0.40% [kernel.kallsyms] [k] pick_next_task_fair 1.97% -0.37% [kernel.kallsyms] [k] update_load_avg 0.57% -0.37% [kernel.kallsyms] [k] avg_vruntime 0.82% -0.37% [kernel.kallsyms] [k] enqueue_task_fair 1.54% -0.34% [kernel.kallsyms] [k] __schedule 0.79% -0.32% [kernel.kallsyms] [k] pick_eevdf 0.88% +0.32% [kernel.kallsyms] [k] delay_halt 0.59% -0.28% [kernel.kallsyms] [k] update_cfs_group 0.86% -0.25% [kernel.kallsyms] [k] try_to_wake_up 1.18% -0.25% [kernel.kallsyms] [k] native_sched_clock 0.45% -0.24% [kernel.kallsyms] [k] put_prev_entity 0.49% -0.24% [kernel.kallsyms] [k] ttwu_do_activate 0.64% -0.23% [kernel.kallsyms] [k] enqueue_entity 0.72% -0.22% [kernel.kallsyms] [k] __update_load_avg_cfs_rq 1.57% -0.22% [kernel.kallsyms] [k] pipe_write 0.50% -0.20% [kernel.kallsyms] [k] update_min_vruntime 3.31% -0.19% [kernel.kallsyms] [k] srso_safe_ret 1.31% -0.18% [kernel.kallsyms] [k] psi_task_switch 0.52% -0.18% [kernel.kallsyms] [k] check_preempt_wakeup_fair 0.32% -0.16% [kernel.kallsyms] [k] __enqueue_entity 0.87% -0.16% [kernel.kallsyms] [k] dequeue_task_fair 0.44% -0.14% [kernel.kallsyms] [k] pick_next_entity 0.63% -0.13% [kernel.kallsyms] [k] psi_task_change 0.62% -0.13% [kernel.kallsyms] [k] sched_clock_cpu Changes since v5: * Fix a bug where switching between RT and sugov tasks triggered an endless cycle of cpufreq updates. 
* Only do cpufreq updates at tick for fair after verifying rq->cfs.decayed * Remove optimization in update_load_avg() to avoid sending an update if util hasn't changed that caused a bug when switching from Idle * Handle systems with long ticks by adding extra update on check_preempt_wakeup_fair(). The idea is to rely on context switch but still consider an update if wakeup preemption failed and no update was sent since sysctl_sched_base_slice * Remove ignore_dl_rate_limit() as this function is now redundant * move sg_cpu->last_update = time inside sugov_iowait_boost() * Update commit message with new details and with perf diff output Changes since v4: * Fix updating freq when uclamp changes before the dequeue/enqueue dance. (Hongyan) * Rebased on top of tip/sched/core 6.10-rc1 and resolve some conflicts due to code shuffling to syscalls.c. Added new function update_cpufreq_current() to be used outside core.c when task_current() requires cpufreq update. Changes since v3: * Omit cpufreq updates at attach/detach_entity_load_avg(). They share the update path from enqueue/dequeue which is not intended to trigger an update. And task_change_group_fair() is not expected to cause the root cfs_rq util to change significantly to warrant an immediate update for enqueued tasks. Better defer for next context switch to sample the state of the cpu taking all changes into account before the next task is due to run. Dietmar also pointed out a bug where we could send more updates vs without the patch in this path as I wasn't sending the update for cfs_rq == &rq->cfs. Changes since v2: * Clean up update_cpufreq_ctx_switch() to reduce branches (Peter) * Fix issue with cpufreq updates missed on switching from idle (Vincent) * perf bench sched pipe regressed after fixing the switch from idle, detect when util_avg has changed when cfs_rq->decayed to fix it * Ensure to issue cpufreq updates when task_current() switches policy/uclamp values Changes since v1: * Use taskset and measure with performance governor as Ingo suggested * Remove the static key as I found out we always register a function for cpu_dbs in cpufreq_governor.c; and as Christian pointed out it trigger a lock debug warning. * Improve detection of sugov workers by using SCHED_FLAG_SUGOV * Guard against NSEC_PER_MSEC instead of TICK_USEC to avoid prematurely reducing iowait boost as the latter was a NOP and like sugov_iowait_reset() like Christian pointed out. v1 discussion: https://lore.kernel.org/all/20240324020139.1032473-1-qyousef@layalina.io/ v2 discussion: https://lore.kernel.org/lkml/20240505233103.168766-1-qyousef@layalina.io/ v3 discussion: https://lore.kernel.org/lkml/20240512190018.531820-1-qyousef@layalina.io/ v4 discussion: https://lore.kernel.org/lkml/20240516204802.846520-1-qyousef@layalina.io/ v5 discussion: https://lore.kernel.org/lkml/20240530104653.1234004-1-qyousef@layalina.io/ v6 discussion: https://lore.kernel.org/lkml/20240619201409.2071728-1-qyousef@layalina.io/ include/linux/sched/cpufreq.h | 4 +- kernel/sched/core.c | 116 +++++++++++++++++++++++++++-- kernel/sched/cpufreq_schedutil.c | 122 +++++++++++++++++++------------ kernel/sched/deadline.c | 10 ++- kernel/sched/fair.c | 91 +++++++++++------------ kernel/sched/rt.c | 8 +- kernel/sched/sched.h | 9 ++- kernel/sched/syscalls.c | 30 ++++++-- 8 files changed, 271 insertions(+), 119 deletions(-)