Message ID: 20210707123402.13999-2-borntraeger@de.ibm.com (mailing list archive)
State: New, archived
Series: Improve yield (was: sched: Move SCHED_DEBUG sysctl to debugfs)
Hi Christian,

I love your patch! Perhaps something to improve:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on v5.13 next-20210707]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Christian-Borntraeger/sched-fair-improve-yield_to-vs-fairness/20210707-213440
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 031e3bd8986fffe31e1ddbf5264cccfe30c9abd7
config: i386-randconfig-s002-20210707 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-341-g8af24329-dirty
        # https://github.com/0day-ci/linux/commit/75196412f9c36f51144f4c333b2b02d57bb0ebde
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Christian-Borntraeger/sched-fair-improve-yield_to-vs-fairness/20210707-213440
        git checkout 75196412f9c36f51144f4c333b2b02d57bb0ebde
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=i386

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

sparse warnings: (new ones prefixed by >>)
   kernel/sched/fair.c:830:34: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_entity *se @@ got struct sched_entity [noderef] __rcu * @@
   kernel/sched/fair.c:830:34: sparse:     expected struct sched_entity *se
   kernel/sched/fair.c:830:34: sparse:     got struct sched_entity [noderef] __rcu *
   kernel/sched/fair.c:5458:38: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/fair.c:5458:38: sparse:     expected struct task_struct *curr
   kernel/sched/fair.c:5458:38: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/fair.c:7048:38: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/fair.c:7048:38: sparse:     expected struct task_struct *curr
   kernel/sched/fair.c:7048:38: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/fair.c:7332:38: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/fair.c:7332:38: sparse:     expected struct task_struct *curr
   kernel/sched/fair.c:7332:38: sparse:     got struct task_struct [noderef] __rcu *curr
>> kernel/sched/fair.c:7364:40: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_entity *curr @@ got struct sched_entity [noderef] __rcu * @@
   kernel/sched/fair.c:7364:40: sparse:     expected struct sched_entity *curr
   kernel/sched/fair.c:7364:40: sparse:     got struct sched_entity [noderef] __rcu *
   kernel/sched/fair.c:5387:35: sparse: sparse: marked inline, but without a definition
   kernel/sched/fair.c: note: in included file:
   kernel/sched/sched.h:2011:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2011:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2011:25: sparse:    struct task_struct *
   kernel/sched/sched.h:2169:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2169:9: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2169:9: sparse:    struct task_struct *
   kernel/sched/sched.h:2011:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2011:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2011:25: sparse:    struct task_struct *
   kernel/sched/sched.h:2011:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2011:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2011:25: sparse:    struct task_struct *

vim +7364 kernel/sched/fair.c

  7360	
  7361	static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
  7362	{
  7363		struct sched_entity *se = &p->se;
> 7364		struct sched_entity *curr = &rq->curr->se;
  7365	
  7366		/* throttled hierarchies are not runnable */
  7367		if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
  7368			return false;
  7369	
  7370		/* Tell the scheduler that we'd really like pse to run next. */
  7371		set_next_buddy(se);
  7372	
  7373		yield_task_fair(rq);
  7374	
  7375		/*
  7376		 * This path is special and only called from KVM. In contrast to yield,
  7377		 * in yield_to we really know that current is spinning and we know
  7378		 * (s390) or have good heuristics whom are we waiting for. There is
  7379		 * absolutely no point in continuing the current task, even if this
  7380		 * means to become unfairer. Let us give the current process some
  7381		 * "fake" penalty.
  7382		 */
  7383		curr->vruntime += sched_slice(cfs_rq_of(curr), curr);
  7384	
  7385		return true;
  7386	}
  7387	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
On Wed, Jul 07, 2021 at 02:34:02PM +0200, Christian Borntraeger wrote:
> After some debugging in situations where a smaller sched_latency_ns and
> smaller sched_migration_cost settings helped for KVM host, I was able to
> come up with a reduced testcase.
> This testcase has 2 vcpus working on a shared memory location and
> waiting for mem % 2 == cpu number to then do an add on the shared
> memory.
> To start simple I pinned all vcpus to one host CPU. Without the
> yield_to in KVM the testcase was horribly slow. This is expected as each
> vcpu will spin a whole time slice. With the yield_to from KVM things are
> much better, but I was still seeing yields being ignored.
> In the end pick_next_entity decided to keep the current process running
> due to fairness reasons. On this path we really know that there is no
> point in continuing current. So let us make things a bit unfairer to
> current.
> This makes the reduced testcase noticeably faster. It improved a more
> realistic test case (many guests on some host CPUs with overcommitment)
> even more.
> In the end this is similar to the old compat_sched_yield approach with
> an important difference:
> Instead of doing it for all yields we now only do it for yield_to,
> a place where we really know that current is waiting for the target.
>
> What are alternative implementations for this patch
> - do the same as the old compat_sched_yield:
>   current->vruntime = rightmost->vruntime+1
> - provide a new tunable sched_ns_yield_penalty: how much vruntime to add
>   (could be per architecture)
> - also fiddle with the vruntime of the target
>   e.g. subtract from the target what we add to the source
>
> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>

I think this one accidentally fell off everyone's radar including mine.
At the time this patch was mailed I remembered thinking that playing
games with vruntime might have other consequences. For example, what I
believe is the most relevant problem for KVM is that a task spinning to
acquire a lock may be waiting on a vcpu holding the lock that has been
descheduled. Without vcpu pinning, it's possible that the holder is on
the same runqueue as the lock acquirer so the acquirer is wasting CPU.

In such a case, changing the acquirer's vcpu may mean that it unfairly
loses CPU time simply because it's a lock acquirer. Vincent, what do you
think? Christian, would you mind testing this as an alternative with your
demonstration test case and more importantly the "realistic test case"?

--8<--
sched: Do not select highest priority task to run if it should be skipped

pick_next_entity will consider the "next buddy" over the highest priority
task if it's not unfair to do so (as determined by wakeup_preempt_entity).
The potential problem is that an in-kernel user of yield_to() such as
KVM may explicitly want to yield the current task because it is trying
to acquire a spinlock from a task that is currently descheduled and
potentially running on the same runqueue. However, if it's more fair from
the scheduler perspective to continue running the current task, it'll
continue to spin uselessly waiting on a descheduled task to run.

This patch will select the targeted task to run even if it's unfair if
the highest priority task is explicitly marked as "skip".

This was evaluated using a debugging patch to expose yield_to as a system
call. A demonstration program creates N threads and arranges them in a
ring that are updating a shared value in memory. Each thread spins until
the value matches the thread ID. It then updates the value and wakes the
next thread in the ring. It measures how many times it spins before it
gets its turn. Without the patch, the number of spins is highly variable
and unstable but with the patch it's more consistent.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 44c452072a1b..ddc0212d520f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4522,7 +4522,8 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 		se = second;
 	}
 
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
+	if (cfs_rq->next &&
+	    (cfs_rq->skip == left || wakeup_preempt_entity(cfs_rq->next, left) < 1)) {
 		/*
 		 * Someone really wants this to run. If it's not unfair, run it.
 		 */
On 23.07.21 11:35, Mel Gorman wrote:
> On Wed, Jul 07, 2021 at 02:34:02PM +0200, Christian Borntraeger wrote:
>> After some debugging in situations where a smaller sched_latency_ns and
>> smaller sched_migration_cost settings helped for KVM host, I was able to
>> come up with a reduced testcase.
>> This testcase has 2 vcpus working on a shared memory location and
>> waiting for mem % 2 == cpu number to then do an add on the shared
>> memory.
>> To start simple I pinned all vcpus to one host CPU. Without the
>> yield_to in KVM the testcase was horribly slow. This is expected as each
>> vcpu will spin a whole time slice. With the yield_to from KVM things are
>> much better, but I was still seeing yields being ignored.
>> In the end pick_next_entity decided to keep the current process running
>> due to fairness reasons. On this path we really know that there is no
>> point in continuing current. So let us make things a bit unfairer to
>> current.
>> This makes the reduced testcase noticeably faster. It improved a more
>> realistic test case (many guests on some host CPUs with overcommitment)
>> even more.
>> In the end this is similar to the old compat_sched_yield approach with
>> an important difference:
>> Instead of doing it for all yields we now only do it for yield_to,
>> a place where we really know that current is waiting for the target.
>>
>> What are alternative implementations for this patch
>> - do the same as the old compat_sched_yield:
>>   current->vruntime = rightmost->vruntime+1
>> - provide a new tunable sched_ns_yield_penalty: how much vruntime to add
>>   (could be per architecture)
>> - also fiddle with the vruntime of the target
>>   e.g. subtract from the target what we add to the source
>>
>> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
>
> I think this one accidentally fell off everyone's radar including mine.
> At the time this patch was mailed I remembered thinking that playing games
> with vruntime might have other consequences. For example, what I believe is
> the most relevant problem for KVM is that a task spinning to acquire a
> lock may be waiting on a vcpu holding the lock that has been
> descheduled. Without vcpu pinning, it's possible that the holder is on
> the same runqueue as the lock acquirer so the acquirer is wasting CPU.
>
> In such a case, changing the acquirer's vcpu may mean that it unfairly
> loses CPU time simply because it's a lock acquirer. Vincent, what do you
> think? Christian, would you mind testing this as an alternative with your
> demonstration test case and more importantly the "realistic test case"?
>
> --8<--
> sched: Do not select highest priority task to run if it should be skipped
>
> pick_next_entity will consider the "next buddy" over the highest priority
> task if it's not unfair to do so (as determined by wakeup_preempt_entity).
> The potential problem is that an in-kernel user of yield_to() such as
> KVM may explicitly want to yield the current task because it is trying
> to acquire a spinlock from a task that is currently descheduled and
> potentially running on the same runqueue. However, if it's more fair from
> the scheduler perspective to continue running the current task, it'll
> continue to spin uselessly waiting on a descheduled task to run.
>
> This patch will select the targeted task to run even if it's unfair if
> the highest priority task is explicitly marked as "skip".
>
> This was evaluated using a debugging patch to expose yield_to as a system
> call. A demonstration program creates N threads and arranges them in a
> ring that are updating a shared value in memory. Each thread spins until
> the value matches the thread ID. It then updates the value and wakes the
> next thread in the ring. It measures how many times it spins before it
> gets its turn. Without the patch, the number of spins is highly variable
> and unstable but with the patch it's more consistent.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  kernel/sched/fair.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 44c452072a1b..ddc0212d520f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4522,7 +4522,8 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  		se = second;
>  	}
>
> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> +	if (cfs_rq->next &&
> +	    (cfs_rq->skip == left || wakeup_preempt_entity(cfs_rq->next, left) < 1)) {
>  		/*
>  		 * Someone really wants this to run. If it's not unfair, run it.
>  		 */

I do see a reduction in ignored yields, but from a performance aspect for
my testcases this patch does not provide a benefit, while the simple
	curr->vruntime += sysctl_sched_min_granularity;
does. I still think that your approach is probably the cleaner one, any
chance to improve this somehow?

FWIW, I recently realized that my patch also does not solve the problem
for KVM with libvirt. My testcase only improves with qemu started by hand.
As soon as the cpu cgroup controller is active, my patch also no longer
helps. If you have any patch to test, I can test it. Meanwhile I will also
do some more testing.
On Fri, Jul 23, 2021 at 02:36:21PM +0200, Christian Borntraeger wrote:
> > sched: Do not select highest priority task to run if it should be skipped
> >
> > <SNIP>
> >
> > index 44c452072a1b..ddc0212d520f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4522,7 +4522,8 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >  		se = second;
> >  	}
> > -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> > +	if (cfs_rq->next &&
> > +	    (cfs_rq->skip == left || wakeup_preempt_entity(cfs_rq->next, left) < 1)) {
> >  		/*
> >  		 * Someone really wants this to run. If it's not unfair, run it.
> >  		 */
>
> I do see a reduction in ignored yields, but from a performance aspect for my
> testcases this patch does not provide a benefit, while the simple
> 	curr->vruntime += sysctl_sched_min_granularity;
> does.

I'm still not a fan because vruntime gets distorted. From the docs

	Small detail: on "ideal" hardware, at any time all tasks would have the same
	p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
	would ever get "out of balance" from the "ideal" share of CPU time

If yield_to impacts this "ideal share" then it could have other
consequences.

I think your patch may be performing better in your test case because
every "wrong" task selected that is not the yield_to target gets penalised
and so the yield_to target gets pushed up the list.

> I still think that your approach is probably the cleaner one, any chance
> to improve this somehow?

Potentially. The patch was a bit off because while it noticed that skip
was not being obeyed, the fix was clumsy and isolated. The current flow is

1. pick se == left as the candidate
2. try pick a different se if the "ideal" candidate is a skip candidate
3. Ignore the se update if next or last are set

Step 3 looks off because it ignores skip if next or last buddies are set
and I don't think that was intended. Can you try this?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 44c452072a1b..d56f7772a607 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4522,12 +4522,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 		se = second;
 	}
 
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
+	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1) {
 		/*
 		 * Someone really wants this to run. If it's not unfair, run it.
 		 */
 		se = cfs_rq->next;
-	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
+	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1) {
 		/*
 		 * Prefer last buddy, try to return the CPU to a preempted task.
 		 */
On 23.07.21 18:21, Mel Gorman wrote:
> On Fri, Jul 23, 2021 at 02:36:21PM +0200, Christian Borntraeger wrote:
>>> sched: Do not select highest priority task to run if it should be skipped
>>>
>>> <SNIP>
>>>
>>> index 44c452072a1b..ddc0212d520f 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -4522,7 +4522,8 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>>>  		se = second;
>>>  	}
>>> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
>>> +	if (cfs_rq->next &&
>>> +	    (cfs_rq->skip == left || wakeup_preempt_entity(cfs_rq->next, left) < 1)) {
>>>  		/*
>>>  		 * Someone really wants this to run. If it's not unfair, run it.
>>>  		 */
>>
>> I do see a reduction in ignored yields, but from a performance aspect for my
>> testcases this patch does not provide a benefit, while the simple
>> 	curr->vruntime += sysctl_sched_min_granularity;
>> does.
>
> I'm still not a fan because vruntime gets distorted. From the docs
>
> 	Small detail: on "ideal" hardware, at any time all tasks would have the same
> 	p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
> 	would ever get "out of balance" from the "ideal" share of CPU time
>
> If yield_to impacts this "ideal share" then it could have other
> consequences.
>
> I think your patch may be performing better in your test case because every
> "wrong" task selected that is not the yield_to target gets penalised and
> so the yield_to target gets pushed up the list.
>
>> I still think that your approach is probably the cleaner one, any chance
>> to improve this somehow?
>
> Potentially. The patch was a bit off because while it noticed that skip
> was not being obeyed, the fix was clumsy and isolated. The current flow is
>
> 1. pick se == left as the candidate
> 2. try pick a different se if the "ideal" candidate is a skip candidate
> 3. Ignore the se update if next or last are set
>
> Step 3 looks off because it ignores skip if next or last buddies are set
> and I don't think that was intended. Can you try this?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 44c452072a1b..d56f7772a607 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4522,12 +4522,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  		se = second;
>  	}
> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> +	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1) {
>  		/*
>  		 * Someone really wants this to run. If it's not unfair, run it.
>  		 */
>  		se = cfs_rq->next;
> -	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
> +	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1) {
>  		/*
>  		 * Prefer last buddy, try to return the CPU to a preempted task.
>  		 */

This one alone does not seem to make a difference. Neither in ignored
yields nor in performance.

Your first patch does really help in terms of ignored yields when all
threads are pinned to one host CPU. After that we see no ignored yields,
it seems. But it does not affect the performance of my testcase.

I did some more experiments and I removed the wakeup_preempt_entity checks
in pick_next_entity - assuming that this will result in source always
being stopped and target always being picked. But still, no performance
difference. As soon as I play with vruntime I do see a difference (but
only without the cpu cgroup controller). I will try to better understand
the scheduler logic and do some more testing. If you have anything that I
should test, let me know.

Christian
On Mon, Jul 26, 2021 at 08:41:15PM +0200, Christian Borntraeger wrote:
> > Potentially. The patch was a bit off because while it noticed that skip
> > was not being obeyed, the fix was clumsy and isolated. The current flow is
> >
> > 1. pick se == left as the candidate
> > 2. try pick a different se if the "ideal" candidate is a skip candidate
> > 3. Ignore the se update if next or last are set
> >
> > Step 3 looks off because it ignores skip if next or last buddies are set
> > and I don't think that was intended. Can you try this?
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 44c452072a1b..d56f7772a607 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4522,12 +4522,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >  		se = second;
> >  	}
> > -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> > +	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1) {
> >  		/*
> >  		 * Someone really wants this to run. If it's not unfair, run it.
> >  		 */
> >  		se = cfs_rq->next;
> > -	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
> > +	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1) {
> >  		/*
> >  		 * Prefer last buddy, try to return the CPU to a preempted task.
> >  		 */
>
> This one alone does not seem to make a difference. Neither in ignored
> yields nor in performance.
>
> Your first patch does really help in terms of ignored yields when
> all threads are pinned to one host CPU.

Ok, that tells us something. It implies, but does not prove, that the
block above that handles skip is failing either the entity_before()
test or the wakeup_preempt_entity() test. To what degree that should be
relaxed when cfs_rq->next is !NULL is harder to determine.

> After that we do have no ignored yields
> it seems. But it does not affect the performance of my testcase.

Ok, this is the first patch. The second patch is not improving ignored
yields at all so the above paragraph still applies. It would be nice
if you could instrument with trace_printk when cfs_rq->next is valid
whether it's the entity_before() check that is preventing the skip or
wakeup_preempt_entity. Would that be possible?

I still think the second patch is right independent of it helping your
test case because it makes no sense to me at all that the task after the
skip candidate is ignored if there is a next or last buddy.

> I did some more experiments and I removed the wakeup_preempt_entity checks in
> pick_next_entity - assuming that this will result in source always being stopped
> and target always being picked. But still, no performance difference.
> As soon as I play with vruntime I do see a difference (but only without the cpu cgroup
> controller). I will try to better understand the scheduler logic and do some more
> testing. If you have anything that I should test, let me know.

The fact that vruntime tricks only make a difference when cgroups are
involved is interesting. Can you describe roughly how the cgroup is
configured? Similarly, does your config have CONFIG_SCHED_AUTOGROUP
or CONFIG_FAIR_GROUP_SCHED set? I assume FAIR_GROUP_SCHED must be and
I wonder if the impact of your patch is dropping groups of tasks in
priority as opposed to individual tasks. I'm not that familiar with how
groups are handled in terms of how they are prioritised unfortunately.

I'm still hesitant to consider the vruntime hammer in case it causes
fairness problems when vruntime is no longer reflecting time spent on
the CPU.
On 26.07.21 21:32, Mel Gorman wrote:
> On Mon, Jul 26, 2021 at 08:41:15PM +0200, Christian Borntraeger wrote:
>>> Potentially. The patch was a bit off because while it noticed that skip
>>> was not being obeyed, the fix was clumsy and isolated. The current flow is
>>>
>>> 1. pick se == left as the candidate
>>> 2. try pick a different se if the "ideal" candidate is a skip candidate
>>> 3. Ignore the se update if next or last are set
>>>
>>> Step 3 looks off because it ignores skip if next or last buddies are set
>>> and I don't think that was intended. Can you try this?
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 44c452072a1b..d56f7772a607 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -4522,12 +4522,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>>>  		se = second;
>>>  	}
>>> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
>>> +	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1) {
>>>  		/*
>>>  		 * Someone really wants this to run. If it's not unfair, run it.
>>>  		 */
>>>  		se = cfs_rq->next;
>>> -	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
>>> +	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1) {
>>>  		/*
>>>  		 * Prefer last buddy, try to return the CPU to a preempted task.
>>>  		 */
>>
>> This one alone does not seem to make a difference. Neither in ignored
>> yields nor in performance.
>>
>> Your first patch does really help in terms of ignored yields when
>> all threads are pinned to one host CPU.
>
> Ok, that tells us something. It implies, but does not prove, that the
> block above that handles skip is failing either the entity_before()
> test or the wakeup_preempt_entity() test. To what degree that should be
> relaxed when cfs_rq->next is !NULL is harder to determine.
>
>> After that we do have no ignored yields
>> it seems. But it does not affect the performance of my testcase.
>
> Ok, this is the first patch. The second patch is not improving ignored
> yields at all so the above paragraph still applies. It would be nice
> if you could instrument with trace_printk when cfs_rq->next is valid
> whether it's the entity_before() check that is preventing the skip or
> wakeup_preempt_entity. Would that be possible?

I will try that.

> I still think the second patch is right independent of it helping your
> test case because it makes no sense to me at all that the task after the
> skip candidate is ignored if there is a next or last buddy.

I agree. This patch makes sense to me as a bug fix. And I think also the
first patch makes sense on its own.

>> I did some more experiments and I removed the wakeup_preempt_entity checks in
>> pick_next_entity - assuming that this will result in source always being stopped
>> and target always being picked. But still, no performance difference.
>> As soon as I play with vruntime I do see a difference (but only without the cpu cgroup
>> controller). I will try to better understand the scheduler logic and do some more
>> testing. If you have anything that I should test, let me know.
>
> The fact that vruntime tricks only make a difference when cgroups are
> involved is interesting. Can you describe roughly how the cgroup
> is configured?

It's the other way around. My vruntime patch ONLY helps WITHOUT the cpu
cgroup controller. In other words, this example on a 16 CPU host
(resulting in 4x overcommitment)

time ( for ((d=0; d<16; d++)) ; do cgexec -g cpu:test$d qemu-system-s390x -enable-kvm -kernel /root/REPOS/kvm-unit-tests/s390x/diag9c.elf -smp 4 -nographic -nodefaults -device sclpconsole,chardev=c2 -chardev file,path=/tmp/log$d.log,id=c2 & done; wait)

does NOT benefit from the vruntime patch, but when I remove the
"cgexec -g cpu:test$d" it does:

time ( for ((d=0; d<16; d++)) ; do qemu-system-s390x -enable-kvm -kernel /root/REPOS/kvm-unit-tests/s390x/diag9c.elf -smp 4 -nographic -nodefaults -device sclpconsole,chardev=c2 -chardev file,path=/tmp/log$d.log,id=c2 & done; wait)

> Similarly, does your config have CONFIG_SCHED_AUTOGROUP
> or CONFIG_FAIR_GROUP_SCHED set? I assume FAIR_GROUP_SCHED must be and

Yes, both are set.

> I wonder if the impact of your patch is dropping groups of tasks in
> priority as opposed to individual tasks. I'm not that familiar with how
> groups are handled in terms of how they are prioritised unfortunately.
>
> I'm still hesitant to consider the vruntime hammer in case it causes
> fairness problems when vruntime is no longer reflecting time spent on
> the CPU.

I understand your concerns. What about subtracting the same amount of
vruntime from the target as we add to the yielder? Would that result in
quicker rebalancing while still keeping everything in order?

The reason why I am asking is that initially we realized that setting
some tunables lower, e.g.
	kernel.sched_latency_ns = 2000000
	kernel.sched_migration_cost_ns = 100000
makes things faster in a similar fashion. And that also works with
cgroups. But ideally we find a solution without changing tuneables.
On Fri, Jul 23, 2021 at 05:21:37PM +0100, Mel Gorman wrote:
> I'm still not a fan because vruntime gets distorted. From the docs
>
> 	Small detail: on "ideal" hardware, at any time all tasks would have the same
> 	p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
> 	would ever get "out of balance" from the "ideal" share of CPU time
>
> If yield_to impacts this "ideal share" then it could have other
> consequences.

Note that we already violate this ideal both subtly and explicitly. For
an example of the latter, consider pretty much the entirety of
place_entity() with GENTLE_FAIR_SLEEPERS being the most egregious example.

That said; adding to vruntime will only penalize the task itself, while
subtracting from vruntime will penalize everyone else. And in that sense
addition to vruntime is a safe option.

I've not fully considered the case at hand; just wanted to give some
context.
On Fri, Jul 23, 2021 at 10:35:23AM +0100, Mel Gorman wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 44c452072a1b..ddc0212d520f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4522,7 +4522,8 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  		se = second;
>  	}
>
> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> +	if (cfs_rq->next &&
> +	    (cfs_rq->skip == left || wakeup_preempt_entity(cfs_rq->next, left) < 1)) {
>  		/*
>  		 * Someone really wants this to run. If it's not unfair, run it.
>  		 */

With a little more context this function reads like:

	se = left;

	if (cfs_rq->skip && cfs_rq->skip == se) {
		...

+	if (cfs_rq->next && (cfs_rq->skip == left || ...))

If '...' doesn't change @left (afaict it doesn't), then your change (+)
is equivalent to '&& true', or am I reading things wrong?
On Tue, Jul 27, 2021 at 03:33:00PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 23, 2021 at 10:35:23AM +0100, Mel Gorman wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 44c452072a1b..ddc0212d520f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4522,7 +4522,8 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >  		se = second;
> >  	}
> >
> > -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> > +	if (cfs_rq->next &&
> > +	    (cfs_rq->skip == left || wakeup_preempt_entity(cfs_rq->next, left) < 1)) {
> >  		/*
> >  		 * Someone really wants this to run. If it's not unfair, run it.
> >  		 */
>
> With a little more context this function reads like:
>
> 	se = left;
>
> 	if (cfs_rq->skip && cfs_rq->skip == se) {
> 		...
>
> +	if (cfs_rq->next && (cfs_rq->skip == left || ...))
>
> If '...' doesn't change @left (afaict it doesn't), then your change (+)
> is equivalent to '&& true', or am I reading things wrong?

You're not reading it wrong, although the patch is clumsy and may
introduce unfairness that gets incrementally worse if there were repeated
yields to the same task. A second patch was posted that does

-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
+	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1) {

i.e. if the skip hint picks a second alternative then next or last
buddies should be compared to the second alternative and not "left".

It doesn't help, indicating that the skip hint is not obeyed because
"second" failed the entity_before() or wakeup_preempt_entity() checks.
I'm waiting on a trace to see which check dominates.

That said, I'm still undecided on how to approach this. None of the
proposed patches on their own helps, but the options are

1. Strictly obey the next buddy if the skip hint is the same se as left
   (first patch, which I'm not very happy with even if it helped the
   test case)
2. My second patch, which compares next/last with "second" if the skip
   hint skips "left". This may be a sensible starting point no matter
   what
3. Relaxing how "second" is selected if next or last buddies are set
4. vruntime tricks, even if it punishes fairness for the task yielding
   the CPU. The advantage of this approach is that if there are multiple
   tasks ahead of the task being yielded to, then the yield_to task will
   become "left" very quickly regardless of any buddy-related hints.

I don't know what "3" would look like yet, it might be very fragile, but
let's see what the tracing says. Otherwise, testing 2+4 might be
worthwhile to see if the combination helps Christian's test case when the
cpu cgroup is involved.
Christian Borntraeger <borntraeger@de.ibm.com> writes:

> On 23.07.21 18:21, Mel Gorman wrote:
>> On Fri, Jul 23, 2021 at 02:36:21PM +0200, Christian Borntraeger wrote:
>>>> sched: Do not select highest priority task to run if it should be skipped
>>>>
>>>> <SNIP>
>>>>
>>>> index 44c452072a1b..ddc0212d520f 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -4522,7 +4522,8 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>>>>  			se = second;
>>>>  	}
>>>>  
>>>> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
>>>> +	if (cfs_rq->next &&
>>>> +	    (cfs_rq->skip == left || wakeup_preempt_entity(cfs_rq->next, left) < 1)) {
>>>>  		/*
>>>>  		 * Someone really wants this to run. If it's not unfair, run it.
>>>>  		 */
>>>
>>> I do see a reduction in ignored yields, but from a performance aspect for my
>>> testcases this patch does not provide a benefit, while the simple
>>> 	curr->vruntime += sysctl_sched_min_granularity;
>>> does.
>>
>> I'm still not a fan because vruntime gets distorted. From the docs
>>
>> 	Small detail: on "ideal" hardware, at any time all tasks would have the same
>> 	p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
>> 	would ever get "out of balance" from the "ideal" share of CPU time
>>
>> If yield_to impacts this "ideal share" then it could have other
>> consequences.
>>
>> I think your patch may be performing better in your test case because every
>> "wrong" task selected that is not the yield_to target gets penalised and
>> so the yield_to target gets pushed up the list.
>>
>>> I still think that your approach is probably the cleaner one, any chance to
>>> improve this somehow?
>>
>> Potentially. The patch was a bit off because while it noticed that skip
>> was not being obeyed, the fix was clumsy and isolated. The current flow is
>>
>> 1. Pick se == left as the candidate
>> 2. Try to pick a different se if the "ideal" candidate is a skip candidate
>> 3. Ignore the se update if next or last are set
>>
>> Step 3 looks off because it ignores skip if next or last buddies are set
>> and I don't think that was intended. Can you try this?
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 44c452072a1b..d56f7772a607 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4522,12 +4522,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>>  			se = second;
>>  	}
>>  
>> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
>> +	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1) {
>>  		/*
>>  		 * Someone really wants this to run. If it's not unfair, run it.
>>  		 */
>>  		se = cfs_rq->next;
>> -	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
>> +	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1) {
>>  		/*
>>  		 * Prefer last buddy, try to return the CPU to a preempted task.
>>  		 */
>
> This one alone does not seem to make a difference. Neither in ignored yields,
> nor in performance.
>
> Your first patch does really help in terms of ignored yields when
> all threads are pinned to one host CPU. After that we have no ignored yields,
> it seems. But it does not affect the performance of my testcase.
>
> I did some more experiments and I removed the wakeup_preempt_entity checks in
> pick_next_entity - assuming that this will result in the source always being
> stopped and the target always being picked. But still, no performance
> difference.
>
> As soon as I play with vruntime I do see a difference (but only without the
> cpu cgroup controller). I will try to better understand the scheduler logic
> and do some more testing. If you have anything that I should test, let me
> know.
>
> Christian

If both yielder and target are in the same cpu cgroup, or the cpu cgroup
is disabled (ie, if cfs_rq_of(p->se) matches), you could try

	if (p->se.vruntime > rq->curr->se.vruntime)
		swap(p->se.vruntime, rq->curr->se.vruntime)

as well as the existing buddy flags, as an entirely fair vruntime boost
to the target.

For when they aren't direct siblings, you /could/ use find_matching_se,
but it's much less clear that's desirable, since it would yield vruntime
for the entire hierarchy to the target's hierarchy.
On 27.07.21 20:57, Benjamin Segall wrote:
> Christian Borntraeger <borntraeger@de.ibm.com> writes:
>
> <SNIP>
>
>> As soon as I play with vruntime I do see a difference (but only without the
>> cpu cgroup controller). I will try to better understand the scheduler logic
>> and do some more testing. If you have anything that I should test, let me
>> know.
>>
>> Christian
>
> If both yielder and target are in the same cpu cgroup or the cpu cgroup
> is disabled (ie, if cfs_rq_of(p->se) matches), you could try
>
> 	if (p->se.vruntime > rq->curr->se.vruntime)
> 		swap(p->se.vruntime, rq->curr->se.vruntime)

I tried that and it does not show the performance benefit. I then played with
my patch (which uses different values to add) and the benefit seems to depend
on the size of what is being added; maybe when swapping it was just not large
enough.

I have to say that it is all a bit unclear what improves performance and why.
It just seems that the cpu cgroup controller has a fair share of the
performance problems.

I also asked the performance people to run some measurements, and the numbers
for some transactional workload under KVM were:

baseline:                                                       11813
with much smaller sched_latency_ns and sched_migration_cost_ns: 16419
with cpu controller disabled:                                   15962
with cpu controller disabled + my patch:                        16782

I will be travelling the next 2 weeks, so I can continue with more debugging
after that.

Thanks for all the ideas and help so far.

Christian

> as well as the existing buddy flags, as an entirely fair vruntime boost
> to the target.
>
> For when they aren't direct siblings, you /could/ use find_matching_se,
> but it's much less clear that's desirable, since it would yield vruntime
> for the entire hierarchy to the target's hierarchy.
On Wed, 28 Jul 2021 at 18:24, Christian Borntraeger <borntraeger@de.ibm.com> wrote:
>
> On 27.07.21 20:57, Benjamin Segall wrote:
> > Christian Borntraeger <borntraeger@de.ibm.com> writes:
> >
> > <SNIP>
> >
> > If both yielder and target are in the same cpu cgroup or the cpu cgroup
> > is disabled (ie, if cfs_rq_of(p->se) matches), you could try
> >
> >       if (p->se.vruntime > rq->curr->se.vruntime)
> >               swap(p->se.vruntime, rq->curr->se.vruntime)
>
> I tried that and it does not show the performance benefit. I then played with
> my patch (uses different values to add) and the benefit seems to depend on
> the size that is being added, maybe when swapping it was just not large
> enough.
>
> I have to say that this is all a bit unclear what and why performance
> improves. It just seems that the cpu cgroup controller has a fair share of
> the performance problems.
>
> I also asked the performance people to run some measurements and the numbers
> of some transactional workload under KVM were
> baseline: 11813
> with much smaller sched_latency_ns and sched_migration_cost_ns: 16419

Have you also tried to increase sched_wakeup_granularity, which is used to
decide whether we can preempt curr?

> with cpu controller disabled: 15962
> with cpu controller disabled + my patch: 16782

Your patch modifies the vruntime of the task, but the cgroup sched_entity
stays unchanged. The scheduler starts by comparing the vruntime of the
sched_entity of the group before reaching your task. That's probably why your
patch doesn't help with cgroups, and it will be difficult to extend it to
cgroups because the yield of the task should not impact the other entities in
the group.

> I will be travelling the next 2 weeks, so I can continue with more debugging
> after that.
>
> Thanks for all the ideas and help so far.
>
> Christian
>
> > as well as the existing buddy flags, as an entirely fair vruntime boost
> > to the target.
> >
> > For when they aren't direct siblings, you /could/ use find_matching_se,
> > but it's much less clear that's desirable, since it would yield vruntime
> > for the entire hierarchy to the target's hierarchy.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23663318fb81..4f661a9ed3ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7337,6 +7337,7 @@ static void yield_task_fair(struct rq *rq)
 static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
+	struct sched_entity *curr = &rq->curr->se;
 
 	/* throttled hierarchies are not runnable */
 	if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
@@ -7347,6 +7348,16 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 
 	yield_task_fair(rq);
 
+	/*
+	 * This path is special and only called from KVM. In contrast to yield,
+	 * in yield_to we really know that current is spinning and we know
+	 * (s390) or have good heuristics whom we are waiting for. There is
+	 * absolutely no point in continuing the current task, even if this
+	 * means becoming unfairer. Let us give the current process some
+	 * "fake" penalty.
+	 */
+	curr->vruntime += sched_slice(cfs_rq_of(curr), curr);
+
 	return true;
 }
After some debugging in situations where smaller sched_latency_ns and
sched_migration_cost settings helped the KVM host, I was able to come up with
a reduced testcase.

The testcase has 2 vcpus working on a shared memory location, each waiting
for mem % 2 == cpu number and then doing an add on the shared memory.

To start simple, I pinned all vcpus to one host CPU. Without the yield_to in
KVM the testcase was horribly slow. This is expected, as each vcpu will spin
for a whole time slice. With the yield_to from KVM things are much better,
but I was still seeing yields being ignored.

In the end pick_next_entity decided to keep the current process running due
to fairness reasons. On this path we really know that there is no point in
continuing current. So let us make things a bit unfairer to current.

This makes the reduced testcase noticeably faster. It improved a more
realistic test case (many guests on some host CPUs with overcommitment) even
more.

In the end this is similar to the old compat_sched_yield approach, with an
important difference: instead of doing it for all yields, we now only do it
for yield_to, a place where we really know that current is waiting for the
target.

Possible alternative implementations for this patch:
- do the same as the old compat_sched_yield:
  current->vruntime = rightmost->vruntime + 1
- provide a new tunable sched_ns_yield_penalty: how much vruntime to add
  (could be per architecture)
- also fiddle with the vruntime of the target, e.g. subtract from the target
  what we add to the source

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
---
 kernel/sched/fair.c | 11 +++++++++++
 1 file changed, 11 insertions(+)