diff mbox series

[1/1] sched/fair: improve yield_to vs fairness

Message ID 20210707123402.13999-2-borntraeger@de.ibm.com (mailing list archive)
State New, archived
Headers show
Series Improve yield (was: sched: Move SCHED_DEBUG sysctl to debugfs) | expand

Commit Message

Christian Borntraeger July 7, 2021, 12:34 p.m. UTC
After some debugging in situations where a smaller sched_latency_ns and
smaller sched_migration_cost settings helped for a KVM host, I was able to
come up with a reduced testcase.
This testcase has 2 vcpus working on a shared memory location and
waiting for mem % 2 == cpu number to then do an add on the shared
memory.
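The loop each vcpu runs is roughly the following (a hypothetical
userspace sketch of the guest pattern, names illustrative):

	#define NCPUS 2
	_Atomic unsigned long mem;	/* the shared memory location */

	void vcpu_loop(unsigned long cpu)
	{
		for (;;) {
			while (mem % NCPUS != cpu)
				;	/* spin until it is our turn */
			mem++;		/* the add; hands the turn over */
		}
	}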
To start simple I pinned all vcpus to one host CPU. Without the
yield_to in KVM the testcase was horribly slow. This is expected as each
vcpu will spin a whole time slice. With the yield_to from KVM things are
much better, but I was still seeing yields being ignored.
In the end pick_next_entity decided to keep the current process running
due to fairness reasons.  On this path we really know that there is no
point in continuing current. So let us make things a bit unfairer to
current.
This makes the reduced testcase noticeably faster. It improved a more
realistic test case (many guests on some host CPUs with overcommitment)
even more.
In the end this is similar to the old compat_sched_yield approach with
an important difference:
Instead of doing it for all yields, we now only do it for yield_to,
a place where we really know that current is waiting for the target.

Alternative implementations considered for this patch:
- do the same as the old compat_sched_yield:
  current->vruntime = rightmost->vruntime+1
- provide a new tunable sched_ns_yield_penalty: how much vruntime to add
  (could be per architecture)
- also fiddle with the vruntime of the target
  e.g. subtract from the target what we add to the source

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
---
 kernel/sched/fair.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

kernel test robot July 7, 2021, 6:07 p.m. UTC | #1
Hi Christian,

I love your patch! Perhaps something to improve:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on v5.13 next-20210707]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patches, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Christian-Borntraeger/sched-fair-improve-yield_to-vs-fairness/20210707-213440
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 031e3bd8986fffe31e1ddbf5264cccfe30c9abd7
config: i386-randconfig-s002-20210707 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-341-g8af24329-dirty
        # https://github.com/0day-ci/linux/commit/75196412f9c36f51144f4c333b2b02d57bb0ebde
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Christian-Borntraeger/sched-fair-improve-yield_to-vs-fairness/20210707-213440
        git checkout 75196412f9c36f51144f4c333b2b02d57bb0ebde
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=i386 

If you fix the issue, kindly add the following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
   kernel/sched/fair.c:830:34: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_entity *se @@     got struct sched_entity [noderef] __rcu * @@
   kernel/sched/fair.c:830:34: sparse:     expected struct sched_entity *se
   kernel/sched/fair.c:830:34: sparse:     got struct sched_entity [noderef] __rcu *
   kernel/sched/fair.c:5458:38: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *curr @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/fair.c:5458:38: sparse:     expected struct task_struct *curr
   kernel/sched/fair.c:5458:38: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/fair.c:7048:38: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *curr @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/fair.c:7048:38: sparse:     expected struct task_struct *curr
   kernel/sched/fair.c:7048:38: sparse:     got struct task_struct [noderef] __rcu *curr
   kernel/sched/fair.c:7332:38: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct task_struct *curr @@     got struct task_struct [noderef] __rcu *curr @@
   kernel/sched/fair.c:7332:38: sparse:     expected struct task_struct *curr
   kernel/sched/fair.c:7332:38: sparse:     got struct task_struct [noderef] __rcu *curr
>> kernel/sched/fair.c:7364:40: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_entity *curr @@     got struct sched_entity [noderef] __rcu * @@
   kernel/sched/fair.c:7364:40: sparse:     expected struct sched_entity *curr
   kernel/sched/fair.c:7364:40: sparse:     got struct sched_entity [noderef] __rcu *
   kernel/sched/fair.c:5387:35: sparse: sparse: marked inline, but without a definition
   kernel/sched/fair.c: note: in included file:
   kernel/sched/sched.h:2011:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2011:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2011:25: sparse:    struct task_struct *
   kernel/sched/sched.h:2169:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2169:9: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2169:9: sparse:    struct task_struct *
   kernel/sched/sched.h:2011:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2011:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2011:25: sparse:    struct task_struct *
   kernel/sched/sched.h:2011:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/sched.h:2011:25: sparse:    struct task_struct [noderef] __rcu *
   kernel/sched/sched.h:2011:25: sparse:    struct task_struct *

vim +7364 kernel/sched/fair.c

  7360	
  7361	static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
  7362	{
  7363		struct sched_entity *se = &p->se;
> 7364		struct sched_entity *curr = &rq->curr->se;
  7365	
  7366		/* throttled hierarchies are not runnable */
  7367		if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
  7368			return false;
  7369	
  7370		/* Tell the scheduler that we'd really like pse to run next. */
  7371		set_next_buddy(se);
  7372	
  7373		yield_task_fair(rq);
  7374	
  7375		/*
  7376		 * This path is special and only called from KVM. In contrast to yield,
  7377		 * in yield_to we really know that current is spinning and we know
  7378		 * (s390) or have good heuristics whom we are waiting for. There is
  7379		 * absolutely no point in continuing the current task, even if this
  7380		 * means becoming unfairer. Let us give the current process some
  7381		 * "fake" penalty.
  7382		 */
  7383		curr->vruntime += sched_slice(cfs_rq_of(curr), curr);
  7384	
  7385		return true;
  7386	}
  7387	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Mel Gorman July 23, 2021, 9:35 a.m. UTC | #2
On Wed, Jul 07, 2021 at 02:34:02PM +0200, Christian Borntraeger wrote:
> After some debugging in situations where a smaller sched_latency_ns and
> smaller sched_migration_cost settings helped for a KVM host, I was able to
> come up with a reduced testcase.
> This testcase has 2 vcpus working on a shared memory location and
> waiting for mem % 2 == cpu number to then do an add on the shared
> memory.
> To start simple I pinned all vcpus to one host CPU. Without the
> yield_to in KVM the testcase was horribly slow. This is expected as each
> vcpu will spin a whole time slice. With the yield_to from KVM things are
> much better, but I was still seeing yields being ignored.
> In the end pick_next_entity decided to keep the current process running
> due to fairness reasons.  On this path we really know that there is no
> point in continuing current. So let us make things a bit unfairer to
> current.
> This makes the reduced testcase noticeably faster. It improved a more
> realistic test case (many guests on some host CPUs with overcommitment)
> even more.
> In the end this is similar to the old compat_sched_yield approach with
> an important difference:
> Instead of doing it for all yields, we now only do it for yield_to,
> a place where we really know that current is waiting for the target.
> 
> Alternative implementations considered for this patch:
> - do the same as the old compat_sched_yield:
>   current->vruntime = rightmost->vruntime+1
> - provide a new tunable sched_ns_yield_penalty: how much vruntime to add
>   (could be per architecture)
> - also fiddle with the vruntime of the target
>   e.g. subtract from the target what we add to the source
> 
> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>

I think this one accidentally fell off everyone's radar, including mine.
At the time this patch was mailed I remember thinking that playing games with
vruntime might have other consequences. For example, what I believe is
the most relevant problem for KVM is that a task spinning to acquire a
lock may be waiting on a vcpu holding the lock that has been
descheduled. Without vcpu pinning, it's possible that the holder is on
the same runqueue as the lock acquirer so the acquirer is wasting CPU.

In such a case, changing the acquirer's vcpu may mean that it unfairly
loses CPU time simply because it's a lock acquirer. Vincent, what do you
think? Christian, would you mind testing this as an alternative with your
demonstration test case and more importantly the "realistic test case"?

--8<--
sched: Do not select highest priority task to run if it should be skipped

pick_next_entity will consider the "next buddy" over the highest priority
task if it's not unfair to do so (as determined by wakeup_preempt_entity).
The potential problem is that an in-kernel user of yield_to() such as
KVM may explicitly want to yield the current task because it is trying
to acquire a spinlock from a task that is currently descheduled and
potentially running on the same runqueue. However, if it's more fair from
the scheduler perspective to continue running the current task, it'll continue
to spin uselessly waiting on a descheduled task to run.

This patch selects the targeted task to run, even if it is unfair, when the
highest priority task is explicitly marked as "skip".

This was evaluated using a debugging patch to expose yield_to as a system
call. A demonstration program creates N threads and arranges
them in a ring, each updating a shared value in memory. Each thread
spins until the value matches the thread ID. It then updates the value
and wakes the next thread in the ring. It measures how many times it spins
before it gets its turn. Without the patch, the number of spins is highly
variable and unstable but with the patch it's more consistent.
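
A minimal sketch of such a program (hypothetical; the actual program and
the debugging syscall wrapper are not shown in this thread, and
sched_yield() stands in for the yield_to() call):

	#include <pthread.h>
	#include <sched.h>
	#include <stdatomic.h>
	#include <stdio.h>

	#define NTHREADS 8
	#define ROUNDS   10000

	static atomic_int token;		/* whose turn it is */
	static long spins[NTHREADS];		/* spins before each turn */

	static void *worker(void *arg)
	{
		int id = (int)(long)arg;

		for (int r = 0; r < ROUNDS; r++) {
			/* Spin until the shared value matches our ID. */
			while (atomic_load(&token) != id) {
				spins[id]++;
				sched_yield();	/* stand-in for yield_to() */
			}
			/* Update the value; this "wakes" the next thread. */
			atomic_store(&token, (id + 1) % NTHREADS);
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t[NTHREADS];

		for (long i = 0; i < NTHREADS; i++)
			pthread_create(&t[i], NULL, worker, (void *)i);
		for (int i = 0; i < NTHREADS; i++)
			pthread_join(t[i], NULL);
		for (int i = 0; i < NTHREADS; i++)
			printf("thread %d: %ld spins\n", i, spins[i]);
		return 0;
	}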

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 44c452072a1b..ddc0212d520f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4522,7 +4522,8 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 			se = second;
 	}
 
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
+	if (cfs_rq->next &&
+	    (cfs_rq->skip == left || wakeup_preempt_entity(cfs_rq->next, left) < 1)) {
 		/*
 		 * Someone really wants this to run. If it's not unfair, run it.
 		 */
Christian Borntraeger July 23, 2021, 12:36 p.m. UTC | #3
On 23.07.21 11:35, Mel Gorman wrote:
> On Wed, Jul 07, 2021 at 02:34:02PM +0200, Christian Borntraeger wrote:
>> After some debugging in situations where a smaller sched_latency_ns and
>> smaller sched_migration_cost settings helped for a KVM host, I was able to
>> come up with a reduced testcase.
>> This testcase has 2 vcpus working on a shared memory location and
>> waiting for mem % 2 == cpu number to then do an add on the shared
>> memory.
>> To start simple I pinned all vcpus to one host CPU. Without the
>> yield_to in KVM the testcase was horribly slow. This is expected as each
>> vcpu will spin a whole time slice. With the yield_to from KVM things are
>> much better, but I was still seeing yields being ignored.
>> In the end pick_next_entity decided to keep the current process running
>> due to fairness reasons.  On this path we really know that there is no
>> point in continuing current. So let us make things a bit unfairer to
>> current.
>> This makes the reduced testcase noticeably faster. It improved a more
>> realistic test case (many guests on some host CPUs with overcommitment)
>> even more.
>> In the end this is similar to the old compat_sched_yield approach with
>> an important difference:
>> Instead of doing it for all yields, we now only do it for yield_to,
>> a place where we really know that current is waiting for the target.
>>
>> Alternative implementations considered for this patch:
>> - do the same as the old compat_sched_yield:
>>    current->vruntime = rightmost->vruntime+1
>> - provide a new tunable sched_ns_yield_penalty: how much vruntime to add
>>    (could be per architecture)
>> - also fiddle with the vruntime of the target
>>    e.g. subtract from the target what we add to the source
>>
>> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
> 
> I think this one accidentally fell off everyone's radar, including mine.
> At the time this patch was mailed I remember thinking that playing games with
> vruntime might have other consequences. For example, what I believe is
> the most relevant problem for KVM is that a task spinning to acquire a
> lock may be waiting on a vcpu holding the lock that has been
> descheduled. Without vcpu pinning, it's possible that the holder is on
> the same runqueue as the lock acquirer so the acquirer is wasting CPU.
> 
> In such a case, changing the acquirer's vcpu may mean that it unfairly
> loses CPU time simply because it's a lock acquirer. Vincent, what do you
> think? Christian, would you mind testing this as an alternative with your
> demonstration test case and more importantly the "realistic test case"?
> 
> --8<--
> sched: Do not select highest priority task to run if it should be skipped
> 
> pick_next_entity will consider the "next buddy" over the highest priority
> task if it's not unfair to do so (as determined by wakeup_preempt_entity).
> The potential problem is that an in-kernel user of yield_to() such as
> KVM may explicitly want to yield the current task because it is trying
> to acquire a spinlock from a task that is currently descheduled and
> potentially running on the same runqueue. However, if it's more fair from
> the scheduler perspective to continue running the current task, it'll continue
> to spin uselessly waiting on a descheduled task to run.
> 
> This patch selects the targeted task to run, even if it is unfair, when the
> highest priority task is explicitly marked as "skip".
> 
> This was evaluated using a debugging patch to expose yield_to as a system
> call. A demonstration program creates N threads and arranges
> them in a ring, each updating a shared value in memory. Each thread
> spins until the value matches the thread ID. It then updates the value
> and wakes the next thread in the ring. It measures how many times it spins
> before it gets its turn. Without the patch, the number of spins is highly
> variable and unstable but with the patch it's more consistent.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>   kernel/sched/fair.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 44c452072a1b..ddc0212d520f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4522,7 +4522,8 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>   			se = second;
>   	}
>   
> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> +	if (cfs_rq->next &&
> +	    (cfs_rq->skip == left || wakeup_preempt_entity(cfs_rq->next, left) < 1)) {
>   		/*
>   		 * Someone really wants this to run. If it's not unfair, run it.
>   		 */
> 

I do see a reduction in ignored yields, but from a performance aspect for my
testcases this patch does not provide a benefit, while the simple
	curr->vruntime += sysctl_sched_min_granularity;
does.
I still think that your approach is probably the cleaner one, any chance to improve this
somehow?

FWIW,  I recently realized that my patch also does not solve the problem for KVM with libvirt.
My testcase only improves with qemu started by hand. As soon as the cpu cgroup controller is
active, my patch also no longer helps.

If you have any patch to test, I can test it. Meanwhile I will also do some more testing.
Mel Gorman July 23, 2021, 4:21 p.m. UTC | #4
On Fri, Jul 23, 2021 at 02:36:21PM +0200, Christian Borntraeger wrote:
> > sched: Do not select highest priority task to run if it should be skipped
> > 
> > <SNIP>
> >
> > index 44c452072a1b..ddc0212d520f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4522,7 +4522,8 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >   			se = second;
> >   	}
> > -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> > +	if (cfs_rq->next &&
> > +	    (cfs_rq->skip == left || wakeup_preempt_entity(cfs_rq->next, left) < 1)) {
> >   		/*
> >   		 * Someone really wants this to run. If it's not unfair, run it.
> >   		 */
> > 
> 
> I do see a reduction in ignored yields, but from a performance aspect for my
> testcases this patch does not provide a benefit, while the simple
> 	curr->vruntime += sysctl_sched_min_granularity;
> does.

I'm still not a fan because vruntime gets distorted. From the docs

   Small detail: on "ideal" hardware, at any time all tasks would have the same
   p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
   would ever get "out of balance" from the "ideal" share of CPU time

If yield_to impacts this "ideal share" then it could have other
consequences.

I think your patch may be performing better in your test case because every
"wrong" task selected that is not the yield_to target gets penalised and
so the yield_to target gets pushed up the list.

> I still think that your approach is probably the cleaner one, any chance to improve this
> somehow?
> 

Potentially. The patch was a bit off because while it noticed that skip
was not being obeyed, the fix was clumsy and isolated. The current flow is

1. pick se == left as the candidate
2. try pick a different se if the "ideal" candidate is a skip candidate
3. Ignore the se update if next or last are set
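
Paraphrased, the relevant part of pick_next_entity() looks like this
(condensed from kernel/sched/fair.c around v5.13, details elided):

	static struct sched_entity *
	pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
	{
		struct sched_entity *left = __pick_first_entity(cfs_rq);
		struct sched_entity *se;

		/* Step 1: leftmost (smallest vruntime) entity, or curr. */
		if (!left || (curr && entity_before(curr, left)))
			left = curr;
		se = left;

		/* Step 2: if the ideal pick is the skip buddy, try the
		 * second-leftmost entity, if choosing it is not too unfair. */
		if (cfs_rq->skip && cfs_rq->skip == se) {
			struct sched_entity *second;

			if (se == curr) {
				second = __pick_first_entity(cfs_rq);
			} else {
				second = __pick_next_entity(se);
				if (!second || (curr && entity_before(curr, second)))
					second = curr;
			}
			if (second && wakeup_preempt_entity(second, left) < 1)
				se = second;
		}

		/* Step 3: next/last buddies are compared against "left",
		 * not against the "second" chosen above. */
		if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
			se = cfs_rq->next;
		else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
			se = cfs_rq->last;

		clear_buddies(cfs_rq, se);
		return se;
	}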

Step 3 looks off because it ignores skip if next or last buddies are set
and I don't think that was intended. Can you try this?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 44c452072a1b..d56f7772a607 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4522,12 +4522,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 			se = second;
 	}
 
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
+	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1) {
 		/*
 		 * Someone really wants this to run. If it's not unfair, run it.
 		 */
 		se = cfs_rq->next;
-	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
+	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1) {
 		/*
 		 * Prefer last buddy, try to return the CPU to a preempted task.
 		 */
Christian Borntraeger July 26, 2021, 6:41 p.m. UTC | #5
On 23.07.21 18:21, Mel Gorman wrote:
> On Fri, Jul 23, 2021 at 02:36:21PM +0200, Christian Borntraeger wrote:
>>> sched: Do not select highest priority task to run if it should be skipped
>>>
>>> <SNIP>
>>>
>>> index 44c452072a1b..ddc0212d520f 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -4522,7 +4522,8 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>>>    			se = second;
>>>    	}
>>> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
>>> +	if (cfs_rq->next &&
>>> +	    (cfs_rq->skip == left || wakeup_preempt_entity(cfs_rq->next, left) < 1)) {
>>>    		/*
>>>    		 * Someone really wants this to run. If it's not unfair, run it.
>>>    		 */
>>>
>>
>> I do see a reduction in ignored yields, but from a performance aspect for my
>> testcases this patch does not provide a benefit, while the simple
>> 	curr->vruntime += sysctl_sched_min_granularity;
>> does.
> 
> I'm still not a fan because vruntime gets distorted. From the docs
> 
>     Small detail: on "ideal" hardware, at any time all tasks would have the same
>     p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
>     would ever get "out of balance" from the "ideal" share of CPU time
> 
> If yield_to impacts this "ideal share" then it could have other
> consequences.
> 
> I think your patch may be performing better in your test case because every
> "wrong" task selected that is not the yield_to target gets penalised and
> so the yield_to target gets pushed up the list.
> 
>> I still think that your approach is probably the cleaner one, any chance to improve this
>> somehow?
>>
> 
> Potentially. The patch was a bit off because while it noticed that skip
> was not being obeyed, the fix was clumsy and isolated. The current flow is
> 
> 1. pick se == left as the candidate
> 2. try pick a different se if the "ideal" candidate is a skip candidate
> 3. Ignore the se update if next or last are set
> 
> Step 3 looks off because it ignores skip if next or last buddies are set
> and I don't think that was intended. Can you try this?
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 44c452072a1b..d56f7772a607 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4522,12 +4522,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>   			se = second;
>   	}
>   
> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> +	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1) {
>   		/*
>   		 * Someone really wants this to run. If it's not unfair, run it.
>   		 */
>   		se = cfs_rq->next;
> -	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
> +	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1) {
>   		/*
>   		 * Prefer last buddy, try to return the CPU to a preempted task.
>   		 */
> 

This one alone does not seem to make a difference. Neither in ignored yield, nor
in performance.

Your first patch does really help in terms of ignored yields when
all threads are pinned to one host CPU. After that we do have no ignored yield
it seems. But it does not affect the performance of my testcase.
I did some more experiments and I removed the wakeup_preempt_entity checks in
pick_next_entity - assuming that this will result in source always being stopped
and target always being picked. But still, no performance difference.
As soon as I play with vruntime I do see a difference (but only without the cpu cgroup
controller). I will try to better understand the scheduler logic and do some more
testing. If you have anything that I should test, let me know.

Christian
Mel Gorman July 26, 2021, 7:32 p.m. UTC | #6
On Mon, Jul 26, 2021 at 08:41:15PM +0200, Christian Borntraeger wrote:
> > Potentially. The patch was a bit off because while it noticed that skip
> > was not being obeyed, the fix was clumsy and isolated. The current flow is
> > 
> > 1. pick se == left as the candidate
> > 2. try pick a different se if the "ideal" candidate is a skip candidate
> > 3. Ignore the se update if next or last are set
> > 
> > Step 3 looks off because it ignores skip if next or last buddies are set
> > and I don't think that was intended. Can you try this?
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 44c452072a1b..d56f7772a607 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4522,12 +4522,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >   			se = second;
> >   	}
> > -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> > +	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1) {
> >   		/*
> >   		 * Someone really wants this to run. If it's not unfair, run it.
> >   		 */
> >   		se = cfs_rq->next;
> > -	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
> > +	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1) {
> >   		/*
> >   		 * Prefer last buddy, try to return the CPU to a preempted task.
> >   		 */
> > 
> 
> This one alone does not seem to make a difference. Neither in ignored yield, nor
> in performance.
> 
> Your first patch does really help in terms of ignored yields when
> all threads are pinned to one host CPU.

Ok, that tells us something. It implies, but does not prove, that the
block above that handles skip is failing either the entity_before()
test or the wakeup_preempt_entity() test. To what degree that should be
relaxed when cfs_rq->next is !NULL is harder to determine.

> After that we do have no ignored yield
> it seems. But it does not affect the performance of my testcase.

Ok, this is the first patch. The second patch is not improving ignored
yields at all so the above paragraph still applies. It would be nice
if you could instrument with trace_printk when cfs_rq->next is valid
whether it's the entity_before() check that is preventing the skip or
wakeup_preempt_entity. Would that be possible?
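
Such instrumentation might look roughly like this, dropped in at the end
of the skip-handling block of pick_next_entity() (hypothetical sketch;
the trace messages are made up):

	/* se still being the skip buddy here means it was not replaced;
	 * note "second" may already have been swapped to curr by the
	 * entity_before() test above. */
	if (se == cfs_rq->skip && cfs_rq->next) {
		if (!second)
			trace_printk("skip kept: no second candidate\n");
		else
			trace_printk("skip kept: wpe(second, left)=%d\n",
				     wakeup_preempt_entity(second, left));
	}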

I still think the second patch is right independent of it helping your
test case because it makes no sense to me at all that the task after the
skip candidate is ignored if there is a next or last buddy.

> I did some more experiments and I removed the wakeup_preempt_entity checks in
> pick_next_entity - assuming that this will result in source always being stopped
> and target always being picked. But still, no performance difference.
> As soon as I play with vruntime I do see a difference (but only without the cpu cgroup
> controller). I will try to better understand the scheduler logic and do some more
> testing. If you have anything that I should test, let me know.
> 

The fact that vruntime tricks only make a difference when cgroups are
involved is interesting. Can you describe roughly how the cgroup
is configured? Similarly, does your config have CONFIG_SCHED_AUTOGROUP
or CONFIG_FAIR_GROUP_SCHED set? I assume FAIR_GROUP_SCHED must be and
I wonder if the impact of your patch is dropping groups of tasks in
priority as opposed to individual tasks. I'm not that familiar with how
groups are handled in terms of how they are prioritised unfortunately.

I'm still hesitant to consider the vruntime hammer in case it causes
fairness problems when vruntime is no longer reflecting time spent on
the CPU.
Christian Borntraeger July 27, 2021, 6:59 a.m. UTC | #7
On 26.07.21 21:32, Mel Gorman wrote:
> On Mon, Jul 26, 2021 at 08:41:15PM +0200, Christian Borntraeger wrote:
>>> Potentially. The patch was a bit off because while it noticed that skip
>>> was not being obeyed, the fix was clumsy and isolated. The current flow is
>>>
>>> 1. pick se == left as the candidate
>>> 2. try pick a different se if the "ideal" candidate is a skip candidate
>>> 3. Ignore the se update if next or last are set
>>>
>>> Step 3 looks off because it ignores skip if next or last buddies are set
>>> and I don't think that was intended. Can you try this?
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 44c452072a1b..d56f7772a607 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -4522,12 +4522,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>>>    			se = second;
>>>    	}
>>> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
>>> +	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1) {
>>>    		/*
>>>    		 * Someone really wants this to run. If it's not unfair, run it.
>>>    		 */
>>>    		se = cfs_rq->next;
>>> -	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
>>> +	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1) {
>>>    		/*
>>>    		 * Prefer last buddy, try to return the CPU to a preempted task.
>>>    		 */
>>>
>>
>> This one alone does not seem to make a difference. Neither in ignored yield, nor
>> in performance.
>>
>> Your first patch does really help in terms of ignored yields when
>> all threads are pinned to one host CPU.
> 
> Ok, that tells us something. It implies, but does not prove, that the
> block above that handles skip is failing either the entity_before()
> test or the wakeup_preempt_entity() test. To what degree that should be
> relaxed when cfs_rq->next is !NULL is harder to determine.
> 
>> After that we do have no ignored yield
>> it seems. But it does not affect the performance of my testcase.
> 
> Ok, this is the first patch. The second patch is not improving ignored
> yields at all so the above paragraph still applies. It would be nice
> if you could instrument with trace_printk when cfs->rq_next is valid
> whether it's the entity_before() check that is preventing the skip or
> wakeup_preempt_entity. Would that be possible?

I will try that.
> 
> I still think the second patch is right independent of it helping your
> test case because it makes no sense to me at all that the task after the
> skip candidate is ignored if there is a next or last buddy.

I agree.  This patch makes sense to me as a bug fix.
And I think also the first patch makes sense on its own.
> 
>> I did some more experiments and I removed the wakeup_preempt_entity checks in
>> pick_next_entity - assuming that this will result in source always being stopped
>> and target always being picked. But still, no performance difference.
>> As soon as I play with vruntime I do see a difference (but only without the cpu cgroup
>> controller). I will try to better understand the scheduler logic and do some more
>> testing. If you have anything that I should test, let me know.
>>
> 
> The fact that vruntime tricks only make a difference when cgroups are
> involved is interesting. Can you describe roughly how the cgroup
> is configured? 

It's the other way around. My vruntime patch ONLY helps WITHOUT the cpu cgroup controller.
In other words this example on a 16CPU host (resulting in 4x overcommitment)
time ( for ((d=0; d<16; d++)) ; do cgexec -g cpu:test$d qemu-system-s390x -enable-kvm -kernel /root/REPOS/kvm-unit-tests/s390x/diag9c.elf  -smp 4 -nographic -nodefaults -device sclpconsole,chardev=c2 -chardev file,path=/tmp/log$d.log,id=c2  & done; wait)
does NOT benefit from the vruntime patch, but when I remove the "cgexec -g cpu:test$d" it does:
time ( for ((d=0; d<16; d++)) ; do qemu-system-s390x -enable-kvm -kernel /root/REPOS/kvm-unit-tests/s390x/diag9c.elf  -smp 4 -nographic -nodefaults -device sclpconsole,chardev=c2 -chardev file,path=/tmp/log$d.log,id=c2  & done; wait)
  

> Similarly, does your config have CONFIG_SCHED_AUTOGROUP
> or CONFIG_FAIR_GROUP_SCHED set? I assume FAIR_GROUP_SCHED must be and

Yes, both are set.
> I wonder if the impact of your patch is dropping groups of tasks in
> priority as opposed to individual tasks. I'm not that familiar with how
> groups are handled in terms of how they are prioritised unfortunately.
> 
> I'm still hesitant to consider the vruntime hammer in case it causes
> fairness problems when vruntime is no longer reflecting time spent on
> the CPU.

I understand your concerns. What about subtracting the same amount of
vruntime from the target as we add on the yielder? Would that result in
quicker rebalancing while still keeping everything in order?
The reason why I am asking is that initially we
realized that setting some tunables lower, e.g.
kernel.sched_latency_ns = 2000000
kernel.sched_migration_cost_ns = 100000
makes things faster in a similar fashion. And that also works with cgroups.
But ideally we find a solution without changing tunables.
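
As a sketch, the symmetric variant floated above could look like this in
yield_to_task_fair() (untested; note the target's entity is still
enqueued in the rbtree, so a real implementation would need to
dequeue/requeue it to keep the tree ordered):

	struct sched_entity *se = &p->se;		/* yield target */
	struct sched_entity *curr = &rq->curr->se;	/* yielder */
	u64 penalty = sched_slice(cfs_rq_of(curr), curr);

	/* Penalising the yielder only delays when it runs again. */
	curr->vruntime += penalty;
	/* Crediting the target lets it run ahead of everyone else on
	 * the runqueue, which is the controversial half of the idea. */
	se->vruntime -= penalty;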
Peter Zijlstra July 27, 2021, 1:29 p.m. UTC | #8
On Fri, Jul 23, 2021 at 05:21:37PM +0100, Mel Gorman wrote:

> I'm still not a fan because vruntime gets distorted. From the docs
> 
>    Small detail: on "ideal" hardware, at any time all tasks would have the same
>    p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
>    would ever get "out of balance" from the "ideal" share of CPU time
> 
> If yield_to impacts this "ideal share" then it could have other
> consequences.

Note that we already violate this ideal both subtly and explicitly.

For an example of the latter consider pretty much the entirety of
place_entity() with GENTLE_FAIR_SLEEPERS being the most egregious
example.

That said, adding to vruntime will only penalize the task itself, while
subtracting from vruntime will penalize everyone else. And in that sense
addition to vruntime is a safe option.
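
(For context: update_curr() advances virtual runtime roughly as

	curr->vruntime += calc_delta_fair(delta_exec, curr);
	/* i.e. approximately delta_exec * NICE_0_LOAD / se->load.weight */

and pick_next_entity() favours the smallest vruntime, so inflating curr's
vruntime only moves curr back in the queue, while deflating another
entity's vruntime moves it ahead of every other queued task.)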

I've not fully considered the case at hand; just wanted to give some
context.
Peter Zijlstra July 27, 2021, 1:33 p.m. UTC | #9
On Fri, Jul 23, 2021 at 10:35:23AM +0100, Mel Gorman wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 44c452072a1b..ddc0212d520f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4522,7 +4522,8 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  			se = second;
>  	}
>  
> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> +	if (cfs_rq->next &&
> +	    (cfs_rq->skip == left || wakeup_preempt_entity(cfs_rq->next, left) < 1)) {
>  		/*
>  		 * Someone really wants this to run. If it's not unfair, run it.
>  		 */

With a little more context this function reads like:

	se = left;

	if (cfs_rq->skip && cfs_rq->skip == se) {
		...
+		if (cfs_rq->next && (cfs_rq->skip == left || ...))

If '...' doesn't change @left (afaict it doesn't), then your change (+)
is equivalent to '&& true', or am I reading things wrong?
Mel Gorman July 27, 2021, 2:31 p.m. UTC | #10
On Tue, Jul 27, 2021 at 03:33:00PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 23, 2021 at 10:35:23AM +0100, Mel Gorman wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 44c452072a1b..ddc0212d520f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4522,7 +4522,8 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >  			se = second;
> >  	}
> >  
> > -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> > +	if (cfs_rq->next &&
> > +	    (cfs_rq->skip == left || wakeup_preempt_entity(cfs_rq->next, left) < 1)) {
> >  		/*
> >  		 * Someone really wants this to run. If it's not unfair, run it.
> >  		 */
> 
> With a little more context this function reads like:
> 
> 	se = left;
> 
> 	if (cfs_rq->skip && cfs_rq->skip == se) {
> 		...
> +		if (cfs_rq->next && (cfs_rq->skip == left || ...))
> 
> If '...' doesn't change @left (afaict it doesn't), then your change (+)
> is equivalent to '&& true', or am I reading things wrong?

You're not reading it wrong although the patch is clumsy and may introduce
unfairness that gets incrementally worse if there were repeated yields to
the same task. A second patch was posted that does

-       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
+       if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1) {

i.e. if the skip hint picks a second alternative then next or last buddies
should be compared to the second alternative and not "left". It doesn't
help, indicating that the skip hint is not obeyed because "second" failed
either the entity_before() or the wakeup_preempt_entity() check. I'm waiting on a
trace to see which check dominates.

That said, I'm still undecided on how to approach this. None of the
proposed patches on their own helps but the options are

1. Strictly obey the next buddy if the skip hint is the same se as left
   (first patch which I'm not very happy with even if it helped the
   test case)

2. My second patch which compares next/last with "second" if the skip
   hint skips "left". This may be a sensible starting point no matter
   what

3. Relaxing how "second" is selected if next or last buddies are set

4. vruntime tricks even if it punishes fairness for the task yielding
   the CPU. The advantage of this approach is if there are multiple tasks
   ahead of the task being yielded to then yield_to task will become
   "left" very quickly regardless of any buddy-related hints.

I don't know what "3" would look like yet; it might be very fragile, but
let's see what the tracing says. Otherwise, testing 2+4 might be worthwhile
to see if the combination helps Christian's test case when the cpu cgroup
is involved.
Benjamin Segall July 27, 2021, 6:57 p.m. UTC | #11
Christian Borntraeger <borntraeger@de.ibm.com> writes:

> On 23.07.21 18:21, Mel Gorman wrote:
>> On Fri, Jul 23, 2021 at 02:36:21PM +0200, Christian Borntraeger wrote:
>>>> sched: Do not select highest priority task to run if it should be skipped
>>>>
>>>> <SNIP>
>>>>
>>>> index 44c452072a1b..ddc0212d520f 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -4522,7 +4522,8 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>>>>    			se = second;
>>>>    	}
>>>> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
>>>> +	if (cfs_rq->next &&
>>>> +	    (cfs_rq->skip == left || wakeup_preempt_entity(cfs_rq->next, left) < 1)) {
>>>>    		/*
>>>>    		 * Someone really wants this to run. If it's not unfair, run it.
>>>>    		 */
>>>>
>>>
>>> I do see a reduction in ignored yields, but from a performance aspect for my
>>> testcases this patch does not provide a benefit, while the simple
>>> 	curr->vruntime += sysctl_sched_min_granularity;
>>> does.
>> I'm still not a fan because vruntime gets distorted. From the docs
>>     Small detail: on "ideal" hardware, at any time all tasks would have the
>> same
>>     p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
>>     would ever get "out of balance" from the "ideal" share of CPU time
>> If yield_to impacts this "ideal share" then it could have other
>> consequences.
>> I think your patch may be performing better in your test case because every
>> "wrong" task selected that is not the yield_to target gets penalised and
>> so the yield_to target gets pushed up the list.
>> 
>>> I still think that your approach is probably the cleaner one, any chance to improve this
>>> somehow?
>>>
>> Potentially. The patch was a bit off because while it noticed that skip
>> was not being obeyed, the fix was clumsy and isolated. The current flow is
>> 1. pick se == left as the candidate
>> 2. try pick a different se if the "ideal" candidate is a skip candidate
>> 3. Ignore the se update if next or last are set
>> Step 3 looks off because it ignores skip if next or last buddies are set
>> and I don't think that was intended. Can you try this?
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 44c452072a1b..d56f7772a607 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4522,12 +4522,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>>   			se = second;
>>   	}
>>   -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
>> +	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1) {
>>   		/*
>>   		 * Someone really wants this to run. If it's not unfair, run it.
>>   		 */
>>   		se = cfs_rq->next;
>> -	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
>> +	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1) {
>>   		/*
>>   		 * Prefer last buddy, try to return the CPU to a preempted task.
>>   		 */
>> 
>
> This one alone does not seem to make a difference. Neither in ignored yield, nor
> in performance.
>
> Your first patch does really help in terms of ignored yields when
> all threads are pinned to one host CPU. After that we do have no ignored yield
> it seems. But it does not affect the performance of my testcase.
> I did some more experiments and I removed the wakeup_preempt_entity checks in
> pick_next_entity - assuming that this will result in source always being stopped
> and target always being picked. But still, no performance difference.
> As soon as I play with vruntime I do see a difference (but only without the cpu cgroup
> controller). I will try to better understand the scheduler logic and do some more
> testing. If you have anything that I should test, let me know.
>
> Christian

If both yielder and target are in the same cpu cgroup or the cpu cgroup
is disabled (ie, if cfs_rq_of(p->se) matches), you could try

if (p->se.vruntime > rq->curr->se.vruntime)
	swap(p->se.vruntime, rq->curr->se.vruntime)

as well as the existing buddy flags, as an entirely fair vruntime boost
to the target.

For when they aren't direct siblings, you /could/ use find_matching_se,
but it's much less clear that's desirable, since it would yield vruntime
for the entire hierarchy to the target's hierarchy.
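
Spelled out inside yield_to_task_fair(), that suggestion might look like
the following (hypothetical sketch; as with the other vruntime tricks,
the target is still enqueued, so its rbtree position would strictly need
a requeue after the swap):

	struct sched_entity *se = &p->se;		/* yield target */
	struct sched_entity *curr = &rq->curr->se;	/* yielder */

	/* An entirely fair boost: trade vruntimes instead of inventing
	 * new ones, so the total is unchanged. Only valid when both
	 * entities compete on the same cfs_rq. */
	if (cfs_rq_of(se) == cfs_rq_of(curr) && se->vruntime > curr->vruntime)
		swap(se->vruntime, curr->vruntime);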
Christian Borntraeger July 28, 2021, 4:23 p.m. UTC | #12
On 27.07.21 20:57, Benjamin Segall wrote:
> Christian Borntraeger <borntraeger@de.ibm.com> writes:
> 
>> On 23.07.21 18:21, Mel Gorman wrote:
>>> On Fri, Jul 23, 2021 at 02:36:21PM +0200, Christian Borntraeger wrote:
>>>>> sched: Do not select highest priority task to run if it should be skipped
>>>>>
>>>>> <SNIP>
>>>>>
>>>>> index 44c452072a1b..ddc0212d520f 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -4522,7 +4522,8 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>>>>>     			se = second;
>>>>>     	}
>>>>> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
>>>>> +	if (cfs_rq->next &&
>>>>> +	    (cfs_rq->skip == left || wakeup_preempt_entity(cfs_rq->next, left) < 1)) {
>>>>>     		/*
>>>>>     		 * Someone really wants this to run. If it's not unfair, run it.
>>>>>     		 */
>>>>>
>>>>
>>>> I do see a reduction in ignored yields, but from a performance aspect for my
>>>> testcases this patch does not provide a benefit, while the simple
>>>> 	curr->vruntime += sysctl_sched_min_granularity;
>>>> does.
>>> I'm still not a fan because vruntime gets distorted. From the docs
>>>      Small detail: on "ideal" hardware, at any time all tasks would have the
>>> same
>>>      p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
>>>      would ever get "out of balance" from the "ideal" share of CPU time
>>> If yield_to impacts this "ideal share" then it could have other
>>> consequences.
>>> I think your patch may be performing better in your test case because every
>>> "wrong" task selected that is not the yield_to target gets penalised and
>>> so the yield_to target gets pushed up the list.
>>>
>>>> I still think that your approach is probably the cleaner one, any chance to improve this
>>>> somehow?
>>>>
>>> Potentially. The patch was a bit off because while it noticed that skip
>>> was not being obeyed, the fix was clumsy and isolated. The current flow is
>>> 1. pick se == left as the candidate
>>> 2. try pick a different se if the "ideal" candidate is a skip candidate
>>> 3. Ignore the se update if next or last are set
>>> Step 3 looks off because it ignores skip if next or last buddies are set
>>> and I don't think that was intended. Can you try this?
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 44c452072a1b..d56f7772a607 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -4522,12 +4522,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>>>    			se = second;
>>>    	}
>>>    -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
>>> +	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1) {
>>>    		/*
>>>    		 * Someone really wants this to run. If it's not unfair, run it.
>>>    		 */
>>>    		se = cfs_rq->next;
>>> -	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
>>> +	} else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1) {
>>>    		/*
>>>    		 * Prefer last buddy, try to return the CPU to a preempted task.
>>>    		 */
>>>
>>
>> This one alone does not seem to make a difference. Neither in ignored yield, nor
>> in performance.
>>
>> Your first patch does really help in terms of ignored yields when
>> all threads are pinned to one host CPU. After that we do have no ignored yield
>> it seems. But it does not affect the performance of my testcase.
>> I did some more experiments and I removed the wakeup_preempt_entity checks in
>> pick_next_entity - assuming that this will result in source always being stopped
>> and target always being picked. But still, no performance difference.
>> As soon as I play with vruntime I do see a difference (but only without the cpu cgroup
>> controller). I will try to better understand the scheduler logic and do some more
>> testing. If you have anything that I should test, let me know.
>>
>> Christian
> 
> If both yielder and target are in the same cpu cgroup or the cpu cgroup
> is disabled (ie, if cfs_rq_of(p->se) matches), you could try
> 
> if (p->se.vruntime > rq->curr->se.vruntime)
> 	swap(p->se.vruntime, rq->curr->se.vruntime)

I tried that and it does not show the performance benefit. I then played with my
patch (using different values to add) and the benefit seems to depend on the
size being added; maybe when swapping it was just not large enough.

I have to say that it is all a bit unclear what improves performance and why.
It just seems that the cpu cgroup controller has a fair share of the performance
problems.

I also asked the performance people to run some measurements and the numbers for
a transactional workload under KVM were:
baseline: 11813
with much smaller sched_latency_ns and sched_migration_cost_ns: 16419
with cpu controller disabled: 15962
with cpu controller disabled + my patch: 16782

I will be travelling the next 2 weeks, so I can continue with more debugging
after that.

Thanks for all the ideas and help so far.

Christian

> as well as the existing buddy flags, as an entirely fair vruntime boost
> to the target.
> 
> For when they aren't direct siblings, you /could/ use find_matching_se,
> but it's much less clear that's desirable, since it would yield vruntime
> for the entire hierarchy to the target's hierarchy.
>
Vincent Guittot Aug. 10, 2021, 8:49 a.m. UTC | #13
On Wed, 28 Jul 2021 at 18:24, Christian Borntraeger
<borntraeger@de.ibm.com> wrote:
>
>
>
> On 27.07.21 20:57, Benjamin Segall wrote:
> > Christian Borntraeger <borntraeger@de.ibm.com> writes:
> >
> >> On 23.07.21 18:21, Mel Gorman wrote:
> >>> On Fri, Jul 23, 2021 at 02:36:21PM +0200, Christian Borntraeger wrote:
> >>>>> sched: Do not select highest priority task to run if it should be skipped
> >>>>>
> >>>>> <SNIP>
> >>>>>
> >>>>> index 44c452072a1b..ddc0212d520f 100644
> >>>>> --- a/kernel/sched/fair.c
> >>>>> +++ b/kernel/sched/fair.c
> >>>>> @@ -4522,7 +4522,8 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >>>>>                           se = second;
> >>>>>           }
> >>>>> - if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> >>>>> + if (cfs_rq->next &&
> >>>>> +     (cfs_rq->skip == left || wakeup_preempt_entity(cfs_rq->next, left) < 1)) {
> >>>>>                   /*
> >>>>>                    * Someone really wants this to run. If it's not unfair, run it.
> >>>>>                    */
> >>>>>
> >>>>
> >>>> I do see a reduction in ignored yields, but from a performance aspect for my
> >>>> testcases this patch does not provide a benefit, while the simple
> >>>>    curr->vruntime += sysctl_sched_min_granularity;
> >>>> does.
> >>> I'm still not a fan because vruntime gets distorted. From the docs
> >>>      Small detail: on "ideal" hardware, at any time all tasks would have the
> >>> same
> >>>      p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
> >>>      would ever get "out of balance" from the "ideal" share of CPU time
> >>> If yield_to impacts this "ideal share" then it could have other
> >>> consequences.
> >>> I think your patch may be performing better in your test case because every
> >>> "wrong" task selected that is not the yield_to target gets penalised and
> >>> so the yield_to target gets pushed up the list.
> >>>
> >>>> I still think that your approach is probably the cleaner one, any chance to improve this
> >>>> somehow?
> >>>>
> >>> Potentially. The patch was a bit off because while it noticed that skip
> >>> was not being obeyed, the fix was clumsy and isolated. The current flow is
> >>> 1. pick se == left as the candidate
> >>> 2. try pick a different se if the "ideal" candidate is a skip candidate
> >>> 3. Ignore the se update if next or last are set
> >>> Step 3 looks off because it ignores skip if next or last buddies are set
> >>> and I don't think that was intended. Can you try this?
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index 44c452072a1b..d56f7772a607 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -4522,12 +4522,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >>>                     se = second;
> >>>     }
> >>>    -        if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> >>> +   if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1) {
> >>>             /*
> >>>              * Someone really wants this to run. If it's not unfair, run it.
> >>>              */
> >>>             se = cfs_rq->next;
> >>> -   } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
> >>> +   } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1) {
> >>>             /*
> >>>              * Prefer last buddy, try to return the CPU to a preempted task.
> >>>              */
> >>>
> >>
> >> This one alone does not seem to make a difference. Neither in ignored yield, nor
> >> in performance.
> >>
> >> Your first patch does really help in terms of ignored yields when
> >> all threads are pinned to one host CPU. After that we do have no ignored yield
> >> it seems. But it does not affect the performance of my testcase.
> >> I did some more experiments and I removed the wakeup_preempt_entity checks in
> >> pick_next_entity - assuming that this will result in source always being stopped
> >> and target always being picked. But still, no performance difference.
> >> As soon as I play with vruntime I do see a difference (but only without the cpu cgroup
> >> controller). I will try to better understand the scheduler logic and do some more
> >> testing. If you have anything that I should test, let me know.
> >>
> >> Christian
> >
> > If both yielder and target are in the same cpu cgroup or the cpu cgroup
> > is disabled (ie, if cfs_rq_of(p->se) matches), you could try
> >
> > if (p->se.vruntime > rq->curr->se.vruntime)
> >       swap(p->se.vruntime, rq->curr->se.vruntime)
>
> I tried that and it does not show the performance benefit. I then played with my
> patch (using different values to add) and the benefit seems to depend on the
> size being added; maybe when swapping it was just not large enough.
>
> I have to say that it is all a bit unclear what improves performance and why.
> It just seems that the cpu cgroup controller has a fair share of the performance
> problems.
>
> I also asked the performance people to run some measurements and the numbers for
> a transactional workload under KVM were:
> baseline: 11813
> with much smaller sched_latency_ns and sched_migration_cost_ns: 16419

Have you also tried increasing sched_wakeup_granularity, which is used
to decide whether we can preempt curr?
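
(For reference, the check in question has roughly this shape, condensed
from wakeup_preempt_entity() in fair.c; a return of 1 means "preempt":

	s64 gran, vdiff = curr->vruntime - se->vruntime;

	if (vdiff <= 0)
		return -1;

	gran = wakeup_gran(se);	/* scaled from sysctl_sched_wakeup_granularity */
	if (vdiff > gran)
		return 1;

	return 0;

so a larger sched_wakeup_granularity widens the window in which the
buddy checks in pick_next_entity() accept the next/last buddy over
"left".)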

> with cpu controller disabled: 15962
> with cpu controller disabled + my patch: 16782

Your patch modifies the vruntime of the task, but the cgroup sched_entity
stays unchanged. The scheduler compares the vruntime of the group's
sched_entity before it ever reaches your task. That's probably
why your patch doesn't help with cgroups, and it will be difficult to
extend to cgroups because the yield of one task should not impact the
other entities in the group.
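
(A hypothetical way to make the penalty hierarchy-aware would be to walk
both hierarchies with find_matching_se() and penalise the yielder's
entity at the level where it actually competes with the target, e.g.:

	struct sched_entity *yielder = &rq->curr->se;
	struct sched_entity *target = &p->se;

	/* Walk up until both entities share a cfs_rq. */
	find_matching_se(&yielder, &target);

	if (yielder)
		yielder->vruntime += sched_slice(cfs_rq_of(yielder), yielder);

Untested sketch, and it runs into exactly the problem described above: at
the common level the penalised entity may represent a whole group, so the
yield would impact every task in the yielder's group.)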

>
> I will be travelling the next 2 weeks, so I can continue with more debugging
> after that.
>
> Thanks for all the ideas and help so far.
>
> Christian
>
> > as well as the existing buddy flags, as an entirely fair vruntime boost
> > to the target.
> >
> > For when they aren't direct siblings, you /could/ use find_matching_se,
> > but it's much less clear that's desirable, since it would yield vruntime
> > for the entire hierarchy to the target's hierarchy.
> >
diff mbox series

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23663318fb81..4f661a9ed3ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7337,6 +7337,7 @@  static void yield_task_fair(struct rq *rq)
 static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
+	struct sched_entity *curr = &rq->curr->se;
 
 	/* throttled hierarchies are not runnable */
 	if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
@@ -7347,6 +7348,16 @@  static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
 
 	yield_task_fair(rq);
 
+	/*
+	 * This path is special and only called from KVM. In contrast to yield,
+	 * in yield_to we really know that current is spinning and we know
+	 * (s390) or have good heuristics whom we are waiting for. There is
+	 * absolutely no point in continuing the current task, even if this
+	 * means becoming unfairer. Let us give the current process some
+	 * "fake" penalty.
+	 */
+	curr->vruntime += sched_slice(cfs_rq_of(curr), curr);
+
 	return true;
 }