
[RFC] Improving directed yield scalability for PLE handler

Message ID 20120913114813.GA11797@linux.vnet.ibm.com (mailing list archive)
State New, archived

Commit Message

Raghavendra K T Sept. 13, 2012, 11:48 a.m. UTC
* Andrew Theurer <habanero@linux.vnet.ibm.com> [2012-09-11 13:27:41]:

> On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote:
> > On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> > > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
> > >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> > >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
> > >>>> +{
> > >>>> +     if (!curr->sched_class->yield_to_task)
> > >>>> +             return false;
> > >>>> +
> > >>>> +     if (curr->sched_class != p->sched_class)
> > >>>> +             return false;
> > >>>
> > >>>
> > >>> Peter,
> > >>>
> > >>> Should we also add a check if the runq has a skip buddy (as pointed out
> > >>> by Raghu) and return if the skip buddy is already set.
> > >>
> > >> Oh right, I missed that suggestion.. the performance improvement went
> > >> from 81% to 139% using this, right?
> > >>
> > >> It might make more sense to keep that separate, outside of this
> > >> function, since it's not a strict prerequisite.
> > >>
> > >>>>
> > >>>> +     if (task_running(p_rq, p) || p->state)
> > >>>> +             return false;
> > >>>> +
> > >>>> +     return true;
> > >>>> +}
> > >>
> > >>
> > >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> > >>> bool preempt)
> > >>>>        rq = this_rq();
> > >>>>
> > >>>>   again:
> > >>>> +     /* optimistic test to avoid taking locks */
> > >>>> +     if (!__yield_to_candidate(curr, p))
> > >>>> +             goto out_irq;
> > >>>> +
> > >>
> > >> So add something like:
> > >>
> > >> 	/* Optimistic, if we 'raced' with another yield_to(), don't bother */
> > >> 	if (p_rq->cfs_rq->skip)
> > >> 		goto out_irq;
> > >>>
> > >>>
> > >>>>        p_rq = task_rq(p);
> > >>>>        double_rq_lock(rq, p_rq);
> > >>>
> > >>>
> > >> But I do have a question on this optimization though,.. Why do we check
> > >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
> > >>
> > >> That is, I'd like to see this thing explained a little better.
> > >>
> > >> Does it go something like: p_rq is the runqueue of the task we'd like to
> > >> yield to, rq is our own, they might be the same. If we have a ->skip,
> > >> there's nothing we can do about it, OTOH p_rq having a ->skip and
> > >> failing the yield_to() simply means us picking the next VCPU thread,
> > >> which might be running on an entirely different cpu (rq) and could
> > >> succeed?
> > >
> > > Here are two new versions; both include a __yield_to_candidate(): "v3"
> > > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> > > skip check.  Raghu, I am not sure if this is exactly what you want
> > > implemented in v4.
> > >
> > 
> > Andrew, yes, that is what I had. I think there was a misunderstanding.
> > My intention was: if a directed yield has already happened in a
> > runqueue (say rqA), do not bother to directed-yield to it again. But
> > unfortunately, as PeterZ pointed out, that would have resulted in
> > setting the next buddy of a different runqueue than rqA.
> > So we can drop this "skip" idea. Pondering more over what to do... can
> > we use the next buddy itself? Thinking...
> 
> As I mentioned earlier today, I did not have your changes from kvm.git
> tree when I tested my changes.  Here are your changes and my changes
> compared:
> 
> 			  throughput in MB/sec
> 
> kvm_vcpu_on_spin changes:  4636 +/- 15.74%
> yield_to changes:	   4515 +/- 12.73%
> 
> I would be inclined to stick with your changes which are kept in kvm
> code.  I did try both combined, and did not get good results:
> 
> both changes:		   4074 +/- 19.12%
> 
> So, having both is probably not a good idea.  However, I feel like
> there's more work to be done.  With no over-commit (10 VMs), total
> throughput is 23427 +/- 2.76%.  A 2x over-commit will no doubt have some
> overhead, but a reduction to ~4500 is still terrible.  By contrast,
> 8-way VMs with 2x over-commit have a total throughput roughly 10% less
> than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread
> host).  We still have what appears to be scalability problems, but now
> it's not so much in runqueue locks for yield_to(), but now
> get_pid_task():
>

Hi Andrew,
IMHO, reducing the double runqueue lock overhead is a good idea,
and maybe we will see the benefit when we increase the overcommit further.

The explanation for not seeing much benefit on top of the PLE handler
optimization patch is that we already filter the yield_to candidates,
which results in less contention on the double runqueue lock; the extra
code overhead during a genuine yield_to might have caused some
degradation in the case you tested.
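
(For readers following along: the filtering referred to above happens on
the KVM side, before yield_to() is ever called. Below is a simplified,
from-memory sketch of that loop -- it is not the exact code in the kvm.git
tree under discussion, and vcpu_eligible_for_directed_yield() is a
stand-in name for the real eligibility heuristic.)

static void ple_handler_sketch(struct kvm_vcpu *me)
{
	struct kvm *kvm = me->kvm;
	struct kvm_vcpu *vcpu;
	int i;

	/* The real loop also rotates its starting point (last_boosted_vcpu)
	 * so the same vcpu is not boosted every time. */
	kvm_for_each_vcpu(i, vcpu, kvm) {
		if (vcpu == me)
			continue;
		if (waitqueue_active(&vcpu->wq))
			continue;	/* halted, not preempted: not a candidate */
		if (!vcpu_eligible_for_directed_yield(vcpu))
			continue;	/* filtered out by the PLE heuristic */
		if (kvm_vcpu_yield_to(vcpu) > 0) {
			kvm->last_boosted_vcpu = i;
			break;
		}
	}
}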

However, did you also use the cfs.next check? I hope it helps when we combine the two.

Here are the results, showing a positive benefit.
I experimented on a 32-core (no HT) PLE machine with 32-vcpu guest(s).
  
+-----------+-----------+-----------+------------+-----------+
        kernbench time in sec, lower is better 
+-----------+-----------+-----------+------------+-----------+
       base      stddev     patched     stddev      %improve
+-----------+-----------+-----------+------------+-----------+
1x    44.3880     1.8699    40.8180     1.9173	   8.04271
2x    96.7580     4.2787    93.4188     3.5150	   3.45108
+-----------+-----------+-----------+------------+-----------+


+-----------+-----------+-----------+------------+-----------+
        ebizzy record/sec higher is better
+-----------+-----------+-----------+------------+-----------+
       base      stddev     patched     stddev      %improve
+-----------+-----------+-----------+------------+-----------+
1x  2374.1250    50.9718   3816.2500    54.0681	  60.74343
2x  2536.2500    93.0403   2789.3750   204.7897	   9.98029
+-----------+-----------+-----------+------------+-----------+


Below is a patch that combines PeterZ's suggestions on your
original approach with the cfs.next check (already posted by Srikar in the
other thread).

----8<----


Comments

Andrew Theurer Sept. 13, 2012, 9:30 p.m. UTC | #1
On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote:
> * Andrew Theurer <habanero@linux.vnet.ibm.com> [2012-09-11 13:27:41]:
> 
> > On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote:
> > > On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> > > > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
> > > >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> > > >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
> > > >>>> +{
> > > >>>> +     if (!curr->sched_class->yield_to_task)
> > > >>>> +             return false;
> > > >>>> +
> > > >>>> +     if (curr->sched_class != p->sched_class)
> > > >>>> +             return false;
> > > >>>
> > > >>>
> > > >>> Peter,
> > > >>>
> > > >>> Should we also add a check if the runq has a skip buddy (as pointed out
> > > >>> by Raghu) and return if the skip buddy is already set.
> > > >>
> > > >> Oh right, I missed that suggestion.. the performance improvement went
> > > >> from 81% to 139% using this, right?
> > > >>
> > > >> It might make more sense to keep that separate, outside of this
> > > >> function, since it's not a strict prerequisite.
> > > >>
> > > >>>>
> > > >>>> +     if (task_running(p_rq, p) || p->state)
> > > >>>> +             return false;
> > > >>>> +
> > > >>>> +     return true;
> > > >>>> +}
> > > >>
> > > >>
> > > >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> > > >>> bool preempt)
> > > >>>>        rq = this_rq();
> > > >>>>
> > > >>>>   again:
> > > >>>> +     /* optimistic test to avoid taking locks */
> > > >>>> +     if (!__yield_to_candidate(curr, p))
> > > >>>> +             goto out_irq;
> > > >>>> +
> > > >>
> > > >> So add something like:
> > > >>
> > > >> 	/* Optimistic, if we 'raced' with another yield_to(), don't bother */
> > > >> 	if (p_rq->cfs_rq->skip)
> > > >> 		goto out_irq;
> > > >>>
> > > >>>
> > > >>>>        p_rq = task_rq(p);
> > > >>>>        double_rq_lock(rq, p_rq);
> > > >>>
> > > >>>
> > > >> But I do have a question on this optimization though,.. Why do we check
> > > >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
> > > >>
> > > >> That is, I'd like to see this thing explained a little better.
> > > >>
> > > >> Does it go something like: p_rq is the runqueue of the task we'd like to
> > > >> yield to, rq is our own, they might be the same. If we have a ->skip,
> > > >> there's nothing we can do about it, OTOH p_rq having a ->skip and
> > > >> failing the yield_to() simply means us picking the next VCPU thread,
> > > >> which might be running on an entirely different cpu (rq) and could
> > > >> succeed?
> > > >
> > > > Here are two new versions; both include a __yield_to_candidate(): "v3"
> > > > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> > > > skip check.  Raghu, I am not sure if this is exactly what you want
> > > > implemented in v4.
> > > >
> > > 
> > > Andrew, yes, that is what I had. I think there was a misunderstanding.
> > > My intention was: if a directed yield has already happened in a
> > > runqueue (say rqA), do not bother to directed-yield to it again. But
> > > unfortunately, as PeterZ pointed out, that would have resulted in
> > > setting the next buddy of a different runqueue than rqA.
> > > So we can drop this "skip" idea. Pondering more over what to do... can
> > > we use the next buddy itself? Thinking...
> > 
> > As I mentioned earlier today, I did not have your changes from kvm.git
> > tree when I tested my changes.  Here are your changes and my changes
> > compared:
> > 
> > 			  throughput in MB/sec
> > 
> > kvm_vcpu_on_spin changes:  4636 +/- 15.74%
> > yield_to changes:	   4515 +/- 12.73%
> > 
> > I would be inclined to stick with your changes which are kept in kvm
> > code.  I did try both combined, and did not get good results:
> > 
> > both changes:		   4074 +/- 19.12%
> > 
> > So, having both is probably not a good idea.  However, I feel like
> > there's more work to be done.  With no over-commit (10 VMs), total
> > throughput is 23427 +/- 2.76%.  A 2x over-commit will no doubt have some
> > overhead, but a reduction to ~4500 is still terrible.  By contrast,
> > 8-way VMs with 2x over-commit have a total throughput roughly 10% less
> > than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread
> > host).  We still have what appears to be scalability problems, but now
> > it's not so much in runqueue locks for yield_to(), but now
> > get_pid_task():
> >
> 
> Hi Andrew,
> IMHO, reducing the double runqueue lock overhead is a good idea,
> and maybe we will see the benefit when we increase the overcommit further.
> 
> The explanation for not seeing much benefit on top of the PLE handler
> optimization patch is that we already filter the yield_to candidates,
> which results in less contention on the double runqueue lock; the extra
> code overhead during a genuine yield_to might have caused some
> degradation in the case you tested.
> 
> However, did you also use the cfs.next check? I hope it helps when we combine the two.
> 
> Here are the results, showing a positive benefit.
> I experimented on a 32-core (no HT) PLE machine with 32-vcpu guest(s).
>   
> +-----------+-----------+-----------+------------+-----------+
>         kernbench time in sec, lower is better 
> +-----------+-----------+-----------+------------+-----------+
>        base      stddev     patched     stddev      %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x    44.3880     1.8699    40.8180     1.9173	   8.04271
> 2x    96.7580     4.2787    93.4188     3.5150	   3.45108
> +-----------+-----------+-----------+------------+-----------+
> 
> 
> +-----------+-----------+-----------+------------+-----------+
>         ebizzy record/sec higher is better
> +-----------+-----------+-----------+------------+-----------+
>        base      stddev     patched     stddev      %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x  2374.1250    50.9718   3816.2500    54.0681	  60.74343
> 2x  2536.2500    93.0403   2789.3750   204.7897	   9.98029
> +-----------+-----------+-----------+------------+-----------+
> 
> 
> Below is a patch that combines PeterZ's suggestions on your
> original approach with the cfs.next check (already posted by Srikar in the
> other thread).

I did get a chance to run with the below patch and your changes in
kvm.git, but the results were not too different:

Dbench, 10 x 16-way VMs on 80-way host:

kvm_vcpu_on_spin changes:  4636 +/- 15.74%
yield_to changes:	   4515 +/- 12.73%
both changes from above:   4074 +/- 19.12%
...plus cfs.next check:    4418 +/- 16.97%

Still hovering around 4500 MB/sec.

The concern I have is that even though we have gone through changes to
help reduce the candidate vcpus we yield to, we still have a very poor
idea of which vcpu really needs to run.  The result is high cpu usage in
the get_pid_task and still some contention in the double runqueue lock.
To make this scalable, we either need to significantly reduce the
occurrence of the lock-holder preemption, or do a much better job of
knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
which do not need to run).
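
(For context on where the get_pid_task() time goes: before yield_to() can
be called, the chosen vcpu has to be turned into a task_struct. Roughly --
this is a from-memory sketch, not the exact kvm.git code -- that step
looks like the following.)

int kvm_vcpu_yield_to_sketch(struct kvm_vcpu *target)
{
	struct pid *pid;
	struct task_struct *task = NULL;
	int yielded = 0;

	rcu_read_lock();
	pid = rcu_dereference(target->pid);
	if (pid)
		task = get_pid_task(pid, PIDTYPE_PID);	/* takes a task reference */
	rcu_read_unlock();
	if (!task)
		return 0;
	if (task->flags & PF_VCPU)	/* already running guest code: skip it */
		goto out;

	yielded = yield_to(task, 1);	/* may take the double runqueue lock */
out:
	put_task_struct(task);
	return yielded;
}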

On reducing the occurrence:  The worst case for lock-holder preemption
is having vcpus of same VM on the same runqueue.  This guarantees the
situation of 1 vcpu running while another [of the same VM] is not.  To
prove the point, I ran the same test, but with vcpus restricted to a
range of host cpus, such that any single VM's vcpus can never be on the
same runqueue.  In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
vcpu-1's are on host cpus 5-9, and so on.  Here is the result:

kvm_cpu_spin, and all
yield_to changes, plus
restricted vcpu placement:  8823 +/- 3.20%   much, much better
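
(One way to reproduce the placement -- not necessarily how it was done
here -- is a small host-side helper using sched_setaffinity(); a minimal
sketch, assuming the vcpu thread TIDs have already been collected, e.g.
from /proc/<qemu-pid>/task:)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

/* Pin one vcpu thread to the 5-cpu block reserved for its vcpu index:
 * vcpu-0 threads -> host cpus 0-4, vcpu-1 threads -> cpus 5-9, and so on,
 * so two vcpus of the same VM can never share a runqueue. */
static int pin_vcpu_thread(pid_t tid, int vcpu_index)
{
	cpu_set_t set;
	int cpu, base = vcpu_index * 5;

	CPU_ZERO(&set);
	for (cpu = base; cpu < base + 5; cpu++)
		CPU_SET(cpu, &set);

	if (sched_setaffinity(tid, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return -1;
	}
	return 0;
}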

On picking a better vcpu to yield to:  I really hesitate to rely on a
paravirt hint [telling us which vcpu is holding a lock], but I am not
sure how else to reduce the candidate vcpus to yield to.  I suspect we
are yielding to far more vcpus than there are preempted lock-holders, and
that IMO is just work accomplishing nothing.  Trying to think of ways to
further reduce the candidate vcpus....


-Andrew


> 
> ----8<----
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fbf1fd0..8551f57 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4820,6 +4820,24 @@ void __sched yield(void)
>  }
>  EXPORT_SYMBOL(yield);
> 
> +/*
> + * Tests preconditions required for sched_class::yield_to().
> + */
> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p,
> +					 struct rq *p_rq)
> +{
> +	if (!curr->sched_class->yield_to_task)
> +		return false;
> +
> +	if (curr->sched_class != p->sched_class)
> +		return false;
> +
> +	if (task_running(p_rq, p) || p->state)
> +		return false;
> +
> +	return true;
> +}
> +
>  /**
>   * yield_to - yield the current processor to another thread in
>   * your thread group, or accelerate that thread toward the
> @@ -4844,20 +4862,24 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> 
>  again:
>  	p_rq = task_rq(p);
> +
> +	/* optimistic test to avoid taking locks */
> +	if (!__yield_to_candidate(curr, p, p_rq))
> +		goto out_irq;
> +
> +	/* if next buddy is set, assume yield is in progress */
> +	if (p_rq->cfs.next)
> +		goto out_irq;
> +
>  	double_rq_lock(rq, p_rq);
>  	while (task_rq(p) != p_rq) {
>  		double_rq_unlock(rq, p_rq);
>  		goto again;
>  	}
> 
> -	if (!curr->sched_class->yield_to_task)
> -		goto out;
> -
> -	if (curr->sched_class != p->sched_class)
> -		goto out;
> -
> -	if (task_running(p_rq, p) || p->state)
> -		goto out;
> +	/* validate state, holding p_rq ensures p's state cannot change */
> +	if (!__yield_to_candidate(curr, p, p_rq))
> +		goto out_unlock;
> 
>  	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>  	if (yielded) {
> @@ -4877,8 +4899,9 @@ again:
>  		rq->skip_clock_update = 0;
>  	}
> 
> -out:
> +out_unlock:
>  	double_rq_unlock(rq, p_rq);
> +out_irq:
>  	local_irq_restore(flags);
> 
>  	if (yielded)


Konrad Rzeszutek Wilk Sept. 14, 2012, 8:34 p.m. UTC | #2
> The concern I have is that even though we have gone through changes to
> help reduce the candidate vcpus we yield to, we still have a very poor
> idea of which vcpu really needs to run.  The result is high cpu usage in
> the get_pid_task and still some contention in the double runqueue lock.
> To make this scalable, we either need to significantly reduce the
> occurrence of the lock-holder preemption, or do a much better job of
> knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
> which do not need to run).

The patches that Raghavendra  has been posting do accomplish that.
>
> On reducing the occurrence:  The worst case for lock-holder preemption
> is having vcpus of same VM on the same runqueue.  This guarantees the
> situation of 1 vcpu running while another [of the same VM] is not.  To
> prove the point, I ran the same test, but with vcpus restricted to a
> range of host cpus, such that any single VM's vcpus can never be on the
> same runqueue.  In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
> vcpu-1's are on host cpus 5-9, and so on.  Here is the result:
>
> kvm_cpu_spin, and all
> yield_to changes, plus
> restricted vcpu placement:  8823 +/- 3.20%   much, much better
>
> On picking a better vcpu to yield to:  I really hesitate to rely on a
> paravirt hint [telling us which vcpu is holding a lock], but I am not
> sure how else to reduce the candidate vcpus to yield to.  I suspect we
> are yielding to far more vcpus than there are preempted lock-holders, and
> that IMO is just work accomplishing nothing.  Trying to think of ways to
> further reduce the candidate vcpus....

... the patches are posted -  you could try them out?
Avi Kivity Sept. 16, 2012, 8:55 a.m. UTC | #3
On 09/14/2012 12:30 AM, Andrew Theurer wrote:

> The concern I have is that even though we have gone through changes to
> help reduce the candidate vcpus we yield to, we still have a very poor
> idea of which vcpu really needs to run.  The result is high cpu usage in
> the get_pid_task and still some contention in the double runqueue lock.
> To make this scalable, we either need to significantly reduce the
> occurrence of the lock-holder preemption, or do a much better job of
> knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
> which do not need to run).
> 
> On reducing the occurrence:  The worst case for lock-holder preemption
> is having vcpus of same VM on the same runqueue.  This guarantees the
> situation of 1 vcpu running while another [of the same VM] is not.  To
> prove the point, I ran the same test, but with vcpus restricted to a
> range of host cpus, such that any single VM's vcpus can never be on the
> same runqueue.  In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
> vcpu-1's are on host cpus 5-9, and so on.  Here is the result:
> 
> kvm_cpu_spin, and all
> yield_to changes, plus
> restricted vcpu placement:  8823 +/- 3.20%   much, much better
> 
> On picking a better vcpu to yield to:  I really hesitate to rely on a
> paravirt hint [telling us which vcpu is holding a lock], but I am not
> sure how else to reduce the candidate vcpus to yield to.  I suspect we
> are yielding to far more vcpus than there are preempted lock-holders, and
> that IMO is just work accomplishing nothing.  Trying to think of ways to
> further reduce the candidate vcpus....

I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing.
That other vcpu gets work done (unless it is in pause loop itself) and
the yielding vcpu gets put to sleep for a while, so it doesn't spend
cycles spinning.  While we haven't fixed the problem at least the guest
is accomplishing work, and meanwhile the real lock holder may get
naturally scheduled and clear the lock.

The main problem with this theory is that the experiments don't seem to
bear it out.  So maybe one of the assumptions is wrong - the yielding
vcpu gets scheduled early.  That could be the case if the two vcpus are
on different runqueues - you could be changing the relative priority of
vcpus on the target runqueue, but still remain on top yourself.  Is this
possible with the current code?

Maybe we should prefer vcpus on the same runqueue as yield_to targets,
and only fall back to remote vcpus when we see it didn't help.
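
(A minimal sketch of that preference, just to make the idea concrete --
vcpu_to_task() is a hypothetical helper standing in for however the vcpu
thread is looked up, and the single-pass structure is only illustrative:)

static struct kvm_vcpu *pick_yield_target(struct kvm_vcpu *me)
{
	struct kvm_vcpu *vcpu, *remote = NULL;
	int i;

	kvm_for_each_vcpu(i, vcpu, me->kvm) {
		struct task_struct *task;

		if (vcpu == me)
			continue;
		task = vcpu_to_task(vcpu);	/* hypothetical: vcpu thread, no ref taken */
		if (!task)
			continue;
		if (task_cpu(task) == smp_processor_id())
			return vcpu;		/* same runqueue as the spinner: prefer it */
		if (!remote)
			remote = vcpu;		/* remember one remote fallback */
	}
	return remote;	/* no local candidate; fall back to a remote vcpu */
}

Whether the fallback should happen immediately or only after the local
attempt visibly failed is exactly the open question above.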

Let's examine a few cases:

1. spinner on cpu 0, lock holder on cpu 0

win!

2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0

Spinner gets put to sleep, random vcpus get to work, low lock contention
(no double_rq_lock), by the time spinner gets scheduled we might have won

3. spinner on cpu 0, another spinner on cpu 0

Worst case, we'll just spin some more.  Need to detect this case and
migrate something in.

4. spinner on cpu 0, alone

Similar


It seems we need to tie in to the load balancer.

Would changing the priority of the task while it is spinning help the
load balancer?
Andrew Jones Sept. 17, 2012, 8:02 a.m. UTC | #4
On Fri, Sep 14, 2012 at 04:34:24PM -0400, Konrad Rzeszutek Wilk wrote:
> > The concern I have is that even though we have gone through changes to
> > help reduce the candidate vcpus we yield to, we still have a very poor
> > idea of which vcpu really needs to run.  The result is high cpu usage in
> > the get_pid_task and still some contention in the double runqueue lock.
> > To make this scalable, we either need to significantly reduce the
> > occurrence of the lock-holder preemption, or do a much better job of
> > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
> > which do not need to run).
> 
> The patches that Raghavendra  has been posting do accomplish that.
> >
> > On reducing the occurrence:  The worst case for lock-holder preemption
> > is having vcpus of same VM on the same runqueue.  This guarantees the
> > situation of 1 vcpu running while another [of the same VM] is not.  To
> > prove the point, I ran the same test, but with vcpus restricted to a
> > range of host cpus, such that any single VM's vcpus can never be on the
> > same runqueue.  In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
> > vcpu-1's are on host cpus 5-9, and so on.  Here is the result:
> >
> > kvm_cpu_spin, and all
> > yield_to changes, plus
> > restricted vcpu placement:  8823 +/- 3.20%   much, much better
> >
> > > On picking a better vcpu to yield to:  I really hesitate to rely on a
> > > paravirt hint [telling us which vcpu is holding a lock], but I am not
> > > sure how else to reduce the candidate vcpus to yield to.  I suspect we
> > > are yielding to far more vcpus than there are preempted lock-holders, and
> > > that IMO is just work accomplishing nothing.  Trying to think of ways to
> > > further reduce the candidate vcpus....
> 
> ... the patches are posted -  you could try them out?

Radim and I have done some testing with the pvticketlock series. While we
saw a gain over PLE alone, it wasn't huge, and without PLE also enabled it
could hardly support 2.0x overcommit. Spinlocks aren't the only place
where cpu_relax() is called within a relatively tight loop, so it's likely
that PLE yielding just generally helps by getting schedule() called more
frequently.
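
(Illustrative only, not a specific guest code path: any wait loop of this
shape will eventually trip the PLE threshold, spinlock or not.)

static void wait_for_flag(int *done)
{
	/* each cpu_relax() issues PAUSE; a long enough burst of them in a
	 * tight loop triggers a pause-loop exit to the host */
	while (!ACCESS_ONCE(*done))
		cpu_relax();
}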

Drew
Andrew Jones Sept. 17, 2012, 8:10 a.m. UTC | #5
On Sun, Sep 16, 2012 at 11:55:28AM +0300, Avi Kivity wrote:
> On 09/14/2012 12:30 AM, Andrew Theurer wrote:
> 
> > The concern I have is that even though we have gone through changes to
> > help reduce the candidate vcpus we yield to, we still have a very poor
> > idea of which vcpu really needs to run.  The result is high cpu usage in
> > the get_pid_task and still some contention in the double runqueue lock.
> > To make this scalable, we either need to significantly reduce the
> > occurrence of the lock-holder preemption, or do a much better job of
> > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
> > which do not need to run).
> > 
> > On reducing the occurrence:  The worst case for lock-holder preemption
> > is having vcpus of same VM on the same runqueue.  This guarantees the
> > situation of 1 vcpu running while another [of the same VM] is not.  To
> > prove the point, I ran the same test, but with vcpus restricted to a
> > range of host cpus, such that any single VM's vcpus can never be on the
> > same runqueue.  In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
> > vcpu-1's are on host cpus 5-9, and so on.  Here is the result:
> > 
> > kvm_cpu_spin, and all
> > yield_to changes, plus
> > restricted vcpu placement:  8823 +/- 3.20%   much, much better
> > 
> > On picking a better vcpu to yield to:  I really hesitate to rely on a
> > paravirt hint [telling us which vcpu is holding a lock], but I am not
> > sure how else to reduce the candidate vcpus to yield to.  I suspect we
> > are yielding to far more vcpus than there are preempted lock-holders, and
> > that IMO is just work accomplishing nothing.  Trying to think of ways to
> > further reduce the candidate vcpus....
> 
> I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing.
> That other vcpu gets work done (unless it is in pause loop itself) and
> the yielding vcpu gets put to sleep for a while, so it doesn't spend
> cycles spinning.  While we haven't fixed the problem at least the guest
> is accomplishing work, and meanwhile the real lock holder may get
> naturally scheduled and clear the lock.
> 
> The main problem with this theory is that the experiments don't seem to
> bear it out.  So maybe one of the assumptions is wrong - the yielding
> vcpu gets scheduled early.  That could be the case if the two vcpus are
> on different runqueues - you could be changing the relative priority of
> vcpus on the target runqueue, but still remain on top yourself.  Is this
> possible with the current code?
> 
> Maybe we should prefer vcpus on the same runqueue as yield_to targets,
> and only fall back to remote vcpus when we see it didn't help.

I thought about this a bit recently too, but didn't pursue it, because I
figured it would actually increase the get_pid_task and double_rq_lock
contention time if we have to hunt too long for a vcpu that matches
stricter criteria. But I guess if we can implement a special "reschedule"
to run on the current cpu which prioritizes runnable/non-running vcpus,
then looking through the runqueue first should be just as fast as, or
faster than, looking through all the vcpus first.

Drew

> 
> Let's examine a few cases:
> 
> 1. spinner on cpu 0, lock holder on cpu 0
> 
> win!
> 
> 2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0
> 
> Spinner gets put to sleep, random vcpus get to work, low lock contention
> (no double_rq_lock), by the time spinner gets scheduled we might have won
> 
> 3. spinner on cpu 0, another spinner on cpu 0
> 
> Worst case, we'll just spin some more.  Need to detect this case and
> migrate something in.
> 
> 4. spinner on cpu 0, alone
> 
> Similar
> 
> 
> It seems we need to tie in to the load balancer.
> 
> Would changing the priority of the task while it is spinning help the
> load balancer?
> 
> -- 
> error compiling committee.c: too many arguments to function
Andrew Theurer Sept. 18, 2012, 3:03 a.m. UTC | #6
On Sun, 2012-09-16 at 11:55 +0300, Avi Kivity wrote:
> On 09/14/2012 12:30 AM, Andrew Theurer wrote:
> 
> > The concern I have is that even though we have gone through changes to
> > help reduce the candidate vcpus we yield to, we still have a very poor
> > idea of which vcpu really needs to run.  The result is high cpu usage in
> > the get_pid_task and still some contention in the double runqueue lock.
> > To make this scalable, we either need to significantly reduce the
> > occurrence of the lock-holder preemption, or do a much better job of
> > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
> > which do not need to run).
> > 
> > On reducing the occurrence:  The worst case for lock-holder preemption
> > is having vcpus of same VM on the same runqueue.  This guarantees the
> > situation of 1 vcpu running while another [of the same VM] is not.  To
> > prove the point, I ran the same test, but with vcpus restricted to a
> > range of host cpus, such that any single VM's vcpus can never be on the
> > same runqueue.  In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
> > vcpu-1's are on host cpus 5-9, and so on.  Here is the result:
> > 
> > kvm_cpu_spin, and all
> > yield_to changes, plus
> > restricted vcpu placement:  8823 +/- 3.20%   much, much better
> > 
> > On picking a better vcpu to yield to:  I really hesitate to rely on a
> > paravirt hint [telling us which vcpu is holding a lock], but I am not
> > sure how else to reduce the candidate vcpus to yield to.  I suspect we
> > are yielding to far more vcpus than there are preempted lock-holders, and
> > that IMO is just work accomplishing nothing.  Trying to think of ways to
> > further reduce the candidate vcpus....
> 
> I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing.
> That other vcpu gets work done (unless it is in pause loop itself) and
> the yielding vcpu gets put to sleep for a while, so it doesn't spend
> cycles spinning.  While we haven't fixed the problem at least the guest
> is accomplishing work, and meanwhile the real lock holder may get
> naturally scheduled and clear the lock.

OK, yes, if the other thread gets useful work done, then it is not
wasteful.  I was thinking of the worst case scenario, where any other
vcpu would likely spin as well, and the host side cpu-time for switching
vcpu threads was not all that productive.  Well, I suppose it does help
eliminate potential lock holding vcpus; it just seems to be not that
efficient or fast enough.

> The main problem with this theory is that the experiments don't seem to
> bear it out.

Granted, my test case is quite brutal.  It's nothing but over-committed
VMs which always have some spin lock activity.  However, we really
should try to fix the worst case scenario.

>   So maybe one of the assumptions is wrong - the yielding
> vcpu gets scheduled early.  That could be the case if the two vcpus are
> on different runqueues - you could be changing the relative priority of
> vcpus on the target runqueue, but still remain on top yourself.  Is this
> possible with the current code?
> 
> Maybe we should prefer vcpus on the same runqueue as yield_to targets,
> and only fall back to remote vcpus when we see it didn't help.
> 
> Let's examine a few cases:
> 
> 1. spinner on cpu 0, lock holder on cpu 0
> 
> win!
> 
> 2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0
> 
> Spinner gets put to sleep, random vcpus get to work, low lock contention
> (no double_rq_lock), by the time spinner gets scheduled we might have won
> 
> 3. spinner on cpu 0, another spinner on cpu 0
> 
> Worst case, we'll just spin some more.  Need to detect this case and
> migrate something in.

Well, we can certainly experiment and see what we get.

IMO, the key to getting this working really well on the large VMs is
finding the lock-holding cpu -quickly-.  What I think is happening is
that we go through a relatively long process to get to that one right
vcpu.  I guess I need to find a faster way to get there.

> 4. spinner on cpu 0, alone
> 
> Similar
> 
> 
> It seems we need to tie in to the load balancer.
> 
> Would changing the priority of the task while it is spinning help the
> load balancer?

Not sure.

-Andrew






Avi Kivity Sept. 19, 2012, 1:39 p.m. UTC | #7
On 09/18/2012 06:03 AM, Andrew Theurer wrote:
> On Sun, 2012-09-16 at 11:55 +0300, Avi Kivity wrote:
>> On 09/14/2012 12:30 AM, Andrew Theurer wrote:
>> 
>> > The concern I have is that even though we have gone through changes to
>> > help reduce the candidate vcpus we yield to, we still have a very poor
>> > idea of which vcpu really needs to run.  The result is high cpu usage in
>> > the get_pid_task and still some contention in the double runqueue lock.
>> > To make this scalable, we either need to significantly reduce the
>> > occurrence of the lock-holder preemption, or do a much better job of
>> > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
>> > which do not need to run).
>> > 
>> > On reducing the occurrence:  The worst case for lock-holder preemption
>> > is having vcpus of same VM on the same runqueue.  This guarantees the
>> > situation of 1 vcpu running while another [of the same VM] is not.  To
>> > prove the point, I ran the same test, but with vcpus restricted to a
>> > range of host cpus, such that any single VM's vcpus can never be on the
>> > same runqueue.  In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
>> > vcpu-1's are on host cpus 5-9, and so on.  Here is the result:
>> > 
>> > kvm_cpu_spin, and all
>> > yield_to changes, plus
>> > restricted vcpu placement:  8823 +/- 3.20%   much, much better
>> > 
>> > On picking a better vcpu to yield to:  I really hesitate to rely on a
>> > paravirt hint [telling us which vcpu is holding a lock], but I am not
>> > sure how else to reduce the candidate vcpus to yield to.  I suspect we
>> > are yielding to far more vcpus than there are preempted lock-holders, and
>> > that IMO is just work accomplishing nothing.  Trying to think of ways to
>> > further reduce the candidate vcpus....
>> 
>> I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing.
>> That other vcpu gets work done (unless it is in pause loop itself) and
>> the yielding vcpu gets put to sleep for a while, so it doesn't spend
>> cycles spinning.  While we haven't fixed the problem at least the guest
>> is accomplishing work, and meanwhile the real lock holder may get
>> naturally scheduled and clear the lock.
> 
> OK, yes, if the other thread gets useful work done, then it is not
> wasteful.  I was thinking of the worst case scenario, where any other
> vcpu would likely spin as well, and the host side cpu-time for switching
> vcpu threads was not all that productive.  Well, I suppose it does help
> eliminate potential lock holding vcpus; it just seems to be not that
> efficient or fast enough.

If we have N-1 vcpus spin-waiting on 1 vcpu, with N:1 overcommit, then
yes, we must iterate over N-1 vcpus until we find Mr. Right.  Eventually
its not-a-timeslice will expire and we go through this again.  If
N*t_yield is comparable to the timeslice, we start losing efficiency.
Because of lock contention, t_yield can scale with the number of host
cpus.  So in this worst case, we get quadratic behaviour.
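
(To put rough, assumed numbers on it: with N ~= 30 spin-waiting vcpus and
t_yield on the order of 10 us per failed attempt -- more once the double
runqueue lock is contended -- a single scan costs N*t_yield ~= 300 us,
already a sizeable fraction of a few-millisecond slice; with every spinner
doing the same scan, the total work grows roughly as N^2 * t_yield.)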

One way out is to increase the not-a-timeslice.  Can we get spinning
vcpus to do that for running vcpus, if they cannot find a
runnable-but-not-running vcpu?

That's not guaranteed to help: if we boost a running vcpu too much, it
will skew how vcpu runtime is distributed even after the lock is released.

> 
>> The main problem with this theory is that the experiments don't seem to
>> bear it out.
> 
> Granted, my test case is quite brutal.  It's nothing but over-committed
> VMs which always have some spin lock activity.  However, we really
> should try to fix the worst case scenario.

Yes.  And other guests may not scale as well as Linux, so they may show
this behaviour more often.

> 
>>   So maybe one of the assumptions is wrong - the yielding
>> vcpu gets scheduled early.  That could be the case if the two vcpus are
>> on different runqueues - you could be changing the relative priority of
>> vcpus on the target runqueue, but still remain on top yourself.  Is this
>> possible with the current code?
>> 
>> Maybe we should prefer vcpus on the same runqueue as yield_to targets,
>> and only fall back to remote vcpus when we see it didn't help.
>> 
>> Let's examine a few cases:
>> 
>> 1. spinner on cpu 0, lock holder on cpu 0
>> 
>> win!
>> 
>> 2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0
>> 
>> Spinner gets put to sleep, random vcpus get to work, low lock contention
>> (no double_rq_lock), by the time spinner gets scheduled we might have won
>> 
>> 3. spinner on cpu 0, another spinner on cpu 0
>> 
>> Worst case, we'll just spin some more.  Need to detect this case and
>> migrate something in.
> 
> Well, we can certainly experiment and see what we get.
> 
> IMO, the key to getting this working really well on the large VMs is
> finding the lock-holding cpu -quickly-.  What I think is happening is
> that we go through a relatively long process to get to that one right
> vcpu.  I guess I need to find a faster way to get there.

pvspinlocks will find the right one, every time.  Otherwise I see no way
to do this.
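
(Why pvspinlocks can name the right vcpu, in sketch form -- all the names
below are made up for illustration; see the pvticketlock series
Raghavendra has been posting for the real interface. The point is that a
ticket lock's head field already identifies exactly which waiter must run
next, so the release path can kick precisely that vcpu instead of the host
guessing.)

struct pv_ticket_lock {
	unsigned short head;	/* ticket currently holding the lock */
	unsigned short tail;	/* next ticket to hand out */
};

static void pv_ticket_lock_acquire(struct pv_ticket_lock *lk)
{
	unsigned short mine = __atomic_fetch_add(&lk->tail, 1, __ATOMIC_ACQUIRE);
	unsigned int spins = 0;

	while (__atomic_load_n(&lk->head, __ATOMIC_ACQUIRE) != mine) {
		cpu_relax();
		if (++spins > SPIN_THRESHOLD)	/* placeholder threshold */
			/* record (lk, mine) and halt; whoever releases the
			 * lock knows ticket 'mine' is next and kicks us */
			pv_wait_on_ticket(lk, mine);	/* made-up hypercall wrapper */
	}
}

static void pv_ticket_lock_release(struct pv_ticket_lock *lk)
{
	unsigned short next = __atomic_add_fetch(&lk->head, 1, __ATOMIC_RELEASE);

	/* no guessing: kick exactly the vcpu waiting on ticket 'next' */
	pv_kick_ticket_waiter(lk, next);	/* made-up hypercall wrapper */
}

The directed-yield path then only has to map one ticket to a vcpu, rather
than scanning all of them.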

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..8551f57 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4820,6 +4820,24 @@  void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p,
+					 struct rq *p_rq)
+{
+	if (!curr->sched_class->yield_to_task)
+		return false;
+
+	if (curr->sched_class != p->sched_class)
+		return false;
+
+	if (task_running(p_rq, p) || p->state)
+		return false;
+
+	return true;
+}
+
 /**
  * yield_to - yield the current processor to another thread in
  * your thread group, or accelerate that thread toward the
@@ -4844,20 +4862,24 @@  bool __sched yield_to(struct task_struct *p, bool preempt)
 
 again:
 	p_rq = task_rq(p);
+
+	/* optimistic test to avoid taking locks */
+	if (!__yield_to_candidate(curr, p, p_rq))
+		goto out_irq;
+
+	/* if next buddy is set, assume yield is in progress */
+	if (p_rq->cfs.next)
+		goto out_irq;
+
 	double_rq_lock(rq, p_rq);
 	while (task_rq(p) != p_rq) {
 		double_rq_unlock(rq, p_rq);
 		goto again;
 	}
 
-	if (!curr->sched_class->yield_to_task)
-		goto out;
-
-	if (curr->sched_class != p->sched_class)
-		goto out;
-
-	if (task_running(p_rq, p) || p->state)
-		goto out;
+	/* validate state, holding p_rq ensures p's state cannot change */
+	if (!__yield_to_candidate(curr, p, p_rq))
+		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4877,8 +4899,9 @@  again:
 		rq->skip_clock_update = 0;
 	}
 
-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);
 
 	if (yielded)