
[RFC,0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

Message ID 1348490165.11847.58.camel@twins (mailing list archive)
State New, archived

Commit Message

Peter Zijlstra Sept. 24, 2012, 12:36 p.m. UTC
On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote:
> On 09/24/2012 05:04 PM, Peter Zijlstra wrote:
> > On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:
> >> In some special scenarios, like #vcpu <= #pcpu, the PLE handler may
> >> prove very costly, because there is no need to iterate over vcpus
> >> and do unsuccessful yield_to()s, burning CPU.
> >
> > What's the costly thing? The vm-exit, the yield (which should be a nop
> > if it's the only task there) or something else entirely?
> >
> Both vmexit and yield_to() actually,
> 
> because unsuccessful yield_to() overall is costly in PLE handler.
> 
> This is because when we have large guests, say 32/16 vcpus, and one
> vcpu is holding a lock with the rest of the vcpus waiting for it: when
> they do a PLE-exit, each vcpu tries to iterate over the rest of the
> vcpu list in the VM and do a directed yield (unsuccessfully), i.e.
> O(n^2) tries.
> 
> This results in a fairly high amount of CPU burning and double runqueue
> lock contention.
> 
> (if they were spinning, probably lock progress would have been faster).
> As Avi/Chegu Vinod felt, it is better to avoid the vmexit itself, which
> seems a little complex to achieve currently.

OK, so the vmexit stays and we need to improve yield_to.

How about something like the below, that would allow breaking out of the
for-each-vcpu loop and simply going back into the vm, right?

---
 kernel/sched/core.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Raghavendra K T Sept. 24, 2012, 1:29 p.m. UTC | #1
On 09/24/2012 06:06 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote:
>> On 09/24/2012 05:04 PM, Peter Zijlstra wrote:
>>> On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:
>>>> In some special scenarios like #vcpu<= #pcpu, PLE handler may
>>>> prove very costly, because there is no need to iterate over vcpus
>>>> and do unsuccessful yield_to burning CPU.
>>>
>>> What's the costly thing? The vm-exit, the yield (which should be a nop
>>> if its the only task there) or something else entirely?
>>>
>> Both vmexit and yield_to() actually,
>>
>> because unsuccessful yield_to() overall is costly in PLE handler.
>>
>> This is because when we have large guests, say 32/16 vcpus, and one
>> vcpu is holding lock, rest of the vcpus waiting for the lock, when they
>> do PL-exit, each of the vcpu try to iterate over rest of vcpu list in
>> the VM and try to do directed yield (unsuccessful). (O(n^2) tries).
>>
>> this results is fairly high amount of cpu burning and double run queue
>> lock contention.
>>
>> (if they were spinning probably lock progress would have been faster).
>> As Avi/Chegu Vinod had felt it is better to avoid vmexit itself, which
>> seems little complex to achieve currently.
>
> OK, so the vmexit stays and we need to improve yield_to.
>
> How about something like the below, that would allow breaking out of the
> for-each-vcpu loop and simply going back into the vm, right?
>
> ---
>   kernel/sched/core.c | 25 +++++++++++++++++++------
>   1 file changed, 19 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b38f00e..5d5b355 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4272,7 +4272,10 @@ EXPORT_SYMBOL(yield);
>    * It's the caller's job to ensure that the target task struct
>    * can't go away on us before we can do any checks.
>    *
> - * Returns true if we indeed boosted the target task.
> + * Returns:
> + *   true (>0) if we indeed boosted the target task.
> + *   false (0) if we failed to boost the target.
> + *   -ESRCH if there's no task to yield to.
>    */
>   bool __sched yield_to(struct task_struct *p, bool preempt)
>   {
> @@ -4284,6 +4287,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>   	local_irq_save(flags);
>   	rq = this_rq();
>
> +	/*
> +	 * If we're the only runnable task on the rq, there's absolutely no
> +	 * point in yielding.
> +	 */
> +	if (rq->nr_running == 1) {
> +		yielded = -ESRCH;
> +		goto out_irq;
> +	}
> +
>   again:
>   	p_rq = task_rq(p);
>   	double_rq_lock(rq, p_rq);
> @@ -4293,13 +4305,13 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>   	}
>
>   	if (!curr->sched_class->yield_to_task)
> -		goto out;
> +		goto out_unlock;
>
>   	if (curr->sched_class != p->sched_class)
> -		goto out;
> +		goto out_unlock;
>
>   	if (task_running(p_rq, p) || p->state)
> -		goto out;
> +		goto out_unlock;
>
>   	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>   	if (yielded) {
> @@ -4312,11 +4324,12 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>   			resched_task(p_rq->curr);
>   	}
>
> -out:
> +out_unlock:
>   	double_rq_unlock(rq, p_rq);
> +out_irq:
>   	local_irq_restore(flags);
>
> -	if (yielded)
> +	if (yielded > 0)
>   		schedule();
>
>   	return yielded;
>
>

Yes, I think this is a nice idea. Any future users of yield_to
would also benefit from this. We will have to iterate only until the
first yield_to attempt.

I'll run the test with this patch.

However, Rik had a genuine concern about the cases where the runqueue is
not equally distributed and the lockholder might actually be on a
different runqueue but not running.

Do you think that instead of using rq->nr_running, we could get a global
sense of load using avenrun (something like avenrun/num_onlinecpus)?
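For illustration, the way the PLE handler's candidate scan could consume the new three-valued yield_to() return can be sketched in plain user-space C. The scripted outcome array and helper name below are illustrative stand-ins, not the real KVM code:

```c
#include <assert.h>
#include <errno.h>

/*
 * Mock of the candidate-scan loop in a PLE handler: each entry in
 * "results" scripts what yield_to() would have returned for that vcpu
 * (1 = boosted it, 0 = not eligible, -ESRCH = we are the only runnable
 * task on our rq). Returns how many yield_to() attempts were made.
 */
static int ple_handler_scan(const int *results, int nvcpus)
{
    int attempts = 0;

    for (int i = 0; i < nvcpus; i++) {
        int yielded = results[i];

        attempts++;
        if (yielded > 0)        /* boosted someone: stop scanning */
            break;
        if (yielded == -ESRCH)  /* nothing to yield to at all: bail out */
            break;              /* ...and just re-enter the guest */
        /* yielded == 0: this target wasn't eligible, try the next vcpu */
    }
    return attempts;
}
```

With -ESRCH, the undercommitted case costs a single attempt instead of a full scan of the vcpu list per exiting vcpu.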

Peter Zijlstra Sept. 24, 2012, 1:54 p.m. UTC | #2
On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
> However Rik had a genuine concern in the cases where runqueue is not
> equally distributed and lockholder might actually be on a different run 
> queue but not running.

Load should eventually get distributed equally -- that's what the
load-balancer is for -- so this is a temporary situation.

We already try and favour the non-running vcpu in this case; that's what
yield_to_task_fair() is about. If it's still not eligible to run, tough
luck.

> Do you think instead of using rq->nr_running, we could get a global 
> sense of load using avenrun (something like avenrun/num_onlinecpus) 

To what purpose? Also, global stuff is expensive, so you should try and
stay away from it as hard as you possibly can.
Raghavendra K T Sept. 24, 2012, 2:16 p.m. UTC | #3
On 09/24/2012 07:24 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
>> However Rik had a genuine concern in the cases where runqueue is not
>> equally distributed and lockholder might actually be on a different run
>> queue but not running.
>
> Load should eventually get distributed equally -- that's what the
> load-balancer is for -- so this is a temporary situation.
>
> We already try and favour the non running vcpu in this case, that's what
> yield_to_task_fair() is about. If its still not eligible to run, tough
> luck.

Yes, I agree.

>
>> Do you think instead of using rq->nr_running, we could get a global
>> sense of load using avenrun (something like avenrun/num_onlinecpus)
>
> To what purpose? Also, global stuff is expensive, so you should try and
> stay away from it as hard as you possibly can.

Yes, that very concern made me fall back to rq->nr_running.

Will come back with the result soon.

Avi Kivity Sept. 24, 2012, 3:51 p.m. UTC | #4
On 09/24/2012 03:54 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
>> However Rik had a genuine concern in the cases where runqueue is not
>> equally distributed and lockholder might actually be on a different run 
>> queue but not running.
> 
> Load should eventually get distributed equally -- that's what the
> load-balancer is for -- so this is a temporary situation.

What's the expected latency?  This is the whole problem.  Eventually the
scheduler would pick the lock holder as well, the problem is that it's
in the millisecond scale while lock hold times are in the microsecond
scale, leading to a 1000x slowdown.

If we want to yield, we really want to boost someone.

> We already try and favour the non running vcpu in this case, that's what
> yield_to_task_fair() is about. If its still not eligible to run, tough
> luck.

Crazy idea: instead of yielding, just run that other vcpu in the thread
that would otherwise spin.  I can see about a million objections to this
already though.

>> Do you think instead of using rq->nr_running, we could get a global 
>> sense of load using avenrun (something like avenrun/num_onlinecpus) 
> 
> To what purpose? Also, global stuff is expensive, so you should try and
> stay away from it as hard as you possibly can.

Spinning is also expensive.  How about we do the global stuff every N
times, to amortize the cost (and reduce contention)?
Peter Zijlstra Sept. 24, 2012, 4:03 p.m. UTC | #5
On Mon, 2012-09-24 at 17:51 +0200, Avi Kivity wrote:
> On 09/24/2012 03:54 PM, Peter Zijlstra wrote:
> > On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
> >> However Rik had a genuine concern in the cases where runqueue is not
> >> equally distributed and lockholder might actually be on a different run 
> >> queue but not running.
> > 
> > Load should eventually get distributed equally -- that's what the
> > load-balancer is for -- so this is a temporary situation.
> 
> What's the expected latency?  This is the whole problem.  Eventually the
> scheduler would pick the lock holder as well, the problem is that it's
> in the millisecond scale while lock hold times are in the microsecond
> scale, leading to a 1000x slowdown.

Yeah I know.. Heisenberg's uncertainty applied to SMP computing becomes
something like accurate or fast, never both.

> If we want to yield, we really want to boost someone.

Now if only you knew which someone ;-) This non-modified guest nonsense
is such a snake pit.. but you know how I feel about all that.

> > We already try and favour the non running vcpu in this case, that's what
> > yield_to_task_fair() is about. If its still not eligible to run, tough
> > luck.
> 
> Crazy idea: instead of yielding, just run that other vcpu in the thread
> that would otherwise spin.  I can see about a million objections to this
> already though.

Yah.. you want me to list a few? :-) It would require synchronization
with the other cpu to pull its task -- one really wants to avoid it also
running it.

Do this at a high enough frequency and you're dead too.

Anyway, you can do this inside the KVM stuff, simply flip the vcpu state
associated with a vcpu thread and use the preemption notifiers to sort
things against the scheduler or somesuch.

> >> Do you think instead of using rq->nr_running, we could get a global 
> >> sense of load using avenrun (something like avenrun/num_onlinecpus) 
> > 
> > To what purpose? Also, global stuff is expensive, so you should try and
> > stay away from it as hard as you possibly can.
> 
> Spinning is also expensive.  How about we do the global stuff every N
> times, to amortize the cost (and reduce contention)?

Nah, spinning isn't expensive, it's a waste of time; similar end result
for someone who wants to do useful work though, but not the same cause.

Pick N and I'll come up with a scenario for which it's wrong ;-)

Anyway, its an ugly problem and one I really want to contain inside the
insanity that created it (virt), lets not taint the rest of the kernel
more than we need to. 
Avi Kivity Sept. 24, 2012, 4:20 p.m. UTC | #6
On 09/24/2012 06:03 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:51 +0200, Avi Kivity wrote:
>> On 09/24/2012 03:54 PM, Peter Zijlstra wrote:
>> > On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
>> >> However Rik had a genuine concern in the cases where runqueue is not
>> >> equally distributed and lockholder might actually be on a different run 
>> >> queue but not running.
>> > 
>> > Load should eventually get distributed equally -- that's what the
>> > load-balancer is for -- so this is a temporary situation.
>> 
>> What's the expected latency?  This is the whole problem.  Eventually the
>> scheduler would pick the lock holder as well, the problem is that it's
>> in the millisecond scale while lock hold times are in the microsecond
>> scale, leading to a 1000x slowdown.
> 
> Yeah I know.. Heisenberg's uncertainty applied to SMP computing becomes
> something like accurate or fast, never both.
> 
>> If we want to yield, we really want to boost someone.
> 
> Now if only you knew which someone ;-) This non-modified guest nonsense
> is such a snake pit.. but you know how I feel about all that.

Actually, if in addition to boosting someone I could also unboost
myself enough to be preempted, it wouldn't matter.  While boosting the
lock holder is good, the main point is not spinning and doing useful
work instead.  We can detect spinners and avoid boosting them.

That's the motivation for the "donate vruntime" approach I wanted earlier.

> 
>> > We already try and favour the non running vcpu in this case, that's what
>> > yield_to_task_fair() is about. If its still not eligible to run, tough
>> > luck.
>> 
>> Crazy idea: instead of yielding, just run that other vcpu in the thread
>> that would otherwise spin.  I can see about a million objections to this
>> already though.
> 
> Yah.. you want me to list a few? :-) It would require synchronization
> with the other cpu to pull its task -- one really wants to avoid it also
> running it.

Yeah, it's quite a horrible idea.

> 
> Do this at a high enough frequency and you're dead too.
> 
> Anyway, you can do this inside the KVM stuff, simply flip the vcpu state
> associated with a vcpu thread and use the preemption notifiers to sort
> things against the scheduler or somesuch.

That's what I thought when I wrote this, but I can't: I might be
preempted in random kvm code.  So my state includes the host stack and
registers.  Maybe we can special-case when we interrupt guest mode.

> 
>> >> Do you think instead of using rq->nr_running, we could get a global 
>> >> sense of load using avenrun (something like avenrun/num_onlinecpus) 
>> > 
>> > To what purpose? Also, global stuff is expensive, so you should try and
>> > stay away from it as hard as you possibly can.
>> 
>> Spinning is also expensive.  How about we do the global stuff every N
>> times, to amortize the cost (and reduce contention)?
> 
> Nah, spinning isn't expensive, its a waste of time, similar end result
> for someone who wants to do useful work though, but not the same cause.
> 
> Pick N and I'll come up with a scenario for which its wrong ;-)

Sure.  But if it's rare enough, then that's okay for us.

> Anyway, its an ugly problem and one I really want to contain inside the
> insanity that created it (virt), lets not taint the rest of the kernel
> more than we need to. 

Agreed.  Though given that postgres and others use userspace spinlocks,
maybe it's not just virt.
Raghavendra K T Sept. 25, 2012, 1:40 p.m. UTC | #7
On 09/24/2012 07:46 PM, Raghavendra K T wrote:
> On 09/24/2012 07:24 PM, Peter Zijlstra wrote:
>> On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
>>> However Rik had a genuine concern in the cases where runqueue is not
>>> equally distributed and lockholder might actually be on a different run
>>> queue but not running.
>>
>> Load should eventually get distributed equally -- that's what the
>> load-balancer is for -- so this is a temporary situation.
>>
>> We already try and favour the non running vcpu in this case, that's what
>> yield_to_task_fair() is about. If its still not eligible to run, tough
>> luck.
>
> Yes, I agree.
>
>>
>>> Do you think instead of using rq->nr_running, we could get a global
>>> sense of load using avenrun (something like avenrun/num_onlinecpus)
>>
>> To what purpose? Also, global stuff is expensive, so you should try and
>> stay away from it as hard as you possibly can.
>
> Yes, that concern only had made me to fall back to rq->nr_running.
>
> Will come back with the result soon.

Got the results with the patches; here they are.

Tried this on a 32-core PLE box with HT disabled: 32 guest vcpus, with
1x and 2x overcommit.

Base = 3.6.0-rc5 + ple handler optimization patches
A = Base + checking rq_running in vcpu_on_spin() patch
B = Base + checking rq->nr_running in sched/core
C = Base - PLE

---+-----------+-----------+-----------+-----------+
   |    Ebizzy result (rec/sec, higher is better)  |
---+-----------+-----------+-----------+-----------+
   |    Base   |     A     |     B     |     C     |
---+-----------+-----------+-----------+-----------+
1x | 2374.1250 | 7273.7500 | 5690.8750 | 7364.3750 |
2x | 2536.2500 | 2458.5000 | 2426.3750 |   48.5000 |
---+-----------+-----------+-----------+-----------+

   % improvement w.r.t. Base
---+------------+------------+------------+
   |     A      |     B      |     C      |
---+------------+------------+------------+
1x | 206.37603  | 139.70410  | 210.19323  |
2x |  -3.06555  |  -4.33218  | -98.08773  |
---+------------+------------+------------+

We are getting almost the benefit of the PLE-disabled case with this
approach. With patch B we have dropped a bit in gain (because we still
iterate over vcpus until we decide to do a directed yield).





Andrew Jones Sept. 26, 2012, 12:57 p.m. UTC | #8
On Mon, Sep 24, 2012 at 02:36:05PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote:
> > On 09/24/2012 05:04 PM, Peter Zijlstra wrote:
> > > On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:
> > >> In some special scenarios like #vcpu<= #pcpu, PLE handler may
> > >> prove very costly, because there is no need to iterate over vcpus
> > >> and do unsuccessful yield_to burning CPU.
> > >
> > > What's the costly thing? The vm-exit, the yield (which should be a nop
> > > if its the only task there) or something else entirely?
> > >
> > Both vmexit and yield_to() actually,
> > 
> > because unsuccessful yield_to() overall is costly in PLE handler.
> > 
> > This is because when we have large guests, say 32/16 vcpus, and one
> > vcpu is holding lock, rest of the vcpus waiting for the lock, when they
> > do PL-exit, each of the vcpu try to iterate over rest of vcpu list in
> > the VM and try to do directed yield (unsuccessful). (O(n^2) tries).
> > 
> > this results is fairly high amount of cpu burning and double run queue
> > lock contention.
> > 
> > (if they were spinning probably lock progress would have been faster).
> > As Avi/Chegu Vinod had felt it is better to avoid vmexit itself, which
> > seems little complex to achieve currently.
> 
> OK, so the vmexit stays and we need to improve yield_to.

Can't we do this check sooner as well, as it only requires per-cpu data?
If we do it way back in kvm_vcpu_on_spin, then we avoid get_pid_task()
and a bunch of read barriers from kvm_for_each_vcpu. Also, moving the test
into kvm code would allow us to do other kvm things as a result of the
check in order to avoid some vmexits. It looks like we should be able to
avoid some without much complexity by just making a per-vm ple_window
variable, and then, when we hit the nr_running == 1 condition, also doing
vmcs_write32(PLE_WINDOW, (kvm->ple_window += PLE_WINDOW_BUMP)).
Then reset the window to the default value when we successfully yield
(and maybe we should limit the number of bumps).
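That bookkeeping can be sketched in user-space C. The constant names and values below are assumptions for illustration (the real PLE window handling lives in the VMX code, and KVM's defaults may differ):

```c
#include <assert.h>

/* Illustrative constants; these values are assumptions, not KVM's. */
#define PLE_WINDOW_DEFAULT 4096u
#define PLE_WINDOW_BUMP    4096u
#define PLE_WINDOW_MAX     (16u * PLE_WINDOW_DEFAULT)

struct vm {
    unsigned int ple_window;   /* per-vm copy, mirrored into the VMCS */
};

/* The PLE handler found nr_running == 1: nobody to yield to, so grow
 * the window and let the guest spin longer before the next costly
 * vmexit. */
static unsigned int ple_window_bump(struct vm *vm)
{
    vm->ple_window += PLE_WINDOW_BUMP;
    if (vm->ple_window > PLE_WINDOW_MAX)   /* limit the number of bumps */
        vm->ple_window = PLE_WINDOW_MAX;
    return vm->ple_window;  /* caller would vmcs_write32(PLE_WINDOW, ...) */
}

/* A directed yield succeeded: contention is real, so fall back to the
 * default, more trigger-happy window. */
static unsigned int ple_window_reset(struct vm *vm)
{
    vm->ple_window = PLE_WINDOW_DEFAULT;
    return vm->ple_window;
}
```

The cap keeps a long-uncontended guest from growing the window without bound, which would delay detection once contention returns.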

Drew

> 
> How about something like the below, that would allow breaking out of the
> for-each-vcpu loop and simply going back into the vm, right?
> 
> ---
>  kernel/sched/core.c | 25 +++++++++++++++++++------
>  1 file changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b38f00e..5d5b355 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4272,7 +4272,10 @@ EXPORT_SYMBOL(yield);
>   * It's the caller's job to ensure that the target task struct
>   * can't go away on us before we can do any checks.
>   *
> - * Returns true if we indeed boosted the target task.
> + * Returns:
> + *   true (>0) if we indeed boosted the target task.
> + *   false (0) if we failed to boost the target.
> + *   -ESRCH if there's no task to yield to.
>   */
>  bool __sched yield_to(struct task_struct *p, bool preempt)
>  {
> @@ -4284,6 +4287,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>  	local_irq_save(flags);
>  	rq = this_rq();
>  
> +	/*
> +	 * If we're the only runnable task on the rq, there's absolutely no
> +	 * point in yielding.
> +	 */
> +	if (rq->nr_running == 1) {
> +		yielded = -ESRCH;
> +		goto out_irq;
> +	}
> +
>  again:
>  	p_rq = task_rq(p);
>  	double_rq_lock(rq, p_rq);
> @@ -4293,13 +4305,13 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>  	}
>  
>  	if (!curr->sched_class->yield_to_task)
> -		goto out;
> +		goto out_unlock;
>  
>  	if (curr->sched_class != p->sched_class)
> -		goto out;
> +		goto out_unlock;
>  
>  	if (task_running(p_rq, p) || p->state)
> -		goto out;
> +		goto out_unlock;
>  
>  	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>  	if (yielded) {
> @@ -4312,11 +4324,12 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>  			resched_task(p_rq->curr);
>  	}
>  
> -out:
> +out_unlock:
>  	double_rq_unlock(rq, p_rq);
> +out_irq:
>  	local_irq_restore(flags);
>  
> -	if (yielded)
> +	if (yielded > 0)
>  		schedule();
>  
>  	return yielded;
> 
Andrew Jones Sept. 26, 2012, 1:20 p.m. UTC | #9
On Mon, Sep 24, 2012 at 06:20:12PM +0200, Avi Kivity wrote:
> On 09/24/2012 06:03 PM, Peter Zijlstra wrote:
> > On Mon, 2012-09-24 at 17:51 +0200, Avi Kivity wrote:
> >> On 09/24/2012 03:54 PM, Peter Zijlstra wrote:
> >> > On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
> >> >> However Rik had a genuine concern in the cases where runqueue is not
> >> >> equally distributed and lockholder might actually be on a different run 
> >> >> queue but not running.
> >> > 
> >> > Load should eventually get distributed equally -- that's what the
> >> > load-balancer is for -- so this is a temporary situation.
> >> 
> >> What's the expected latency?  This is the whole problem.  Eventually the
> >> scheduler would pick the lock holder as well, the problem is that it's
> >> in the millisecond scale while lock hold times are in the microsecond
> >> scale, leading to a 1000x slowdown.
> > 
> > Yeah I know.. Heisenberg's uncertainty applied to SMP computing becomes
> > something like accurate or fast, never both.
> > 
> >> If we want to yield, we really want to boost someone.
> > 
> > Now if only you knew which someone ;-) This non-modified guest nonsense
> > is such a snake pit.. but you know how I feel about all that.
> 
> Actually if I knew that in addition to boosting someone, I also unboost
> myself enough to be preempted, it wouldn't matter.  While boosting the
> lock holder is good, the main point is not spinning and doing useful
> work instead.  We can detect spinners and avoid boosting them.
> 
> That's the motivation for the "donate vruntime" approach I wanted earlier.

I'll probably get shot for the suggestion, but doesn't this problem merit
another scheduler class? We want FIFO order for a special class of tasks,
"spinners". Wouldn't a clean solution be to promote a task's scheduler
class to the spinner class when we PLE (or come from some special syscall
for userspace spinlocks)? That class would be higher priority than the
fair class and would schedule in FIFO order, but it would only run its
tasks for short periods before switching. Also, after each task is run
its scheduler class would get reset down to its original class (fair).
At least at first thought this looks to me to be cleaner than the next
and skip hinting, plus it helps guarantee that the lock holder gets
scheduled before the tasks waiting on that lock.

Drew

> 
> > 
> >> > We already try and favour the non running vcpu in this case, that's what
> >> > yield_to_task_fair() is about. If its still not eligible to run, tough
> >> > luck.
> >> 
> >> Crazy idea: instead of yielding, just run that other vcpu in the thread
> >> that would otherwise spin.  I can see about a million objections to this
> >> already though.
> > 
> > Yah.. you want me to list a few? :-) It would require synchronization
> > with the other cpu to pull its task -- one really wants to avoid it also
> > running it.
> 
> Yeah, it's quite a horrible idea.
> 
> > 
> > Do this at a high enough frequency and you're dead too.
> > 
> > Anyway, you can do this inside the KVM stuff, simply flip the vcpu state
> > associated with a vcpu thread and use the preemption notifiers to sort
> > things against the scheduler or somesuch.
> 
> That's what I thought when I wrote this, but I can't, I might be
> preempted in random kvm code.  So my state includes the host stack and
> registers.  Maybe we can special-case when we interrupt guest mode.
> 
> > 
> >> >> Do you think instead of using rq->nr_running, we could get a global 
> >> >> sense of load using avenrun (something like avenrun/num_onlinecpus) 
> >> > 
> >> > To what purpose? Also, global stuff is expensive, so you should try and
> >> > stay away from it as hard as you possibly can.
> >> 
> >> Spinning is also expensive.  How about we do the global stuff every N
> >> times, to amortize the cost (and reduce contention)?
> > 
> > Nah, spinning isn't expensive, its a waste of time, similar end result
> > for someone who wants to do useful work though, but not the same cause.
> > 
> > Pick N and I'll come up with a scenario for which its wrong ;-)
> 
> Sure.  But if it's rare enough, then that's okay for us.
> 
> > Anyway, its an ugly problem and one I really want to contain inside the
> > insanity that created it (virt), lets not taint the rest of the kernel
> > more than we need to. 
> 
> Agreed.  Though given that postgres and others use userspace spinlocks,
> maybe it's not just virt.
> 
> -- 
> error compiling committee.c: too many arguments to function
Peter Zijlstra Sept. 26, 2012, 1:26 p.m. UTC | #10
On Wed, 2012-09-26 at 15:20 +0200, Andrew Jones wrote:
> Wouldn't a clean solution be to promote a task's scheduler
> class to the spinner class when we PLE (or come from some special
> syscall
> for userspace spinlocks?)? 

Userspace spinlocks are typically employed to avoid syscalls..

> That class would be higher priority than the
> fair class and would schedule in FIFO order, but it would only run its
> tasks for short periods before switching. 

Since lock hold times aren't limited, esp. for things like userspace
'spin' locks, you've got a very good denial of service / opportunity for
abuse right there.


Andrew Jones Sept. 26, 2012, 1:39 p.m. UTC | #11
On Wed, Sep 26, 2012 at 03:26:11PM +0200, Peter Zijlstra wrote:
> On Wed, 2012-09-26 at 15:20 +0200, Andrew Jones wrote:
> > Wouldn't a clean solution be to promote a task's scheduler
> > class to the spinner class when we PLE (or come from some special
> > syscall
> > for userspace spinlocks?)? 
> 
> Userspace spinlocks are typically employed to avoid syscalls..

I'm guessing there could be a slow path: spin N times, then give up and
yield.

> 
> > That class would be higher priority than the
> > fair class and would schedule in FIFO order, but it would only run its
> > tasks for short periods before switching. 
> 
> Since lock hold times aren't limited, esp. for things like userspace
> 'spin' locks, you've got a very good denial of service / opportunity for
> abuse right there.

Maybe add some throttling to avoid overuse/maliciousness?

> 
> 
Peter Zijlstra Sept. 26, 2012, 1:45 p.m. UTC | #12
On Wed, 2012-09-26 at 15:39 +0200, Andrew Jones wrote:
> On Wed, Sep 26, 2012 at 03:26:11PM +0200, Peter Zijlstra wrote:
> > On Wed, 2012-09-26 at 15:20 +0200, Andrew Jones wrote:
> > > Wouldn't a clean solution be to promote a task's scheduler
> > > class to the spinner class when we PLE (or come from some special
> > > syscall
> > > for userspace spinlocks?)? 
> > 
> > Userspace spinlocks are typically employed to avoid syscalls..
> 
> I'm guessing there could be a slow path - spin N times and then give
> up and yield.

Much better that they do a blocking futex call or so; once you do the
syscall you're in kernel space anyway and have paid the transition cost.
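The spin-then-block pattern under discussion can be sketched as a minimal Linux futex lock. This is a simplified sketch, not a production lock (cf. Drepper's "Futexes Are Tricky" for the subtleties), and the spin limit is an arbitrary assumption:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SPIN_LIMIT 100   /* arbitrary bound on the user-space fast path */

static long futex(atomic_int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Lock states: 0 = unlocked, 1 = locked, 2 = locked, maybe waiters. */
static void spin_then_block_lock(atomic_int *f)
{
    for (int i = 0; i < SPIN_LIMIT; i++) {  /* fast path: no syscall */
        int expected = 0;
        if (atomic_compare_exchange_strong(f, &expected, 1))
            return;
    }
    /* Slow path: we pay the syscall anyway, so sleep in the kernel
     * instead of continuing to burn CPU. */
    while (atomic_exchange(f, 2) != 0)
        futex(f, FUTEX_WAIT, 2);
}

static void spin_then_block_unlock(atomic_int *f)
{
    if (atomic_exchange(f, 0) == 2)         /* a waiter may be asleep */
        futex(f, FUTEX_WAKE, 1);
}
```

The point of the three-state encoding is that the uncontended unlock stays syscall-free: FUTEX_WAKE is issued only when some locker actually went through the slow path.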

> > 
> > > That class would be higher priority than the
> > > fair class and would schedule in FIFO order, but it would only run its
> > > tasks for short periods before switching. 
> > 
> > Since lock hold times aren't limited, esp. for things like userspace
> > 'spin' locks, you've got a very good denial of service / opportunity for
> > abuse right there.
> 
> Maybe add some throttling to avoid overuse/maliciousness?

At which point you're pretty much back to where you started.

A much better approach is using things like priority inheritance, which
can be extended to cover the fair class just fine..

Also note that user-space spinning is inherently prone to live-locks
when combined with the static priority RT scheduling classes.

In general it's a very bad idea..
Avi Kivity Sept. 27, 2012, 8:36 a.m. UTC | #13
On 09/25/2012 03:40 PM, Raghavendra K T wrote:
> On 09/24/2012 07:46 PM, Raghavendra K T wrote:
>> On 09/24/2012 07:24 PM, Peter Zijlstra wrote:
>>> On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
>>>> However Rik had a genuine concern in the cases where runqueue is not
>>>> equally distributed and lockholder might actually be on a different run
>>>> queue but not running.
>>>
>>> Load should eventually get distributed equally -- that's what the
>>> load-balancer is for -- so this is a temporary situation.
>>>
>>> We already try and favour the non running vcpu in this case, that's what
>>> yield_to_task_fair() is about. If its still not eligible to run, tough
>>> luck.
>>
>> Yes, I agree.
>>
>>>
>>>> Do you think instead of using rq->nr_running, we could get a global
>>>> sense of load using avenrun (something like avenrun/num_onlinecpus)
>>>
>>> To what purpose? Also, global stuff is expensive, so you should try and
>>> stay away from it as hard as you possibly can.
>>
>> Yes, that concern only had made me to fall back to rq->nr_running.
>>
>> Will come back with the result soon.
> 
> Got the result with the patches:
> So here is the result,
> 
> Tried this on a 32 core ple box with HT disabled. 32 guest vcpus with
> 1x and 2x overcommits
> 
> Base = 3.6.0-rc5 + ple handler optimization patches
> A = Base + checking rq_running in vcpu_on_spin() patch
> B = Base + checking rq->nr_running in sched/core
> C = Base - PLE
> 
> ---+-----------+-----------+-----------+-----------+
>    |    Ebizzy result (rec/sec higher is better)   |
> ---+-----------+-----------+-----------+-----------+
>    |    Base   |     A     |      B    |     C     |
> ---+-----------+-----------+-----------+-----------+
> 1x | 2374.1250 | 7273.7500 | 5690.8750 |  7364.3750|
> 2x | 2536.2500 | 2458.5000 | 2426.3750 |    48.5000|
> ---+-----------+-----------+-----------+-----------+
> 
>    % improvements w.r.t BASE
> ---+------------+------------+------------+
>    |      A     |    B       |     C      |
> ---+------------+------------+------------+
> 1x | 206.37603  |  139.70410 |  210.19323 |
> 2x | -3.06555   |  -4.33218  |  -98.08773 |
> ---+------------+------------+------------+
> 
> we are getting the benefit of almost PLE disabled case with this
> approach. With patch B, we have dropped a bit in gain.
> (because we still would iterate vcpus until we decide to do a directed
> yield).

This gives us a good case for tracking preemption on a per-vm basis.  As
long as we aren't preempted, we can keep the PLE window high, and also
return immediately from the handler without looking for candidates.
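A hedged sketch of that idea: per-VM preemption tracking driving both the window and an early return. All structure and function names below (kvm_sim, vm_sched_out, ...) are invented stand-ins, not existing KVM code:

```c
/* Invented stand-ins; not existing KVM structures or notifier hooks. */
struct kvm_sim {
	int preempted_count;	/* vcpus of this VM currently preempted */
	unsigned int ple_window;
};

#define PLE_WINDOW_DEFAULT 4096
#define PLE_WINDOW_HIGH    65536

/* Would be driven by per-vcpu sched-out/sched-in preempt notifiers. */
static void vm_sched_out(struct kvm_sim *kvm) { kvm->preempted_count++; }
static void vm_sched_in(struct kvm_sim *kvm)  { kvm->preempted_count--; }

/*
 * PLE handler: if no vcpu of this VM has been preempted, the spin
 * cannot be waiting on a preempted lock holder, so keep the window
 * high and return straight to the guest without scanning for yield
 * candidates.  Returns 1 if a candidate scan was attempted.
 */
static int ple_handler(struct kvm_sim *kvm)
{
	if (kvm->preempted_count == 0) {
		kvm->ple_window = PLE_WINDOW_HIGH;
		return 0;
	}
	kvm->ple_window = PLE_WINDOW_DEFAULT;
	/* ... iterate over vcpus and try a directed yield here ... */
	return 1;
}
```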
Raghavendra K T Sept. 27, 2012, 10:21 a.m. UTC | #14
On 09/26/2012 06:27 PM, Andrew Jones wrote:
> On Mon, Sep 24, 2012 at 02:36:05PM +0200, Peter Zijlstra wrote:
>> On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote:
>>> On 09/24/2012 05:04 PM, Peter Zijlstra wrote:
>>>> On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:
>>>>> In some special scenarios like #vcpu<= #pcpu, PLE handler may
>>>>> prove very costly, because there is no need to iterate over vcpus
>>>>> and do unsuccessful yield_to burning CPU.
>>>>
>>>> What's the costly thing? The vm-exit, the yield (which should be a nop
>>>> if its the only task there) or something else entirely?
>>>>
>>> Both vmexit and yield_to() actually,
>>>
>>> because unsuccessful yield_to() overall is costly in PLE handler.
>>>
>>> This is because when we have large guests, say 32/16 vcpus, and one
>>> vcpu is holding lock, rest of the vcpus waiting for the lock, when they
>>> do PL-exit, each of the vcpu try to iterate over rest of vcpu list in
>>> the VM and try to do directed yield (unsuccessful). (O(n^2) tries).
>>>
>>> this results is fairly high amount of cpu burning and double run queue
>>> lock contention.
>>>
>>> (if they were spinning probably lock progress would have been faster).
>>> As Avi/Chegu Vinod had felt it is better to avoid vmexit itself, which
>>> seems little complex to achieve currently.
>>
>> OK, so the vmexit stays and we need to improve yield_to.
>
> Can't we do this check sooner as well, as it only requires per-cpu data?
> If we do it way back in kvm_vcpu_on_spin, then we avoid get_pid_task()
> and a bunch of read barriers from kvm_for_each_vcpu. Also, moving the test
> into kvm code would allow us to do other kvm things as a result of the
> check in order to avoid some vmexits. It looks like we should be able to
> avoid some without much complexity by just making a per-vm ple_window
> variable, and then, when we hit the nr_running == 1 condition, also doing
> vmcs_write32(PLE_WINDOW, (kvm->ple_window += PLE_WINDOW_BUMP))
> Reset the window to the default value when we successfully yield (and
> maybe we should limit the number of bumps).

We indeed checked early in the original undercommit patch, and it gave
results closer to the PLE-disabled case. But I agree with Peter that it
is ugly to export nr_running info to the PLE handler.

Looking at the result and comparing result of A and C,
> Base = 3.6.0-rc5 + ple handler optimization patches
> A = Base + checking rq_running in vcpu_on_spin() patch
> B = Base + checking rq->nr_running in sched/core
> C = Base - PLE
>
>    % improvements w.r.t BASE
> ---+------------+------------+------------+
>    |      A     |    B       |     C      |
> ---+------------+------------+------------+
> 1x | 206.37603  |  139.70410 |  210.19323 |

I have a feeling that the vmexit itself does not cause significant
overhead compared to iterating over vcpus in the PLE handler. Does that
sound right?

But
> vmcs_write32(PLE_WINDOW, (kvm->ple_window += PLE_WINDOW_BUMP))

is worth trying. I will have to see it eventually.
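A small sketch of the bump/reset logic proposed above; only the PLE_WINDOW field and PLE_WINDOW_BUMP come from the mail, while the clamp value and helper names are assumptions:

```c
/* Only PLE_WINDOW semantics and PLE_WINDOW_BUMP come from the mail;
 * the clamp value and helper names are assumptions. */
#define PLE_WINDOW_DEFAULT 4096
#define PLE_WINDOW_BUMP    4096
#define PLE_WINDOW_MAX     (16 * PLE_WINDOW_DEFAULT)	/* limit the bumps */

struct kvm_sim {
	unsigned int ple_window;	/* per-VM, mirrored into the VMCS */
};

/* Undercommit path (nr_running == 1): widen the window so the guest
 * spins longer before taking the next PL exit. */
static unsigned int ple_window_bump(struct kvm_sim *kvm)
{
	kvm->ple_window += PLE_WINDOW_BUMP;
	if (kvm->ple_window > PLE_WINDOW_MAX)
		kvm->ple_window = PLE_WINDOW_MAX;
	/* real code: vmcs_write32(PLE_WINDOW, kvm->ple_window); */
	return kvm->ple_window;
}

/* Successful directed yield: contention is real, shrink back. */
static unsigned int ple_window_reset(struct kvm_sim *kvm)
{
	kvm->ple_window = PLE_WINDOW_DEFAULT;
	return kvm->ple_window;
}
```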





Raghavendra K T Sept. 27, 2012, 11:23 a.m. UTC | #15
On 09/27/2012 02:06 PM, Avi Kivity wrote:
> On 09/25/2012 03:40 PM, Raghavendra K T wrote:
>> On 09/24/2012 07:46 PM, Raghavendra K T wrote:
>>> On 09/24/2012 07:24 PM, Peter Zijlstra wrote:
>>>> On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
>>>>> However Rik had a genuine concern in the cases where runqueue is not
>>>>> equally distributed and lockholder might actually be on a different run
>>>>> queue but not running.
>>>>
>>>> Load should eventually get distributed equally -- that's what the
>>>> load-balancer is for -- so this is a temporary situation.
>>>>
>>>> We already try and favour the non running vcpu in this case, that's what
>>>> yield_to_task_fair() is about. If its still not eligible to run, tough
>>>> luck.
>>>
>>> Yes, I agree.
>>>
>>>>
>>>>> Do you think instead of using rq->nr_running, we could get a global
>>>>> sense of load using avenrun (something like avenrun/num_onlinecpus)
>>>>
>>>> To what purpose? Also, global stuff is expensive, so you should try and
>>>> stay away from it as hard as you possibly can.
>>>
>>> Yes, that concern only had made me to fall back to rq->nr_running.
>>>
>>> Will come back with the result soon.
>>
>> Got the result with the patches:
>> So here is the result,
>>
>> Tried this on a 32 core ple box with HT disabled. 32 guest vcpus with
>> 1x and 2x overcommits
>>
>> Base = 3.6.0-rc5 + ple handler optimization patches
>> A = Base + checking rq_running in vcpu_on_spin() patch
>> B = Base + checking rq->nr_running in sched/core
>> C = Base - PLE
>>
>> ---+-----------+-----------+-----------+-----------+
>>     |    Ebizzy result (rec/sec higher is better)   |
>> ---+-----------+-----------+-----------+-----------+
>>     |    Base   |     A     |      B    |     C     |
>> ---+-----------+-----------+-----------+-----------+
>> 1x | 2374.1250 | 7273.7500 | 5690.8750 |  7364.3750|
>> 2x | 2536.2500 | 2458.5000 | 2426.3750 |    48.5000|
>> ---+-----------+-----------+-----------+-----------+
>>
>>     % improvements w.r.t BASE
>> ---+------------+------------+------------+
>>     |      A     |    B       |     C      |
>> ---+------------+------------+------------+
>> 1x | 206.37603  |  139.70410 |  210.19323 |
>> 2x | -3.06555   |  -4.33218  |  -98.08773 |
>> ---+------------+------------+------------+
>>
>> we are getting the benefit of almost PLE disabled case with this
>> approach. With patch B, we have dropped a bit in gain.
>> (because we still would iterate vcpus until we decide to do a directed
>> yield).
>
> This gives us a good case for tracking preemption on a per-vm basis.  As
> long as we aren't preempted, we can keep the PLE window high, and also
> return immediately from the handler without looking for candidates.

1) So do you think the deferred-preemption patch (which Vatsa mentioned
long back) is also worth trying, so that we reduce the chance of
lock-holder preemption (LHP)?

IIRC, with deferred preemption:
we would have a hook in the spinlock/unlock path to measure the depth of
lock held, shared with the host scheduler (maybe via MSRs now).
The host scheduler then 'prefers' not to preempt a lock-holding vcpu (or
rather gives it, say, one more chance).

2) Looking at the result (comparing A & C), I do feel we have
significant overhead in iterating over vcpus (even when compared to the
vmexit), so we would still need the undercommit fix suggested by PeterZ
(improving by 140%)?

So looking back at the threads/discussions so far, I am trying to
summarize them. I feel these, at least, are the potential candidates
to go in:

1) Avoiding double runqueue lock overhead  (Andrew Theurer/ PeterZ)
2) Dynamically changing PLE window (Avi/Andrew/Chegu)
3) preempt_notify handler to identify preempted VCPUs (Avi)
4) Avoiding iterating over VCPUs in undercommit scenario. (Raghu/PeterZ)
5) Avoiding unnecessary spinning in overcommit scenario (Raghu/Rik)
6) Pv spinlock
7) Jiannan's proposed improvements
8) Defer preemption patches

Did we miss anything (or added extra?)

So here are my action items:
- I plan to repost this series with what PeterZ and Rik suggested, along
with performance analysis.
- I'll go back and explore (3) and (6).

Please let me know.






Avi Kivity Sept. 27, 2012, 12:03 p.m. UTC | #16
On 09/27/2012 01:23 PM, Raghavendra K T wrote:
>>
>> This gives us a good case for tracking preemption on a per-vm basis.  As
>> long as we aren't preempted, we can keep the PLE window high, and also
>> return immediately from the handler without looking for candidates.
> 
> 1) So do you think, deferring preemption patch ( Vatsa was mentioning
> long back)  is also another thing worth trying, so we reduce the chance
> of LHP.

Yes, we have to keep it in mind.  It will be useful for fine-grained
locks, not so much for coarse locks or IPIs.

I would still of course prefer a PLE solution, but if we can't get it to
work we can consider preemption deferral.

> 
> IIRC, with defer preemption :
> we will have hook in spinlock/unlock path to measure depth of lock held,
> and shared with host scheduler (may be via MSRs now).
> Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather
> give say one chance.

A downside is that we have to do that even when undercommitted.

Also there may be a lot of false positives (deferred preemptions even
when there is no contention).

> 
> 2) looking at the result (comparing A & C) , I do feel we have
> significant in iterating over vcpus (when compared to even vmexit)
> so We still would need undercommit fix sugested by PeterZ (improving by
> 140%). ?

Looking only at the current runqueue?  My worry is that it misses a lot
of cases.  Maybe try the current runqueue first and then others.

Or were you referring to something else?

> 
> So looking back at threads/ discussions so far, I am trying to
> summarize, the discussions so far. I feel, at least here are the few
> potential candidates to go in:
> 
> 1) Avoiding double runqueue lock overhead  (Andrew Theurer/ PeterZ)
> 2) Dynamically changing PLE window (Avi/Andrew/Chegu)
> 3) preempt_notify handler to identify preempted VCPUs (Avi)
> 4) Avoiding iterating over VCPUs in undercommit scenario. (Raghu/PeterZ)
> 5) Avoiding unnecessary spinning in overcommit scenario (Raghu/Rik)
> 6) Pv spinlock
> 7) Jiannan's proposed improvements
> 8) Defer preemption patches
> 
> Did we miss anything (or added extra?)
> 
> So here are my action items:
> - I plan to repost this series with what PeterZ, Rik suggested with
> performance analysis.
> - I ll go back and explore on (3) and (6) ..
> 
> Please Let me know..

Undoubtedly we'll think of more stuff.  But this looks like a good start.
Andrew Theurer Sept. 27, 2012, 12:25 p.m. UTC | #17
On Thu, 2012-09-27 at 14:03 +0200, Avi Kivity wrote:
> On 09/27/2012 01:23 PM, Raghavendra K T wrote:
> >>
> >> This gives us a good case for tracking preemption on a per-vm basis.  As
> >> long as we aren't preempted, we can keep the PLE window high, and also
> >> return immediately from the handler without looking for candidates.
> > 
> > 1) So do you think, deferring preemption patch ( Vatsa was mentioning
> > long back)  is also another thing worth trying, so we reduce the chance
> > of LHP.
> 
> Yes, we have to keep it in mind.  It will be useful for fine grained
> locks, not so much so coarse locks or IPIs.
> 
> I would still of course prefer a PLE solution, but if we can't get it to
> work we can consider preemption deferral.
> 
> > 
> > IIRC, with defer preemption :
> > we will have hook in spinlock/unlock path to measure depth of lock held,
> > and shared with host scheduler (may be via MSRs now).
> > Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather
> > give say one chance.
> 
> A downside is that we have to do that even when undercommitted.
> 
> Also there may be a lot of false positives (deferred preemptions even
> when there is no contention).
> 
> > 
> > 2) looking at the result (comparing A & C) , I do feel we have
> > significant in iterating over vcpus (when compared to even vmexit)
> > so We still would need undercommit fix sugested by PeterZ (improving by
> > 140%). ?
> 
> Looking only at the current runqueue?  My worry is that it misses a lot
> of cases.  Maybe try the current runqueue first and then others.
> 
> Or were you referring to something else?
> 
> > 
> > So looking back at threads/ discussions so far, I am trying to
> > summarize, the discussions so far. I feel, at least here are the few
> > potential candidates to go in:
> > 
> > 1) Avoiding double runqueue lock overhead  (Andrew Theurer/ PeterZ)
> > 2) Dynamically changing PLE window (Avi/Andrew/Chegu)
> > 3) preempt_notify handler to identify preempted VCPUs (Avi)
> > 4) Avoiding iterating over VCPUs in undercommit scenario. (Raghu/PeterZ)
> > 5) Avoiding unnecessary spinning in overcommit scenario (Raghu/Rik)
> > 6) Pv spinlock
> > 7) Jiannan's proposed improvements
> > 8) Defer preemption patches
> > 
> > Did we miss anything (or added extra?)
> > 
> > So here are my action items:
> > - I plan to repost this series with what PeterZ, Rik suggested with
> > performance analysis.
> > - I ll go back and explore on (3) and (6) ..
> > 
> > Please Let me know..
> 
> Undoubtedly we'll think of more stuff.  But this looks like a good start.

9) lazy gang-like scheduling with PLE to cover the non-gang-like
exceptions  (/me runs and hides from scheduler folks)

-Andrew Theurer

Raghavendra K T Oct. 3, 2012, 2:29 p.m. UTC | #18
* Avi Kivity <avi@redhat.com> [2012-09-27 14:03:59]:

> On 09/27/2012 01:23 PM, Raghavendra K T wrote:
> >>
[...]
> > 2) looking at the result (comparing A & C) , I do feel we have
> > significant in iterating over vcpus (when compared to even vmexit)
> > so We still would need undercommit fix sugested by PeterZ (improving by
> > 140%). ?
> 
> Looking only at the current runqueue?  My worry is that it misses a lot
> of cases.  Maybe try the current runqueue first and then others.
> 

Okay. Do you mean we can have something like

+       if (rq->nr_running == 1 && p_rq->nr_running == 1) {
+               yielded = -ESRCH;
+               goto out_irq;
+       }

in the Peter's patch ?

(I thought a lot about && vs. ||; both seem to have their own cons.)
But that should apply only when we have a short-term imbalance, as
PeterZ said.

I am experimenting with all of these for the V2 patch. Will come back
with analysis and a patch.

> Or were you referring to something else?
> 

Avi Kivity Oct. 3, 2012, 5:25 p.m. UTC | #19
On 10/03/2012 04:29 PM, Raghavendra K T wrote:
> * Avi Kivity <avi@redhat.com> [2012-09-27 14:03:59]:
> 
>> On 09/27/2012 01:23 PM, Raghavendra K T wrote:
>> >>
> [...]
>> > 2) looking at the result (comparing A & C) , I do feel we have
>> > significant in iterating over vcpus (when compared to even vmexit)
>> > so We still would need undercommit fix sugested by PeterZ (improving by
>> > 140%). ?
>> 
>> Looking only at the current runqueue?  My worry is that it misses a lot
>> of cases.  Maybe try the current runqueue first and then others.
>> 
> 
> Okay. Do you mean we can have something like
> 
> +       if (rq->nr_running == 1 && p_rq->nr_running == 1) {
> +               yielded = -ESRCH;
> +               goto out_irq;
> +       }
> 
> in the Peter's patch ?
> 
> ( I thought lot about && or || . Both seem to have their own cons ).
> But that should be only when we have short term imbalance, as PeterZ
> told.

I'm missing the context.  What is p_rq?

What I mean was:

  if can_yield_to_process_in_current_rq
     do that
  else if can_yield_to_process_in_other_rq
     do that
  else
     return -ESRCH
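That selection order could look roughly like the sketch below, with simplified stand-in structures rather than real scheduler code (and note that later in the thread the same-runqueue preference is judged risky when all local vcpus are spinners):

```c
#include <stddef.h>
#include <errno.h>

/* Simplified stand-ins for a task and its runqueue membership. */
struct task {
	int rq_id;	/* which runqueue the task sits on */
	int yieldable;	/* hypothetical "can be boosted" predicate */
};

static int can_yield_to(const struct task *t)
{
	return t && t->yieldable;
}

/*
 * Pick a yield target, preferring a candidate on the current runqueue
 * (cheap: no double runqueue lock) over one on a remote runqueue, and
 * returning -ESRCH when there is nobody worth yielding to at all.
 */
static int pick_yield_target(struct task **cands, int n, int cur_rq,
			     struct task **out)
{
	int i;

	for (i = 0; i < n; i++)		/* pass 1: same runqueue */
		if (cands[i]->rq_id == cur_rq && can_yield_to(cands[i])) {
			*out = cands[i];
			return 0;
		}
	for (i = 0; i < n; i++)		/* pass 2: any runqueue */
		if (can_yield_to(cands[i])) {
			*out = cands[i];
			return 0;
		}
	*out = NULL;
	return -ESRCH;
}
```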
Raghavendra K T Oct. 4, 2012, 10:56 a.m. UTC | #20
On 10/03/2012 10:55 PM, Avi Kivity wrote:
> On 10/03/2012 04:29 PM, Raghavendra K T wrote:
>> * Avi Kivity <avi@redhat.com> [2012-09-27 14:03:59]:
>>
>>> On 09/27/2012 01:23 PM, Raghavendra K T wrote:
>>>>>
>> [...]
>>>> 2) looking at the result (comparing A & C) , I do feel we have
>>>> significant in iterating over vcpus (when compared to even vmexit)
>>>> so We still would need undercommit fix sugested by PeterZ (improving by
>>>> 140%). ?
>>>
>>> Looking only at the current runqueue?  My worry is that it misses a lot
>>> of cases.  Maybe try the current runqueue first and then others.
>>>
>>
>> Okay. Do you mean we can have something like
>>
>> +       if (rq->nr_running == 1 && p_rq->nr_running == 1) {
>> +               yielded = -ESRCH;
>> +               goto out_irq;
>> +       }
>>
>> in the Peter's patch ?
>>
>> ( I thought lot about && or || . Both seem to have their own cons ).
>> But that should be only when we have short term imbalance, as PeterZ
>> told.
>
> I'm missing the context.  What is p_rq?

p_rq is the run queue of the target vcpu.
What I was trying below was to address Rik's concern. Suppose the
rq of the source vcpu has one task, but the target rq has two tasks,
with an eligible vcpu waiting to be scheduled.

>
> What I mean was:
>
>    if can_yield_to_process_in_current_rq
>       do that
>    else if can_yield_to_process_in_other_rq
>       do that
>    else
>       return -ESRCH

I think you are saying we have to check the run queue of the
source vcpu, and if we have a vcpu belonging to the same VM there, try
yielding to that, ignoring whatever target vcpu we received for
yield_to?

Or is it that kvm_vcpu_yield_to should now check the vcpus of the same
VM belonging to the same run queue first, and if we don't succeed, go
again for a vcpu on a different runqueue?
Does that add more overhead, especially in the <= 1x scenario?

Avi Kivity Oct. 4, 2012, 12:44 p.m. UTC | #21
On 10/04/2012 12:56 PM, Raghavendra K T wrote:
> On 10/03/2012 10:55 PM, Avi Kivity wrote:
>> On 10/03/2012 04:29 PM, Raghavendra K T wrote:
>>> * Avi Kivity <avi@redhat.com> [2012-09-27 14:03:59]:
>>>
>>>> On 09/27/2012 01:23 PM, Raghavendra K T wrote:
>>>>>>
>>> [...]
>>>>> 2) looking at the result (comparing A & C) , I do feel we have
>>>>> significant in iterating over vcpus (when compared to even vmexit)
>>>>> so We still would need undercommit fix sugested by PeterZ
>>>>> (improving by
>>>>> 140%). ?
>>>>
>>>> Looking only at the current runqueue?  My worry is that it misses a lot
>>>> of cases.  Maybe try the current runqueue first and then others.
>>>>
>>>
>>> Okay. Do you mean we can have something like
>>>
>>> +       if (rq->nr_running == 1 && p_rq->nr_running == 1) {
>>> +               yielded = -ESRCH;
>>> +               goto out_irq;
>>> +       }
>>>
>>> in the Peter's patch ?
>>>
>>> ( I thought lot about && or || . Both seem to have their own cons ).
>>> But that should be only when we have short term imbalance, as PeterZ
>>> told.
>>
>> I'm missing the context.  What is p_rq?
> 
> p_rq is the run queue of target vcpu.
> What I was trying below was to address Rik concern. Suppose
> rq of source vcpu has one task, but target probably has two task,
> with a eligible vcpu waiting to be scheduled.
> 
>>
>> What I mean was:
>>
>>    if can_yield_to_process_in_current_rq
>>       do that
>>    else if can_yield_to_process_in_other_rq
>>       do that
>>    else
>>       return -ESRCH
> 
> I think you are saying we have to check the run queue of the
> source vcpu, if we have a vcpu belonging to same VM and try yield to
> that? ignoring whatever the target vcpu we received for yield_to.
> 
> Or is it that kvm_vcpu_yield_to should now check the vcpus of same vm
> belonging to same run queue first. If we don't succeed, go again for
> a vcpu in different runqueue.

Right.  Prioritize vcpus that are cheap to yield to.  But may return bad
results if all vcpus on the current runqueue are spinners, so probably
not a good idea.

> Does it add more overhead especially in <= 1x scenario?

The current runqueue should have just our vcpu in that case, so low
overhead.  But it's a bad idea due to the above scenario.
Raghavendra K T Oct. 5, 2012, 9:04 a.m. UTC | #22
On 10/04/2012 06:14 PM, Avi Kivity wrote:
> On 10/04/2012 12:56 PM, Raghavendra K T wrote:
>> On 10/03/2012 10:55 PM, Avi Kivity wrote:
>>> On 10/03/2012 04:29 PM, Raghavendra K T wrote:
>>>> * Avi Kivity <avi@redhat.com> [2012-09-27 14:03:59]:
>>>>
>>>>> On 09/27/2012 01:23 PM, Raghavendra K T wrote:
>>>>>>>
>>>> [...]
>>>>>> 2) looking at the result (comparing A & C) , I do feel we have
>>>>>> significant in iterating over vcpus (when compared to even vmexit)
>>>>>> so We still would need undercommit fix sugested by PeterZ
>>>>>> (improving by
>>>>>> 140%). ?
>>>>>
>>>>> Looking only at the current runqueue?  My worry is that it misses a lot
>>>>> of cases.  Maybe try the current runqueue first and then others.
>>>>>
>>>>
>>>> Okay. Do you mean we can have something like
>>>>
>>>> +       if (rq->nr_running == 1 && p_rq->nr_running == 1) {
>>>> +               yielded = -ESRCH;
>>>> +               goto out_irq;
>>>> +       }
>>>>
>>>> in the Peter's patch ?
>>>>
>>>> ( I thought lot about && or || . Both seem to have their own cons ).
>>>> But that should be only when we have short term imbalance, as PeterZ
>>>> told.
>>>
>>> I'm missing the context.  What is p_rq?
>>
>> p_rq is the run queue of target vcpu.
>> What I was trying below was to address Rik concern. Suppose
>> rq of source vcpu has one task, but target probably has two task,
>> with a eligible vcpu waiting to be scheduled.
>>
>>>
>>> What I mean was:
>>>
>>>     if can_yield_to_process_in_current_rq
>>>        do that
>>>     else if can_yield_to_process_in_other_rq
>>>        do that
>>>     else
>>>        return -ESRCH
>>
>> I think you are saying we have to check the run queue of the
>> source vcpu, if we have a vcpu belonging to same VM and try yield to
>> that? ignoring whatever the target vcpu we received for yield_to.
>>
>> Or is it that kvm_vcpu_yield_to should now check the vcpus of same vm
>> belonging to same run queue first. If we don't succeed, go again for
>> a vcpu in different runqueue.
>
> Right.  Prioritize vcpus that are cheap to yield to.  But may return bad
> results if all vcpus on the current runqueue are spinners, so probably
> not a good idea.

Okay, I'll drop the vcpu-from-same-rq idea now.

>
>> Does it add more overhead especially in <= 1x scenario?
>
> The current runqueue should have just our vcpu in that case, so low
> overhead.  But it's a bad idea due to the above scenario.
>

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b38f00e..5d5b355 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4272,7 +4272,10 @@  EXPORT_SYMBOL(yield);
  * It's the caller's job to ensure that the target task struct
  * can't go away on us before we can do any checks.
  *
- * Returns true if we indeed boosted the target task.
+ * Returns:
+ *   true (>0) if we indeed boosted the target task.
+ *   false (0) if we failed to boost the target.
+ *   -ESRCH if there's no task to yield to.
  */
 bool __sched yield_to(struct task_struct *p, bool preempt)
 {
@@ -4284,6 +4287,15 @@  bool __sched yield_to(struct task_struct *p, bool preempt)
 	local_irq_save(flags);
 	rq = this_rq();
 
+	/*
+	 * If we're the only runnable task on the rq, there's absolutely no
+	 * point in yielding.
+	 */
+	if (rq->nr_running == 1) {
+		yielded = -ESRCH;
+		goto out_irq;
+	}
+
 again:
 	p_rq = task_rq(p);
 	double_rq_lock(rq, p_rq);
@@ -4293,13 +4305,13 @@  bool __sched yield_to(struct task_struct *p, bool preempt)
 	}
 
 	if (!curr->sched_class->yield_to_task)
-		goto out;
+		goto out_unlock;
 
 	if (curr->sched_class != p->sched_class)
-		goto out;
+		goto out_unlock;
 
 	if (task_running(p_rq, p) || p->state)
-		goto out;
+		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4312,11 +4324,12 @@  bool __sched yield_to(struct task_struct *p, bool preempt)
 			resched_task(p_rq->curr);
 	}
 
-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);
 
-	if (yielded)
+	if (yielded > 0)
 		schedule();
 
 	return yielded;
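A simplified sketch of how a for-each-vcpu caller (such as kvm_vcpu_on_spin()) might consume this tri-state return, breaking out of the whole scan on -ESRCH instead of trying every remaining vcpu; the loop and helper below are stand-ins, not the actual KVM code:

```c
#include <errno.h>

/* Stand-in for the patched yield_to():
 *   > 0   boosted the target
 *   0     failed to boost this particular target
 *   -ESRCH  sole runnable task on the rq, yielding is pointless
 */
static int sim_yield_to(int rq_nr_running, int target_ok)
{
	if (rq_nr_running == 1)
		return -ESRCH;
	return target_ok ? 1 : 0;
}

/* Returns how many candidates were examined before stopping. */
static int scan_vcpus(int rq_nr_running, const int *targets, int n)
{
	int i, ret, tried = 0;

	for (i = 0; i < n; i++) {
		tried++;
		ret = sim_yield_to(rq_nr_running, targets[i]);
		if (ret > 0)
			break;	/* boosted someone: done */
		if (ret == -ESRCH)
			break;	/* undercommit: go straight back into the VM */
		/* ret == 0: this target wasn't eligible, try the next vcpu */
	}
	return tried;
}
```

In the undercommit case the scan stops after the very first attempt, which is exactly the O(n^2) iteration cost the thread is trying to avoid.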